Once upon a time, when things were slow at work and I was reading the Great Microprocessors page, I was inspired to do some CPU design. We've seen CISC, and then RISC, and even OISC, but how about NISC (no instructions)?
Let's say we have a processor that's made up of modules. Each module is mapped internally into a 256-wide address space (think of them as registers). The basic module is a copy device; it reads a value from one port and writes it to another.
Instructions would be two bytes wide: the address of the source port, followed by the address of the destination port.
There would be various modules available. Occupying ports $0 to $F would be a simple device that merely returns the port number when read. Ports $10 to $1F would be latches; they return the last item written. Ports $20 to $27 would be shift registers, divided into four pairs. Each pair would consist of a data register and a reset register. Accessing the reset register zeroes the data register. Reading the data register would return the current value, but writing to it would shift the contents four bits left and OR in the value written.
So, to load a constant $12A5 into latch #0, you'd use the following code:
    00 21    reset shift reg #0
    01 20    shift reg = 00000001
    02 20    shift reg = 00000012
    0A 20    shift reg = 0000012A
    05 20    shift reg = 000012A5
    20 10    copy shift reg into latch
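To check that the listing above does what it claims, here is a minimal simulator sketch of ports $00 to $27 only. It is an illustration, not part of the design: the class name `Machine`, the 32-bit register width (guessed from the eight-digit values in the comments), the placement of the sixteen latches at $10 to $1F, and the choice to ignore writes to the constant ports are all assumptions.

```python
MASK = 0xFFFFFFFF  # assumption: 32-bit registers, as the 8-digit values suggest


class Machine:
    """Toy model of ports $00-$27 only: constants, latches, shift registers."""

    def __init__(self):
        self.latches = [0] * 16  # assumed to sit at $10-$1F
        self.shift = [0] * 4     # data registers at $20, $22, $24, $26

    def read(self, port):
        if 0x00 <= port <= 0x0F:
            return port  # constant ports return their own number
        if 0x10 <= port <= 0x1F:
            return self.latches[port - 0x10]
        if 0x20 <= port <= 0x27:
            index, is_reset = divmod(port - 0x20, 2)
            if is_reset:
                self.shift[index] = 0  # any access to a reset port clears the data register
                return 0
            return self.shift[index]
        raise ValueError(f"port ${port:02X} is not modelled")

    def write(self, port, value):
        if 0x00 <= port <= 0x0F:
            return  # assumption: writes to constant ports are ignored
        if 0x10 <= port <= 0x1F:
            self.latches[port - 0x10] = value & MASK
        elif 0x20 <= port <= 0x27:
            index, is_reset = divmod(port - 0x20, 2)
            if is_reset:
                self.shift[index] = 0
            else:
                # shift four bits left, then OR in the value written
                self.shift[index] = ((self.shift[index] << 4) | value) & MASK
        else:
            raise ValueError(f"port ${port:02X} is not modelled")

    def run(self, code):
        # each instruction is two bytes: source port, then destination port
        for i in range(0, len(code) - 1, 2):
            self.write(code[i + 1], self.read(code[i]))


machine = Machine()
machine.run(bytes([0x00, 0x21, 0x01, 0x20, 0x02, 0x20,
                   0x0A, 0x20, 0x05, 0x20, 0x20, 0x10]))
print(hex(machine.latches[0]))  # prints 0x12a5
```

The whole "instruction set" lives in `read` and `write`; `run` does nothing but copy.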
Similarly, $28 and up would be an ALU organised the same sort of way.
    28    data register
    29    read: invert data reg     write: add to data reg
    2A    read: negate data reg     write: sub from data reg
    2B                              write: shift data reg left
    2C                              write: shift data reg right
    2D    access byte 0 of data reg
    2E    access byte 1 of data reg
    2F    access byte 2 of data reg
    30    access byte 3 of data reg
    ...
r0 = r0 + 5
would be:
    10 28    copy r0 into accumulator
    05 29    add 5
    28 10    copy back to latch
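The same sequence can be traced with a small sketch of the ALU ports. Only $28 to $2A are modelled here, because the text does not say how far the shift ports shift. The class name `AluMachine`, the 32-bit wrap-around, and the latch placement at $10 to $1F are assumptions made for the illustration.

```python
MASK = 0xFFFFFFFF  # assumption: 32-bit data register with wrap-around


class AluMachine:
    """Toy model of constants ($00-$0F), latches ($10-$1F) and ALU ports $28-$2A."""

    def __init__(self):
        self.latches = [0] * 16
        self.acc = 0  # the ALU data register at $28

    def read(self, port):
        if 0x00 <= port <= 0x0F:
            return port                  # constant ports
        if 0x10 <= port <= 0x1F:
            return self.latches[port - 0x10]
        if port == 0x28:
            return self.acc
        if port == 0x29:
            return ~self.acc & MASK      # read: invert data reg
        if port == 0x2A:
            return -self.acc & MASK      # read: negate data reg
        raise ValueError(f"port ${port:02X} is not modelled")

    def write(self, port, value):
        if 0x10 <= port <= 0x1F:
            self.latches[port - 0x10] = value & MASK
        elif port == 0x28:
            self.acc = value & MASK
        elif port == 0x29:
            self.acc = (self.acc + value) & MASK   # write: add to data reg
        elif port == 0x2A:
            self.acc = (self.acc - value) & MASK   # write: sub from data reg
        else:
            raise ValueError(f"port ${port:02X} is not modelled")

    def run(self, code):
        # each instruction is two bytes: source port, then destination port
        for i in range(0, len(code) - 1, 2):
            self.write(code[i + 1], self.read(code[i]))


alu = AluMachine()
alu.latches[0] = 7                                   # r0 = 7
alu.run(bytes([0x10, 0x28, 0x05, 0x29, 0x28, 0x10]))  # r0 = r0 + 5
print(alu.latches[0])  # prints 12
```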
You might be able to improve efficiency by making $28 give the latch number to use as a data register, but that would require the ALU to have behind-the-scenes access to the latch module.
Memory access is done by modules mapped in at $80 to $FF. There's also a control bank in $78 to $7F. Each register in the control bank gives the base address of a 16-register bank. Each bank forms a window into memory.
So:
r0 = [$1234] + 5
would be:
    00 21    reset shift reg #0
    01 20    shift reg = 00000001
    02 20    shift reg = 00000012
    03 20    shift reg = 00000123
    04 20    shift reg = 00001234
    20 78    set bank 0 base to $1234
    80 28    read [$1234] from mem into accumulator
    05 29    add 5
    28 10    copy into latch
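The address translation behind the memory windows might look like the sketch below. The function name `window_address` is made up for the illustration, and it assumes that the offset within a window is counted in the same units as the base address (one memory word per port); the text does not pin that down.

```python
def window_address(port, bases):
    """Translate a port in $80-$FF into a memory address.

    bases holds the eight control registers $78-$7F; each one is the base
    address of a 16-port window. Assumption: one memory word per port.
    """
    if not 0x80 <= port <= 0xFF:
        raise ValueError(f"port ${port:02X} is not a memory window port")
    bank, offset = divmod(port - 0x80, 16)
    return bases[bank] + offset


bases = [0] * 8
bases[0] = 0x1234                         # what "20 78" sets up in the listing
print(hex(window_address(0x80, bases)))   # prints 0x1234
print(hex(window_address(0x8F, bases)))   # prints 0x1243
```

Ports $80 to $8F all go through the base in $78, which is why a structure of up to sixteen words needs only one control-bank update.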
Actually finding the code to execute is done by a module in $70 to $77:
    70    Program counter
    71    Write: add to 70
    72    New program counter
    73    Write: if 0, copy 72 to 70
    74    Write: if non-0, copy 72 to 70
    75    Write: if 0, add 72 to 70
    76    Write: if non-0, add 72 to 70
    77    Instruction latch
This is actually part of the memory access module, as it has internal access to bank $F0 to read instructions. When you write a value into $70, it updates the control register $7F to point to the nearest 16-quad boundary and reads the current quadword into $77. Then, every clock tick, the processor core will execute the bottom two bytes of $77 and shift it right two bytes, incrementing the program counter accordingly. When bit 1 of the program counter is 0, a new quad is loaded into $77. When the bottom six bits are zero (just passed a 16-quad boundary) it will update $7F.
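The bookkeeping in that paragraph comes down to splitting the program counter into three fields. The sketch below assumes a byte-addressed program counter, a quad of four bytes (two instructions), and that "nearest" boundary means rounding down; the function name `fetch_position` is invented for the illustration.

```python
QUAD_BYTES = 4     # assumption: a quad is four bytes, i.e. two instructions
WINDOW_QUADS = 16  # the instruction window at $F0 covers 16 quads


def fetch_position(pc):
    """Split a byte-addressed program counter into (window base, quad, slot).

    window base: value the control register $7F would hold
    quad:        which of the 16 quads in the window is in the latch $77
    slot:        0 for the first instruction in the quad, 1 for the second
    """
    window_bytes = QUAD_BYTES * WINDOW_QUADS      # 64 bytes, hence six low bits
    base = pc - pc % window_bytes
    quad = (pc % window_bytes) // QUAD_BYTES
    slot = (pc % QUAD_BYTES) // 2
    return base, quad, slot


base, quad, slot = fetch_position(0x1236)
print(hex(base), quad, slot)  # prints 0x1200 13 1
```

A new quad is needed whenever `slot` comes back as 0, and $7F needs updating whenever `quad` and `slot` are both 0.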
Flow of execution is done by writing the desired new PC or offset to $72 and then writing a data value to $73-$76 according to the action to be taken.
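The write side of ports $70 to $76 can be sketched as a small state machine. The class name `BranchUnit` is made up. Instruction fetch, and the question of whether the program counter has already been incremented when a write lands, are deliberately left out because the text does not settle them.

```python
class BranchUnit:
    """Toy model of the write behaviour of ports $70-$76."""

    def __init__(self, pc=0):
        self.pc = pc      # $70: program counter
        self.target = 0   # $72: new program counter or offset

    def write(self, port, value):
        if port == 0x70:
            self.pc = value
        elif port == 0x71:
            self.pc += value                  # unconditional relative jump
        elif port == 0x72:
            self.target = value
        elif port == 0x73:
            if value == 0:
                self.pc = self.target         # if 0, absolute jump
        elif port == 0x74:
            if value != 0:
                self.pc = self.target         # if non-0, absolute jump
        elif port == 0x75:
            if value == 0:
                self.pc += self.target        # if 0, relative jump
        elif port == 0x76:
            if value != 0:
                self.pc += self.target        # if non-0, relative jump
        else:
            raise ValueError(f"port ${port:02X} is not modelled")


unit = BranchUnit(pc=0x100)
unit.write(0x72, 6)   # target = 6
unit.write(0x76, 0)   # data is zero, so no jump is taken
unit.write(0x76, 5)   # data is non-zero, so pc += 6
print(hex(unit.pc))   # prints 0x106
```

The branch condition is whatever value happens to be copied into the port, so any readable port can serve as a flag.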
if (r0 == 1) r0 = 0 else r0 = 1
. . . becomes:
    10 28    r0 -> accumulator
    01 2A    subtract one
    06 72    set target to six bytes ahead
    28 76    if non-zero, jump
    00 10    0 -> r0
    02 71    jump forward two bytes
    01 10    1 -> r0
Advantages: incredibly simple. Easily pipelinable (you can say that when using a hardware multiply module, the result is only available 32 clock ticks after triggering the operation; while this is in action you can do other things). Lends itself very nicely to a self-clocked architecture. Highly flexible (want more flexibility? another multiplier? a divider? DSP routines? simply plug in another module). Really nice memory access. Highly optimisable (you can sort of pipe instructions together using the accumulator; you'd only put a value in a latch if you wanted to keep it while performing some other operation). Aesthetically pleasing.
Disadvantages: awkward to program. These days, people use compilers, so that's not much of a problem. Code's a bit bloated, but not as much as I thought. On the ARM, the above example's twelve bytes (three instructions) as opposed to my fourteen, and that's using condition codes a lot. On the PowerPC you can push it down to twelve bytes if you try hard but gcc produces 32 bytes without optimisation. (With optimisation it's four bytes: xor r0, r0, 1. gcc is too clever for its own good sometimes.)
True random memory accessing is a bit slow, because you have to update the control bank, but you can access structures up to 16 quads long very easily. The biggest drawback, though, is that there's a huge amount of state. This would make quick context switching difficult. You'd need to save the state of all the latches, the ALU, the MMU, and all the other modules that may or may not be there; a huge amount of information, which would not be easy to extract.
In effect, you get 16 general purpose registers (the latches), an accumulator (the ALU data register), and seven pointer registers (the memory access module control registers).
Is this good, or what? (Everyone choruses: what!)