Once upon a time, when things were slow at work and I was reading the Great Microprocessors page, I was inspired to do some CPU design. We've seen CISC, and then RISC, and even OISC, but how about NISC (no instructions)?

Let's say we have a processor that's made up of modules. Each module is mapped internally into a 256-wide address space (think of them as registers). The basic module is a copy device; it reads a value from one port and writes it to another.

Instructions would be two bytes wide: the address of the source port, followed by the address of the destination port.

There would be various modules available. Occupying ports $0 to $F would be a simple device that merely returns the port number when read. Ports $10 to $1F would be latches; they return the last item written. Ports $20 to $27 would be shift registers, divided into four pairs. Each pair would consist of a data register and a reset register. Accessing the reset register zeroes the data register. Reading the data register would return the current value, but writing to it would shift the contents four bits left and OR in the value written.

So, to load a constant $12A5 into latch #0, you'd use the following code:

  00 21  reset shift reg #0
  01 20  shift reg = 00000001
  02 20  shift reg = 00000012
  0A 20  shift reg = 0000012A
  05 20  shift reg = 000012A5
  20 10  copy shift reg into latch
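
The nibble-at-a-time pattern above is mechanical enough to generate. Here is a minimal Python sketch, assuming the port numbers given so far (shift register #0 data at $20, its reset at $21, latches starting at $10) and that the constant ports $0 to $F supply each hex digit. The name load_constant and its parameters are made up for illustration.

```python
def load_constant(value, latch=0, digits=4):
    """Return (source, destination) byte pairs that load `value` into a latch.

    Assumes shift register #0 (data port 0x20, reset port 0x21) and that
    ports 0x00-0x0F read back their own port number.
    """
    program = [(0x00, 0x21)]                 # any access to 0x21 zeroes shift reg #0
    for i in reversed(range(digits)):
        nibble = (value >> (4 * i)) & 0xF
        program.append((nibble, 0x20))       # shift left four bits, OR in the nibble
    program.append((0x20, 0x10 + latch))     # copy the finished value into the latch
    return program


for src, dst in load_constant(0x12A5):
    print(f"{src:02X} {dst:02X}")
```

For $12A5 this emits the same six byte pairs as the listing above.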

Similarly, $28 and up would be an ALU organised in much the same way.

  28     data register
  29     read: invert data reg; write: add to data reg
  2A     read: negate data reg; write: sub from data reg
  2B     write: shift data reg left
  2C     write: shift data reg right
  2D     access byte 0 of data reg
  2E     access byte 1 of data reg
  2F     access byte 2 of data reg
  30     access byte 3 of data reg
  ...

r0 = r0 + 5 would be:

  10 28  copy r0 into accumulator
  05 29  add 5
  28 10  copy back to latch
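
To check that a sequence like this does what it claims, the copy core can be modelled in a few lines of Python. This is only a sketch under assumptions: registers are taken to be 32 bits wide (the eight-digit comments in the listings suggest so), only the ports described so far are modelled, and the Machine class is a made-up name.

```python
MASK = 0xFFFFFFFF  # assumption: 32-bit registers


class Machine:
    def __init__(self):
        self.latch = [0] * 16   # ports 0x10-0x1F
        self.shift = 0          # shift register #0 data, port 0x20
        self.acc = 0            # ALU data register, port 0x28

    def read(self, port):
        if port <= 0x0F:
            return port                          # constant module
        if 0x10 <= port <= 0x1F:
            return self.latch[port - 0x10]
        if port == 0x20:
            return self.shift
        if port == 0x28:
            return self.acc
        if port == 0x29:
            return ~self.acc & MASK              # read: invert
        if port == 0x2A:
            return -self.acc & MASK              # read: negate
        raise ValueError(f"read from port {port:#04x} not modelled")

    def write(self, port, value):
        if 0x10 <= port <= 0x1F:
            self.latch[port - 0x10] = value
        elif port == 0x20:
            self.shift = ((self.shift << 4) | value) & MASK
        elif port == 0x21:
            self.shift = 0                       # reset register
        elif port == 0x28:
            self.acc = value
        elif port == 0x29:
            self.acc = (self.acc + value) & MASK  # write: add
        elif port == 0x2A:
            self.acc = (self.acc - value) & MASK  # write: subtract
        else:
            raise ValueError(f"write to port {port:#04x} not modelled")

    def run(self, program):
        # Every instruction is the same operation: read source, write destination.
        for src, dst in program:
            self.write(dst, self.read(src))


m = Machine()
m.latch[0] = 7
m.run([(0x10, 0x28), (0x05, 0x29), (0x28, 0x10)])   # r0 = r0 + 5
print(m.latch[0])
```

The whole "instruction set" lives in read and write; run itself never needs to change when a module is added.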

You might be able to improve efficiency by making $28 give the latch number to use as a data register, but that would require the ALU to have behind-the-scenes access to the latch module.

Memory access is done by modules mapped in at $80 to $FF. There's also a control bank in $78 to $7F. Each register in the control bank gives the base address of a 16-register bank. Each bank forms a window into memory.

So:

r0 = [$1234] + 5 would be:

  00 21  reset shift reg #0
  01 20  shift reg = 00000001
  02 20  shift reg = 00000012
  03 20  shift reg = 00000123
  04 20  shift reg = 00001234
  20 78  set bank 0 base to $1234
  80 28  read [$1234] from mem into accumulator
  05 29  add 5
  28 10  copy into latch
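
The windowing scheme can be sketched in Python as well. The assumptions here are mine: each port in a window maps to one address unit above the bank's base, memory is a plain dictionary, and unwritten locations read as zero. The class name MemoryWindows is hypothetical.

```python
class MemoryWindows:
    """Ports 0x78-0x7F hold bank bases; ports 0x80-0xFF form eight 16-port windows."""

    def __init__(self, memory):
        self.memory = memory        # dict mapping address -> value
        self.base = [0] * 8         # one base address per bank

    def address_for(self, port):
        bank, offset = divmod(port - 0x80, 16)
        return self.base[bank] + offset   # assumption: one address unit per port

    def read(self, port):
        if 0x78 <= port <= 0x7F:
            return self.base[port - 0x78]
        return self.memory.get(self.address_for(port), 0)

    def write(self, port, value):
        if 0x78 <= port <= 0x7F:
            self.base[port - 0x78] = value
        else:
            self.memory[self.address_for(port)] = value


mem = MemoryWindows({0x1234: 37})
mem.write(0x78, 0x1234)      # 20 78: set bank 0 base to $1234
acc = mem.read(0x80)         # 80 28: read [$1234]
acc += 5                     # 05 29: add 5
print(acc)
```

Once the base is set, ports $81 to $8F reach the following fifteen locations with no further setup, which is why structure access is cheap and random access is not.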

Actually finding the code to execute is done by a module in $70 to $77:

  70     Program counter
  71     Write: add to 70
  72     New program counter
  73     Write: if 0, copy 72 to 70
  74     Write: if non-0, copy 72 to 70
  75     Write: if 0, add 72 to 70
  76     Write: if non-0, add 72 to 70
  77     Instruction latch

This is actually part of the memory access module, as it has internal access to bank $F0 to read instructions. When you write a value into $70, it updates the control register $7F to point to the nearest 16-quad boundary and reads the current quadword into $77. Then, every clock tick, the processor core will execute the bottom two bytes of $77 and shift it right two bytes, incrementing the program counter accordingly. When bit 1 of the program counter is 0, a new quad is loaded into $77. When the bottom six bits are zero (just passed a 16-quad boundary) it will update $7F.
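
The fetch bookkeeping comes down to two bit tests on the program counter. A rough Python sketch, assuming a quad is 32 bits (two instructions) and that the PC always stays two-byte aligned; split_quad and tick are made-up names.

```python
def split_quad(quad):
    """Return the two 16-bit instructions in a 32-bit quad, bottom two bytes first."""
    return [quad & 0xFFFF, (quad >> 16) & 0xFFFF]


def tick(pc):
    """Advance the PC by one instruction and report what the fetch logic must do."""
    pc += 2
    reload_quad = (pc & 0x02) == 0    # bit 1 clear: $77 is empty, load the next quad
    update_bank = (pc & 0x3F) == 0    # bottom six bits clear: passed a 16-quad boundary
    return pc, reload_quad, update_bank


pc = 0x3C
for _ in range(4):
    pc, reload_quad, update_bank = tick(pc)
    print(f"{pc:#06x} reload={reload_quad} bank={update_bank}")
```

Under these assumptions a new quad is needed every second tick, and $7F moves once every 32 instructions.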

Flow of execution is done by writing the desired new PC or offset to $72 and then writing a data value to $73-$76 according to the action to be taken.
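
As a rough model of ports $70 to $76, here is a Python sketch. It deliberately leaves out instruction fetch and the automatic increment, so a relative jump is applied to whatever value the PC holds at the moment of the write; whether that is the address of the current or the next instruction is an assumption the hardware would have to pin down. FlowControl is a hypothetical name.

```python
class FlowControl:
    def __init__(self, pc=0):
        self.pc = pc        # port 0x70: program counter
        self.target = 0     # port 0x72: new program counter or offset

    def write(self, port, value):
        if port == 0x70:
            self.pc = value
        elif port == 0x71:
            self.pc += value                    # unconditional relative jump
        elif port == 0x72:
            self.target = value
        elif port in (0x73, 0x74, 0x75, 0x76):
            want_zero = port in (0x73, 0x75)    # 0x73/0x75 fire on zero, 0x74/0x76 on non-zero
            if (value == 0) == want_zero:
                if port in (0x73, 0x74):
                    self.pc = self.target       # absolute jump
                else:
                    self.pc += self.target      # relative jump
        else:
            raise ValueError(f"port {port:#04x} is not a flow-control port")


fc = FlowControl(pc=0x100)
fc.write(0x72, 4)
fc.write(0x76, 0)       # value is zero, so "if non-0" does nothing
print(hex(fc.pc))
fc.write(0x76, 9)       # non-zero: add the offset to the PC
print(hex(fc.pc))
```

The tested value is simply whatever was copied to the port, so any module's output (the accumulator, a latch, a memory window) can drive a branch directly.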

if (r0 == 1)
  r0 = 0
else
  r0 = 1

. . . becomes:

  10 28  r0 -> accumulator
  01 2A  subtract one
  04 72  set offset to four bytes ahead
  28 76  if non-zero, jump forward by the offset
  00 10  0 -> r0
  02 71  jump forward two bytes
  01 10  1 -> r0

Advantages: incredibly simple. Easily pipelinable (you can say that when using a hardware multiply module, the result is only available 32 clock ticks after triggering the operation; while this is in action you can do other things). Lends itself very nicely to a self-clocked architecture. Highly flexible (want more flexibility? another multiplier? a divider? DSP routines? simply plug in another module). Really nice memory access. Highly optimisable (you can sort of pipe instructions together using the accumulator; you'd only put a value in a latch if you wanted to keep it while performing some other operation). Aesthetically pleasing.

Disadvantages: awkward to program. These days, people use compilers, so that's not much of a problem. Code's a bit bloated, but not as much as I thought. On the ARM, the above example's twelve bytes (three instructions) as opposed to my fourteen, and that's using condition codes a lot. On the PowerPC you can push it down to twelve bytes if you try hard, but gcc produces 32 bytes without optimisation. (With optimisation it's four bytes; xor r0, r0, 1. gcc is too clever for its own good sometimes.)

True random memory access is a bit slow, because you have to update the control bank, but you can access structures up to 16 quads long very easily. The biggest drawback, though, is that there's a huge amount of state. This would make quick context switching difficult. You'd need to save the state of all the latches, the ALU, the MMU, and all the other modules that may or may not be there; a huge amount of information, which would not be easy to extract.

In effect, you get 16 general purpose registers (the latches), an accumulator (the ALU data register), and seven pointer registers (the memory access module control registers).

Is this good, or what? (Everyone choruses: what!)