Two years ago (really? Two years?) I did some writeups on performing arithmetic on the 6502 and Z80 processors in the context of my Cowgol compiler. Well, I'm now working on a code generator for the 6303, so it's time for another one!
Updated: added some stuff, rewrote some stuff which was wrong.
The 6303 is a member of the venerable 6800 processor series by Motorola, dating from 1983. It's supports the 6800 instruction set plus some extremely useful extensions to do 16-bit operations. It's a second-source reimplementation by Hitachi of the 6803, and Hitachi took the opportunately to add a few more useful instructions. Both the 6803 and the 6303 are very much cousins to the 6809, which did something similar but in a different way, and which still remains one of the best 8-bit CPUs of all time. They were typically clocked at 1 or 2MHz: compare with the 6502's 2MHz or the Z80's 4MHz.
It's got three registers: A, B, and X. A and B are accumulators and are eight bit wide, and can also be combined to form a sixteen bit accumulator, D. X is the index register, and is also sixteen bits wide. Nearly all operations work on A or B (or D); there's a very limited set of operations you can perform on X. There's a decent set of both eight and sixteen bit operations, which means that working in these sizes is a relative no-brainer. There's even a hardware 8x8=16 multiply instruction.
Instructions are mostly 2-op, where one parameter is always an accumulator and the other is in memory. The result is always an accumulator. Some instructions operate on only registers but they're exceptions to the rule. There's an addressing mode which provides fast access to zero page, like on the 6502.
Let's try some examples.
Note that Cowgol's a 3op memory machine, so variables live in memory and are only cached in registers. So, all my examples below start and end with values in memory, so that I can compare like with like. In real life, values from one expression will remain in registers for the next, and so won't need to be reloaded; the examples are all deliberately choosing the worst possible case.
Although: since the last two posts, Cowgol's been rewritten entirely, from the ground up, and it's now smaller, simpler, produces better code, and is better in every way. It's also much less of a 3op memory machine than the old version was. For consistency I'd better stick with 3op examples, unfortunately.
As with the other posts in this series, please take note that this document is me thinking out loud, and I'm not an expert on the 6303. Bugs ahoy!
⇡Simple 8-bit arithmetic
Let me try
3 96 YY 00 ldaa y 4 BB ZZ 00 adda z 4 B7 XX 00 staa x (11 cycles, 9 bytes.)
This is almost identical structure-wise to the 6502 (10 bytes, 14 cycles), but is quite different to the Z80 (11 bytes, 47 cycles) due to the Z80's limitations on addressing modes and loads and stores.
Again, in real life I'd most likely want to keep the result in a register, and so the final store or the initial load wouldn't be required. Working with constants is trivial, too.
Indirecting through pointers requires the address to be loaded into the X register. So,
3 CE QQ 00 ldx #q 4 E6 00 lda 0, x 3 CE RR 00 ldx #r 4 AB 00 adda 0, x 3 CE PP 00 ldx #p 4 A7 00 sta 0, x (21 cycles, 15 bytes.)
Indirecting through X gives a free addition, which I'm not using in this case, and so I'm always passing in 0. As always, real life won't be this bad.
The equivalent code on the 6502 is surprisingly nasty, requiring the pointers to be set up ahead of time in zero page and then they get dereferenced using Y as an index register, which must be set to zero. It's easier on the Z80 which allows dereferencing through any sixteen bit register (although I'm penalised if I use any register other than HL for this).
But so far, this actually looks pretty good.
Good news! The 16-bit version is exactly the same!
4 DC YY 00 ldd y 5 D3 ZZ 00 addd z 4 DD XX 00 std x (13 cycles, 9 bytes.)
This compares extremely favourably to the 6502's 26 cycles and 19 bytes, or the Z80's 63 cycles and 11 bytes.
addd have immediate and indexed addressing modes, too. Unfortunately, I can't push or pop D directly to or from the stack, instead being required to work with A and B seperately (3 cycles, 2 bytes), but this is almost ideal, 16-bit-arithmetic-on-a-8-bit-cpu-wise. So, let's move on.
This is where things start becoming interesting. The problem is that the 6303 only has four bytes of register, including X. There's not enough space to hold both operands or the result in memory. The CPU handbook actually has an example, and it suggests using XD as a 32-bit accumulator and keeping the other parameter in memory:
5 D3 RR 02 addd rhs+2 2 18 xgdx # swaps D and X 3 C9 RR 01 adcb rhs+1 3 99 RR 00 adca rhs+0 2 18 xgdx # swaps D and X back again (15 cycles, 11 bytes.)
This shows one problem: there's a 16-bit
addd instruction, but there's no corresponding
addd sets the carry, but does not honour it. Therefore, I have to do the high half of the add with two individual 8-bit
The result's not bad but it requires the use of X. This means that if I need to use X for something else, this all horribly breaks down and I have to use a different algorithm entirely.
Given I'm using memory-to-memory examples everywhere else, I should show the naive example for
4 DC YY 02 ldd y+2 5 D3 ZZ 02 addd z+2 4 DD XX 02 std x+2 4 DC YY 00 ldd y+0 3 C9 ZZ 01 adcb z+1 3 99 ZZ 00 adca z+0 4 DD XX 00 std x+0 (27 cycles, 21 bytes.)
This looks pretty bad; but it compares favourably to the 6502 and Z80. It's also easily adaptable to using immediate constants. It doesn't use X, but changing it to use a pointer is pretty nasty, as every pointer needs loading into X twice (once for the low word and once for the high word).
Normally at this point I would investigate using clever tricks such as a loop... but I can't. I am required to use A and B to do the arithmetic. This just leaves X for the loop counter, which it's actually well suited to as there's a
DEX instruction which leaves the carry bit untouched, but I still need to modify the address being read; and, unlike the 6502, the 6303 limits the offset being added to the index register to 8 bits. So, where on the 6502 I can do
X to read the Xth byte of the value at
address, I can't do that on the 6303 unless
address is in zero page.
The rather startling conclusion here, or at least startling to me, is that the naive unrolled version of the arithmetic is the only reasonable version.
⇡Not addition or subtraction
What about operations which don't have a native 16-bit operation? For example,
4 DC YY 02 ldd y+2 3 D8 ZZ 03 eorb z+3 3 98 ZZ 02 eora z+2 4 DD XX 02 std x+2 4 DC YY 00 ldd y+0 3 D8 ZZ 01 eorb z+1 3 98 ZZ 00 eora z+0 4 DD XX 00 std x+0 (28 cycles, 24 bytes.)
At this point there shouldn't be any surprises. The implementation limits me to the obvious code. That's not necessarily a disadvantage, but I can't help feeling a little disappointed. On the 6502, the shortest version of this is a loop, which can cram things down to 14 bytes; but more cycles. The Z80 requires 14 bytes for a register-to-register xor, but then also requires an additional 32 bytes to read and write the 32-bit values.
The big problem so far is that the examples all start in memory and end in memory. This is unrealistic, particularly with the new version of Cowgol. Most likely I'm going to want to operate on the result again before writing it to a variable (or doing something else with it).
This means I want to try and split the operations above into three: load the LHS into the accumulator, operate on the accumulator, and write the result back. This way, two operations can be pipelined by leaving out the write/read stages.
This is easy for 16-bit code. Consider
4 DC QQ 00 ldd q 5 D3 RR 00 addd r 5 D3 SS 00 addd s 4 DD XX 00 std p (18 cycles, 12 bytes.)
The intermediate result is left in D at all times. But it's not so easy for 32-bit code. The handbook example suggested using XD as an accumulator:
4 DE QQ 00 ldx q+0 # LHS hi 4 DC QQ 02 ldd q+2 # LHS lo 5 D3 RR 02 addd r+2 2 18 xgdx 3 C9 RR 01 adcb r+1 3 99 RR 00 adca r+0 2 18 xgdx 5 D3 SS 02 addd s+2 2 18 xgdx 3 C9 SS 01 adcb s+1 3 99 SS 00 adca s+0 2 18 xgdx 4 DF PP 00 stx p+0 # dest hi 4 DD PP 02 std p+2 # dest lo (46 cycles, 34 bytes.)
This looks a bit gruesome, but it's only 15 cycles and 11 bytes per additional addition. The problems arise when I need to use X for pointer dereferencing. With DX in use, I have no working registers remaining; my only option is to push something onto the stack; this has to be X, as pushing D is expensive. The naive approach is this:
4 3C pshx 3 CE PP 00 ldx #ptr 6 E3 02 addd 2, x 5 58 pulx 2 18 xgdx 4 3C pshx 3 CE PP 00 ldx #ptr 4 E9 01 adcb 1, x 4 A9 00 adca 0, x 5 58 pulx 2 18 xgdx (42 cycles, 18 bytes)
That's taking almost as much time as our entire additional addition (those pushes and pops are expensive); surprisingly, the code size isn't too bad. Alternatively, we could use a temporary variable in zero page, which is both fast and small to access:
4 DF T1 stx temp1 # temp1 = old acc high word 3 CE PP 00 ldx #ptr 6 E3 02 addd 2, x 4 DD T2 std temp2 # temp2 = new acc low word 4 DC T1 ldd temp1 4 E9 01 adcb 1, x 4 A9 00 adca 0, x 2 18 xgdx # move new acc high word to X 4 DC T2 ldd temp2 # reload new acc low word to D (35 cycles, 18 bytes)
The big advantage here is that we don't have to reload X. It's no smaller, but definitely cheaper. However, the code generator will never actually want to do this. In this situation, the code being compiled is something like
[z]. By the time it comes to do the addition, the pointer value of
z has already been evalatued and is in a register. This will by necessity have evicted XD. So, we're actually going to want to add something which is in a temporary with a value dereferenced via X. This becomes:
; on entry: X is a pointer to the RHS and the LHS is in temp1 4 DC T1+2 ldd temp1+2 # low word of LHS 6 E3 02 addd 2, x 4 DD T1+2 std temp1+2 4 DC T1+0 ldd temp1+0 # high word of LHS 4 E9 01 adcb 1, x 4 A9 00 adcb 0, x 2 18 xgdx # high word of result to X 4 DC T1+0 ldd temp1+0 # low word of result to D (26 cycles, 15 bytes)
This looks small because it's not including the extra cost of evicting the LHS, which is a
ldd pair; eight cycles, four bytes. This puts the total to 34 cycles and 19 bytes, which is comparable to the non-eviction case.
We also need fragments which will load and save XD via a pointer too. This is equally fiddly. To save:
4 3C pshx 3 CE PP 00 ldx #ptr 5 ED 02 std 2, x 4 32 pula 4 33 pulb 5 ED 02 std 0, x (25 cycles, 10 bytes)
It would be possible to use temporaries here, too, which would use one extra byte but shave off one cycle. Loading actually lets us use a different, cheaper strategy:
3 CE PP 00 ldx #ptr 5 EC 02 ldd 2, x 5 EE 00 ldx 0, x (13 cycles, 7 bytes)
From a code generation perspective, this is all actually straightforward, as it means that all I have to do is to glue code fragments together. But it would be nice to have better ways of doing this. The 6303 is very constrained on registers; having just one extra 16-bit register would make all the difference, and in fact, that's what the 6811 did.
But working with the code above makes something else very obvious: the indexed and direct addressing modes are small. They take only one byte of payload, versus two for a full address. Consider our original 32-bit
z. If all three variables are on a stack frame and X is pointing at the stack frame base, it becomes:
5 EC YY ldd y+2, x 5 E3 ZZ addd z+2, x 5 ED XX std x+2, x 5 EC YY ldd y+0, x 4 E9 ZZ adcb z+1, x 4 A9 ZZ adca z+0, x 5 ED XX std x+0, x (33 cycles, 14 bytes.)
That's up from 27 to 33 cycles, but down from 24 to 14 bytes --- this code is nearly half the size, but only six cycles slower. The processor handbook's recommended code for
mem is 15 cycles and 11 bytes, so this code is three bytes longer... and also manages to write the result back to memory!
The disadvantage is that we have to have X always pointing at the stack frame. This means that, at least for 32-bit arithmetic, we can't use XD as a 32-bit accumulator. Variables which are not in a stack frame have to be accessed by direct address, of course.
What about our
s case? Because nothing's cached in a register, it's just the above code, twice. So, 66 cycles, 28 bytes. Our optimised register-centric version above is 46 cycles and 34 bytes. Using indexing in this case is quite a lot slower, but also quite a lot smaller.
Dereferencing pointers? These must use X for the pointer; so, we end up having to evict the stack frame pointer. It's most efficient to load any input parameters into temporaries before the computation, and write them back after, so we can For
5 EC QQ ldd q, x 4 DD TQ std TQ 5 EC RR ldd r, x 4 DD TR std TR 5 EC PP ldd p, x 5 DD TP std TP 4 DE TQ ldx TQ 5 EC 02 ldd 2, x 4 DE TR ldx TR 5 E3 02 addd 2, x 4 DE TP ldx TP 5 A9 00 std 2, x 4 DE TP ldx TQ 5 EC 00 ldd 0, x 4 DE TR ldx TR 4 E9 01 adcb 1, x 4 A9 00 adca 0, x 4 DE TP ldx TP 5 ED 00 std 0, p 3 CE FF FF ldx #<stack frame> (88 cycles, 41 bytes.)
That's... ehh. But the version using XD above is 36 bytes. So that's very little worse, and rather simpler. This is also the most pathological case; in real life it's rare to have pointers on every side of an operation.
There are, by the way, savings for 16 and 8 bit arithmetic too.
So the moral here appears to be not to try to be clever.
⇡Some other random stuff
There's a 8-bit multiplication instruction. This is very convenient for 8-bit arithmetic, but requires some work for 16-bit arithmetic. It is possible to do a 16-bit multiplication (16x16=16) using three 8x8=16 multiplies. For
3 96 YY 00 ldaa y+0 3 D6 ZZ 00 ldab z+0 10 3D mul 5 FD XX 00 std x+0 3 96 YY 01 ldaa y+1 3 D6 ZZ 00 ldab z+0 10 3D mul 3 DB XX 01 addb x+1 4 F7 XX 01 stab x+1 3 96 YY 00 ldaa y+0 3 D6 ZZ 01 ldab z+1 10 3D mul 3 DB XX 01 addb x+1 4 F7 XX 01 stab x+1 (67 cycles, 23 bytes.)
This (if I got it right, which I may well not have) does a long multiplication in base 256. Sadly, there's no division instruction, so that has to happen the slow way.
It's possible to add B to A, but not A to B. Also, it's possible to add B to X, but not D to X --- and there's also no way to subtract B from X.
There are 16-bit wide left shift and logical right shift instructions (although just one bit at a time). There is no 16-bit wide arithmetic right shift instruction.
The 6303 is capable of performing both signed and unsigned comparisons, natively. Say it with me: allelujah!
There is a BRN 'branch never' instruction, which even the documentation points out is completely useless.
Branch instructions have 8-bit relative offsets, making them compact, but there are no long versions, so making the compiler writer's job much harder. There is also a
BSR 8-bit relative subroutine call instruction.
The 6303 is a elegant, simple, quite fast, and almost completely unknown processor. I have no idea why: the 6809 came out the same year, so maybe it was eclipsed by that? Maybe it was thought of as uninteresting because it was a reimplementation of the 6803 from 1974?
I've seen 6303s multiple times as embedded controllers --- they have a lot of useful internal peripherals, plus 128 bytes of RAM, so all you need is an external program ROM and some I/O for a basic system. (There was also a 6301 with an internal ROM, so you could even do without that.) And yet they're so little talked about that even finding the year of release was hard.
I don't know. This seems unfair; I really rather like this thing. It needs more registers, and it needs a bit more orthogonality (
PULD instructions!), but you could build a very nice microcomputer around this. I do know that it was used in the Psion Organiser, the very first PDA from 1984.
Anyway, now I know how 32-bit arithmetic works, I need to get back to the compiler implementation...