## ⇡Introduction

So I've been working on a compiler called Cowgol. This is intended to produce code for very small 8-bit machines, primarily the 6502 (because I grew up on BBC Micros, and also because I'm apparently a masochist). This is turning out to be rather hard.

Main Cowgol page

The problem is that the 6502 is inherently an 8-bit device, and only supports 8-bit arithmetic. Trying to do anything wider involves breaking the operation into smaller pieces, which is expensive --- while 6502 instructions are three bytes long, tops, I need a lot of them to do anything. It takes the 6502 ten to twenty bytes to do something which its closest rival, the Z80, could do in about five (and which an 8086 can do in two). And I only have 64kB to play with, so it's important to be efficient.

This document is me thinking out loud about how this is going to work.

Important note: this is actually the second version of this post --- the first one had a hideous and embarrassing error where my loops all counted in the wrong direction, because I had mentally conflated subtraction and comparison. This has been fixed. (The compiler gets this right.)

Update, 2018-06-15: the page got posted to HackerNews and I got a tonne of comments (and 35,000 hits). I've added a section at the bottom discussing them.

## ⇡The core problem

My compiler is a memory-memory 3op machine. Its backend generates instructions which essentially look like a := b + c where a, b and c are all variables in memory. Operands can be dereferenced or offsetted.

Let me consider the operation a := b + c.

For the purposes of example, I'll assume that my three variables are at 0xaaa0, 0xbbb0 and 0xccc0 respectively. I then wrote out the naive version of the code. (The first column is the cycle count, the second contains the bytes of machine code, the third is the instructions themselves.)

2   18         clc
4   ad b0 bb   lda b+0
4   6d c0 cc   adc c+0
4   8d a0 aa   sta a+0
4   ad b1 bb   lda b+1
4   6d c1 cc   adc c+1
4   8d a1 aa   sta a+1
(26 cycles, 19 bytes.)

That's kind of ludicrous.

Things get a little better if I'm working with constants, because they're cheaper. a := b + 0x123 becomes:

2   18         clc
4   ad b0 bb   lda b+0
4   8d a0 aa   sta a+0
4   ad b1 bb   lda b+1
4   8d a1 aa   sta a+1
(22 cycles, 17 bytes.)

But things get worse when I have to deal with dereferences. The 6502 can only handle pointers in zero page, and they must be indexed with the Y register. So a := b + c[0] becomes:

2   18         clc
4   ad b0 bb   lda b+0
2   a0 01      ldy #0
5   71 c0      adc (c), y
4   8d a0 aa   sta a+0
4   ad b1 bb   lda b+1
2   a0 01      ldy #1
5   71 c0      adc (c), y
4   8d a1 aa   sta a+1
(32 cycles, 21 bytes.)

I can actually save one byte by replacing that last ldy #0 with a dey --- I know that the old value was 1, so decrementing it will give the desired value of 0 --- but I'm going to leave it like this for simplicity.

The problem here is that there's only one Y register, so if different operands need different offsets, I have to keep reloading it. Imagine I have to do a[0] := b[2] + c[4]. What does that turn into?

2   18         clc
2   a0 02      ldy #2
5   b1 b0      lda (b), y
2   a0 04      ldy #4
5   71 c0      adc (c), y
2   a0 00      ldy #0
6   91 a0      sta (a), y
2   a0 03      ldy #3
5   b1 b0      lda (b), y
2   a0 05      ldy #5
5   71 c0      adc (c), y
2   a0 01      ldy #1
6   91 a0      sta (a), y
(46 cycles, 25 bytes.)

## ⇡Simple loops

The obvious thing is to say: there's a lot of repeated work here, can't I use a loop? And, in fact, I can. But there are problems.

The biggest problem is that multibyte arithmetic has to be processed in LSB to MSB order, with the carry propagating from one byte to the next; but on the 6502 it's so much easier to count down than it is to count up... for example, this is a loop counting down:

ldy #4
.label
...do work here...
dey
bpl label

But here is a loop counting up:

ldy #0
.label
...do work here...
iny
cpy #4
bne label

The difference is that when counting down, we get the N flag set automatically when the number becomes negative, so the comparison is free --- terminating when the index passes 0 is easy. But when counting up, we need an explicit comparison. And that comparison instruction trashes the carry flag, which we're using as part of our computation...

I could save the carry on the stack, like this:

clc
php
ldy #0
.label
plp
...do work here...
php
iny
cpy #4
bne label
plp

This does, indeed, work fine. But php and plp aren't that cheap, being 3 and 4 cycles respectively, and needing to set up and tidy afterwards is annoying.

Once I did the maths I came to the counter-intuitive conclusion that it's actually cheaper to use both index registers, and have one counting up for use when offsetting, and one counting down to tell us when we've finished the loop. So it becomes:

clc
ldx #4
ldy #0
.label
...do work here...
iny
dex
bpl label

If I use this loop to implement my very first example, a := b + c, then I get this:

2   18         clc
2   a2 01      ldx #1
2   a0 00      ldy #0
.label
4   89 b0 bb   lda b, y
4   79 c0 cc   adc c, y
5   99 a0 aa   sta a, y
2   c8         iny
2   ca         dex
2   10 f5      bpl label
(45 cycles, 18 bytes)

My original was 26 cycles and 19 bytes --- my improved version is only one byte shorter, and almost half the speed!

And it's worse. I can only use the loop because all three operations (the lda, adc and sta) use the same Y register. The loop doesn't work for the a[0] := b[2] - c[4] case at all.

The only way to achieve this is to emit code to compute aa := &a[0], bb := &b[2], cc := &c[4] and then compute aa[0] := bb[0] - cc[0] using a loop. But each of those is an addition in its own right, so I'm actually doing four additions in place of my original one.

This, in fact, what my compiler currently does. I did this partly because it made the code simpler, but also because I wasn't expecting to need to do this much, and memory pressure is a real problem; but in real life, it happens all the time: consider length := event.end_time - event.start_time; the destination is an ordinary variable, but the two operands are pointer dereferences at different offsets. And it turns out that the extra computations aren't just a net loss, they're a significant dead loss. My code would be faster and smaller if I did it the long way!

## ⇡Conclusion

So it appears that trying to shorten arithmetic using loops generally isn't worth it:

• it's limited to only those cases where the indices for all the operands are the same, which rules out a lot of common situations
• it's only really worth it for operations which can be computed from MSB to LSB
• even when it is worth it, it's not worth it very much, and it's significantly slower

And yet... it is still worth it, in some situations. And I am very tight on memory. (Anything bigger than 16 bits, for example, pretty much has to be done with a loop. Imagine a 32-bit addition down longhand.)

What cunning tricks are there?

## ⇡Negative offsets

I didn't think this one up, sadly. I wish I had; it's clever.

If all the operands are absolute addresses, then I start with a negative offset and count up to zero (and remember, I get the comparison with zero for free). The end result looks like this:

2   18         clc
2   a0 fe      ldy #-2
.label
4   89 b2 bb   lda b+2, y
4   79 c2 cc   adc c+2, y
4   99 a2 aa   sta a+2, y
2   c8         iny
2   30 f4      bmi label
(36 cycles, 15 bytes.)

(For comparison, the longhand version is 25 cycles, 19 bytes.)

The first time round, Y is -2. The addresses in the instructions are incremented by 2, so the overall real offset is 0. The next time round, Y is -1. Then we increment Y to 0, it stops being negative, and the loop exits.

This allows me to do our addition without needing the other index register to keep track of how many iterations I've had, which saves about 20% of bytes.

The downside is that this won't work for pointers. All three operands must be absolute addresses, so the offset can be baked into them.

## ⇡The return of the X register

Because this allows us to count up to 0 using a single register, we can use the other register for a different index.

Consider a := b - c[7].

2   18         clc
2   a0 01      ldx #-2
2   a2 08      ldy #7
.label
4   bd b2 bb   lda b+2, x
4   71 c0      adc (c), y
4   9d a2 aa   sta a+2, x
2   c8         iny
2   e8         inx
2   30 f6      bmi label
(42 cycles, 17 bytes.)

For comparison, here it is longhand.

2   18         clc
4   ad b0 bb   lda b+0
2   a2 08      ldy #y
4   71 c0      adc (c), y
4   8d a0 aa   sta a+0
4   ad b1 bb   lda b+1
2   c8         iny
4   71 c0      adc (c), y
4   8d a1 aa   sta a+1
(30 cycles, 20 bytes.)

The loop is three bytes shorter --- that's pretty marginal. And we're limited in which addressing modes me use; at least one operand must use absolute addressing, so we can use the negative index for it and bake in the adjustment factor.

Of course, if we have more than one pointer, and they use different offsets, it doesn't work.

## ⇡The last resort

The ultimate in code size is, of course, not to use machine code at all. At one point I did this:

20 xx yy  jsr __add16
a0 aa     equw &a
b0 bb     equw &b
c0 cc     equw &c
(Many cycles, 9 bytes.)

What the helper function does is to pop the return address of the stack, copy the next six bytes into a parameter area, perform the computation (remember that the parameters here all point at the destination variables), and then push the updated return address back onto the stack.

It worked! And it was pretty dense code, although you need to bear in mind that the cost in having to compute pointers made it totally not worth it. But it was creakingly slow. Simply copying the values off the stack was pretty bad; something like this:

; pop return address into temporary
pla
pla

; copy parameters
ldy #5
.label
sta (parameters), y
dey
bpl label

; increment return address to skip parameters
clc
bcc dont_increment_high_byte
.dont_increment_high_byte

jsr actually_perform_the_computation

; return
jmp (address)

So it's actually spending more time preparing the computation than it is performing it.

This works. Don't do it unless you really have to (and if you do, you probably want to use a bytecode interpreter instead).

## ⇡Success through cheating

If I'm willing to redefine the problem, I get to solve it differently. This can lead to substantial savings...

Let's say that I want to do a := a + 1. The 6502 actually allows cheap in-place increments without using registers at all. The code ends up like this:

6   ee a0 aa   inc a+0
2   d0 03      bne label
6   ee a1 aa   inc a+1
.label
(14 cycles maximum, 8 cycles minimum, 8 bytes.)

It's super cheap! And I don't need to save any registers! But inc only supports absolute addresses, so I can't use this when indirecting pointers.

This works because inc sets the Z flag of the low byte based on the result; since I've incremented it, I know therefore that if the result is zero it must have overflown from 0xff to 0x00 and I need to increment the high byte as well. And of course, if I don't have to increment the low byte, I don't execute the second inc instruction at all, so making the common case even cheaper. (And I can chain this technique for wider values, at the cost of five bytes per eight bits of arithmetic.)

Sadly, I can't use this trick for decrements. In this case, I need to decrement the high byte when the low byte underflows from 0x00 to 0xff. There isn't an automatic CPU check for this, so I need to rearrange the code:

4   ae a0 aa   ldx a+0
2   d0 03      bne label ; test needs to happen *before* the decrement
6   ce a1 aa   dec a+1   ; don't need to do the decrements in order
.label
6   ce a0 aa   dec a+0
(18 cycles maximum, 13 minimum, 11 bytes.)

It's more expensive than the increment, but not by much, and it's chainable for longer operations. (Thanks to vardump@hackernews for suggesting this; it's a lot better than my previous code.)

## ⇡Concluding conclusion

So I think that the most obvious option is to just use longhand code. I suspect the extra flexibility in not being limited to particular combinations of addressing modes is worth it.

But, that said, I do need to keep an eye out for common instruction patterns. If my compile generates a := b + c a lot --- and I suspect it does --- then the negative offset trick above will let me save 4/19 = 20% of my code; which is not to be sniffed at.

But it's probably only worth doing this for certain common instruction patterns.

I remain open to convincing. Please comment if you think I've missed anything!

Disclaimer: all code fragments in this post are for entertainment value only. Minor bugs are entirely deliberate, honest.

## ⇡Postultimate conclusion

### ⇡Why not use stack addressing / a bytecode interpreter / SWEET-16?

Yes, a lot of the 6502 code verbosity can be solved by using an interpreter. Stack operations are great for dense code, but even a well-chosen register based instruction set can provide good code density.

When talking about the 6502, the best example of this is Steve Wozniak's SWEET-16, which is a tiny bytecode interpreter built in to the Apple II ROM (and copied extensively); it's a miracle of tiny code, being only 372 bytes. It supports a simple instruction set with sixteen 16-bit registers, and it's intended for doing non-performant tasks like initialisation or application shells. It achieves its size mainly by careful selection of instructions, so that the side effects of executing the instruction become useful (e.g. dereferencing a pointer increments it), reducing wasted work.

Porting SWEET-16

A collection of articles (including a disassembly) discussing SWEET-16 and how it works.

The downside of all these approaches is speed. SWEET-16, fast though it is, runs at about a tenth of the speed of native 6502 code. Plus, as the entire goal of Cowgol is to produce native machine code, they don't help me.

The 6502 has no cache or pipeline, which means it's a great candidate for self-modifying code --- I can modify the instructions I'm currently executing and be sure that my changes will actually be picked up.

Unfortunately... it turns out that modifying self-modifying code is really expensive. Assuming I have a routine in zero page, it still costs three cycles to write a byte there (and I have to get the byte from somewhere, which costs cycles too). The amount of work needed to patch values into an extended addition routine outweighs the amount of work done to just done the addition; particularly when I take into account the extra work needed to call the routine (jmp is three cycles, jsr and rts are six). It's actually cheaper not to use.

Self-modifying code is most useful (IMO) for storing variables. Bogomandel, an ultra-fast Mandelbrot generator for the 6502, uses it extensively.

Bogomandel

A writeup of Bogomandel and what makes it tick.

By storing values inside instruction operands, I can not only save a cycle or two when looking them up, but I can save a byte or two of space:

2   a9 10      lda #0x10
4   8d a1 aa   sta loop+1
.loop
2   a9 ff      lda #0xff ; payload here overwritten by loop
...do something here...
6   ce a1 aa   dec loop+1
2   d0 xx      bne loop

But it needs careful planning (I obviously couldn't use this code in ROM) and I don't think it helps my arithmetic; the cost savings are very limited (one cycle!).

### ⇡Do I have to store numbers little-endian?

No, I don't.

Obviously values which are used as pointers do have to be little-endian, because that's how the 6502 expects to read them. But there's nothing to stop me storing everything else in big-endian byte order, with the MSB at the lowest address and the LSB at the highest. This allows my arithmetic loops to count down (which is cheap) while still processing data from LSB to MSB.

The biggest downside is that my compiler now needs to have two completely different sets of arithmetic routines; one for little-endian, and one for big-endian, and it needs to decide which representation to use. So, lots of added complexity. Still totally worth doing if I ever need to shave a few more cycles off, however.