I’ve just released version 0.4 of Cowgol, my new Ada-inspired programming language for the 6502 on the 6502! Now with much better code generation!
The code generation’s actually not bad now, provided you bear in mind that the code generator is dumb as rocks and intended to run on an 8-bit machine; it’s sufficiently good that despite the new code generation being more complex than the old one (I rewrote big chunks of it), the entire compiler is now a tenth smaller than the previous version. And it runs much faster, too.
Here’s an example; this is the
print() routine in Cowgol:
sub print(ptr: [int8]) loop var c: int8 := ptr; if c == 0 then return; end if; print_char(c); ptr := ptr + 1; end loop; end sub;
And here’s the generated code, fresh from the disassembler, annotated:
L098F: ldy #$00 lda ($40),y --- ptr sta $E93A --- c cmp #$00 bne L099B rts L099B: lda $E93A --- c sta $E93F --- print_char's parameter jsr L097D --- print_char inc $40 --- ptr lo bne L09AA inc $41 --- ptr hi L09AA: jmp L098F rts
That’s pretty okay. Things the code generator can’t do are: remembering flag
status across instructions, which is why the
cmp #$00 is there when it
doesn’t need to be; and remembering the value of registers between basic
blocks. That requires global register analysis and that’s really hard and
memory intensive, so it’s much easier not to bother.
Here’s another, less satisfying example. This is
is used a lot in the code generator.
sub find_first_bit(value: uint8): (mask: uint8) mask := 1; loop if (value & mask) != 0 then return; end if; if mask == 0x80 then break; end if; mask := mask << 1; end loop; mask := 0; end sub;
L30C1: lda #$01 sta $E8F9 --- mask L30C6: lda $E8F8 --- value and $E8F9 --- mask sta $E8FA --- temporary cmp #$00 bne L30D6 jmp L30D7 L30D6: rts L30D7: lda $E8F9 --- mask cmp #$80 bne L30E1 jmp L30EB L30E1: lda $E8F9 --- mask asl a sta $E8F9 --- mask jmp L30C6 L30EB: lda #$00 sta $E8F9 --- mask rts
So the order of the blocks is rather poor — there’s several places where
it’s branching over a jump. It’d be much cheaper to invert the branch and
eliminate the jump completely; replacing
bne L30D6; jmp L30D7; L30D6:rts;
L30D7: with a simple
bne <address of the next ret instruction>. In the
general case the compiler doesn’t do any kind of basic block analysis, so
code has to be emitted in the order it’s seen, so when it sees the branch it
can’t tell where it’s going. When the
if is processed, it doesn’t know the
next statement is a
Some of this will be helped with a peephole optimiser, but this kind of analysis is something the compiler will never do well.
The other issue is that there’s a lot of loading and storing to expensive
16-bit addresses. The compile model requires variables to live in memory —
the registers are used soley as a cache — but one thing it would be nice to
do is to promote commonly-used variables into zero page. This would make the
code not just smaller but faster. This snippet came from
codegen, the main
code generator from the live compiler; it’s currently using 76 zero page
bytes for variables which absolutely have to live in zero page, which leaves
loads of space. Unfortunately, to do properly this would require the compiler
to keep statistics on how often variables are used, which is more complex and
memory-consuming than you’d think (
codegen has 958 variables).
There’s also some trivial optimisations: the anonymous temporary at
is never read from and doesn’t need to be written;
lda $E8F9; asl a; sta
$E8F9 could be optimised to
asl $E8F9… compiler optimisation goes on forever,
but there’s other stuff I want to do now, so I’m not going to follow these up.