I’ve just released version 0.4 of Cowgol, my new Ada-inspired programming language for the 6502 on the 6502! Now with much better code generation!
See the main Cowgol page for more information.
The code generation’s actually not bad now, provided you bear in mind that the code generator is dumb as rocks and intended to run on an 8-bit machine; it’s sufficiently good that despite the new code generation being more complex than the old one (I rewrote big chunks of it), the entire compiler is now a tenth smaller than the previous version. And it runs much faster, too.
Here’s an example; this is the print()
routine in Cowgol:
sub print(ptr: [int8])
loop
var c: int8 := ptr[0];
if c == 0 then
return;
end if;
print_char(c);
ptr := ptr + 1;
end loop;
end sub;
And here’s the generated code, fresh from the disassembler, annotated:
L098F: ldy #$00
lda ($40),y --- ptr
sta $E93A --- c
cmp #$00
bne L099B
rts
L099B: lda $E93A --- c
sta $E93F --- print_char's parameter
jsr L097D --- print_char
inc $40 --- ptr lo
bne L09AA
inc $41 --- ptr hi
L09AA: jmp L098F
rts
That’s pretty okay. Things the code generator can’t do are: remembering flag
status across instructions, which is why the cmp #$00
is there when it
doesn’t need to be; and remembering the value of registers between basic
blocks. That requires global register analysis and that’s really hard and
memory intensive, so it’s much easier not to bother.
Here’s another, less satisfying example. This is find_first_bit()
, which
is used a lot in the code generator.
Source code:
sub find_first_bit(value: uint8): (mask: uint8)
mask := 1;
loop
if (value & mask) != 0 then
return;
end if;
if mask == 0x80 then
break;
end if;
mask := mask << 1;
end loop;
mask := 0;
end sub;
Disassembly:
L30C1: lda #$01
sta $E8F9 --- mask
L30C6: lda $E8F8 --- value
and $E8F9 --- mask
sta $E8FA --- temporary
cmp #$00
bne L30D6
jmp L30D7
L30D6: rts
L30D7: lda $E8F9 --- mask
cmp #$80
bne L30E1
jmp L30EB
L30E1: lda $E8F9 --- mask
asl a
sta $E8F9 --- mask
jmp L30C6
L30EB: lda #$00
sta $E8F9 --- mask
rts
So the order of the blocks is rather poor — there’s several places where
it’s branching over a jump. It’d be much cheaper to invert the branch and
eliminate the jump completely; replacing bne L30D6; jmp L30D7; L30D6:rts;
L30D7:
with a simple bne <address of the next ret instruction>
. In the
general case the compiler doesn’t do any kind of basic block analysis, so
code has to be emitted in the order it’s seen, so when it sees the branch it
can’t tell where it’s going. When the if
is processed, it doesn’t know the
next statement is a return
.
Some of this will be helped with a peephole optimiser, but this kind of analysis is something the compiler will never do well.
The other issue is that there’s a lot of loading and storing to expensive
16-bit addresses. The compile model requires variables to live in memory —
the registers are used soley as a cache — but one thing it would be nice to
do is to promote commonly-used variables into zero page. This would make the
code not just smaller but faster. This snippet came from codegen
, the main
code generator from the live compiler; it’s currently using 76 zero page
bytes for variables which absolutely have to live in zero page, which leaves
loads of space. Unfortunately, to do properly this would require the compiler
to keep statistics on how often variables are used, which is more complex and
memory-consuming than you’d think (codegen
has 958 variables).
There’s also some trivial optimisations: the anonymous temporary at $E8FA
is never read from and doesn’t need to be written; lda $E8F9; asl a; sta
$E8F9
could be optimised to asl $E8F9
… compiler optimisation goes on forever,
but there’s other stuff I want to do now, so I’m not going to follow these up.