I’ve just released version 0.4 of Cowgol, my new Ada-inspired programming language for the 6502 on the 6502! Now with much better code generation!

Main Cowgol page

See the main Cowgol page for more information.

The code generation’s actually not bad now, provided you bear in mind that the code generator is dumb as rocks and intended to run on an 8-bit machine; it’s sufficiently good that despite the new code generation being more complex than the old one (I rewrote big chunks of it), the entire compiler is now a tenth smaller than the previous version. And it runs much faster, too.

Here’s an example; this is the print() routine in Cowgol:

sub print(ptr: [int8])
    loop
        var c: int8 := ptr[0];
        if c == 0 then
            return;
        end if;
        print_char(c);
        ptr := ptr + 1;
    end loop;
end sub;

And here’s the generated code, fresh from the disassembler, annotated:

L098F:  ldy     #$00
        lda     ($40),y  --- ptr
        sta     $E93A    --- c
        cmp     #$00
        bne     L099B
        rts
L099B:  lda     $E93A    --- c
        sta     $E93F    --- print_char's parameter
        jsr     L097D    --- print_char
        inc     $40      --- ptr lo
        bne     L09AA
        inc     $41      --- ptr hi
L09AA:  jmp     L098F
        rts

That’s pretty okay. Things the code generator can’t do are: remembering flag status across instructions, which is why the cmp #$00 is there when it doesn’t need to be; and remembering the value of registers between basic blocks. That requires global register analysis and that’s really hard and memory intensive, so it’s much easier not to bother.

Here’s another, less satisfying example. This is find_first_bit(), which is used a lot in the code generator.

Source code:

sub find_first_bit(value: uint8): (mask: uint8)
    mask := 1;
    loop
        if (value & mask) != 0 then
            return;
        end if;
        if mask == 0x80 then
            break;
        end if;
        mask := mask << 1;
    end loop;
    mask := 0;
end sub;

Disassembly:

L30C1:  lda     #$01
        sta     $E8F9  --- mask
L30C6:  lda     $E8F8  --- value
        and     $E8F9  --- mask
        sta     $E8FA  --- temporary
        cmp     #$00
        bne     L30D6
        jmp     L30D7
L30D6:  rts
L30D7:  lda     $E8F9  --- mask
        cmp     #$80
        bne     L30E1
        jmp     L30EB

L30E1:  lda     $E8F9  --- mask
        asl     a
        sta     $E8F9  --- mask
        jmp     L30C6
L30EB:  lda     #$00
        sta     $E8F9  --- mask
        rts

So the order of the blocks is rather poor — there’s several places where it’s branching over a jump. It’d be much cheaper to invert the branch and eliminate the jump completely; replacing bne L30D6; jmp L30D7; L30D6:rts; L30D7: with a simple bne <address of the next ret instruction>. In the general case the compiler doesn’t do any kind of basic block analysis, so code has to be emitted in the order it’s seen, so when it sees the branch it can’t tell where it’s going. When the if is processed, it doesn’t know the next statement is a return.

Some of this will be helped with a peephole optimiser, but this kind of analysis is something the compiler will never do well.

The other issue is that there’s a lot of loading and storing to expensive 16-bit addresses. The compile model requires variables to live in memory — the registers are used soley as a cache — but one thing it would be nice to do is to promote commonly-used variables into zero page. This would make the code not just smaller but faster. This snippet came from codegen, the main code generator from the live compiler; it’s currently using 76 zero page bytes for variables which absolutely have to live in zero page, which leaves loads of space. Unfortunately, to do properly this would require the compiler to keep statistics on how often variables are used, which is more complex and memory-consuming than you’d think (codegen has 958 variables).

There’s also some trivial optimisations: the anonymous temporary at $E8FA is never read from and doesn’t need to be written; lda $E8F9; asl a; sta $E8F9 could be optimised to asl $E8F9… compiler optimisation goes on forever, but there’s other stuff I want to do now, so I’m not going to follow these up.