Good news! After literally years of (very on and off work), I've just release Cowgol 2.0, my very small language for very small systems! It's simpler, smaller, less complex, and is better in every way.

The main Cowgol page

Documentation and downloads for the Cowgol compiler.

But for now, I'd like to talk about a subject which is very dear to my heart in these times and which has been troubling me: jump tables.

Problem statement

The Cowgol language has a switch statement. It looks like this:

case i is
  when 2: print("two!");
  when 3: print("three!");
  when 4: print("four!");
end case;

The implementation is very naive, and essentially turns it into this:

if i == 2 then
  print("two!");
elseif i == 3 then
  print("three!");
elseif i == 4 then
  print("four!");
end if;

The Cowgol compiler itself uses case statements a lot, especially inside the parser and code generator. There are hundreds of when statements.

It seems intuitively obvious, which means that it's obvious until I thought about it, that replacing these long if-chains with jump tables would lead to a substantial code size improvement. Think of all the overhead that I could save!

So, imagine my frustration when it turned out to substantially increase the size of the code (as well as make the compiler itself more complex).

An if-chain

Let's look at that example above on the 8080.

The naive implementation is actually a bit smarter than a simple if-chain, as it manages to keep the value in a register throughout and avoid having to reload it each time. It looks like this. (Interesting instructions are annotated with the size in bytes.)

  entry:
    lda i        # load the value being checked into a
  item_2:
2   cpi 2        # is this a 2?
3   jne item_3   # if not, skip this clause
    ...          # execute the print("two!")
    jmp exit     # ...and skip to the end
  item_3:
2   cpi 3        # is this a 3?
3   jne item_4   # etc
    ...
    jmp exit
  item_4:
2   cpi 4
3   jne exit
    ...
  exit:

So, there's the initial setup cost of loading i into register A. That's always going to be there, so let's ignore it. Then there's the cost of checking A for the current value, which is five bytes. And finally there's the cost of jumping to the end of the block once the clause I want has been executed; this is always going to be there, so let's ignore that too.

This means that this code has a cost per item of five bytes.

The 8080 is fundamentally an 8-bit machine with a few 16-bit opcodes. Luckily, I can use one of these to do a 16-bit comparison, so the 16-bit version of the same routine looks like this.

  entry:
    lhld i       # load the value of i into HL
    xchg         # swap HL and DE
  item_2:
3   lxi h, -2    # load -2 into HL
1   dad d        # HL := HL + DE
1   mov a, h     # move high byte of HL into A
1   or l         # or with low byte of HL
3   jne item_4   # if not zero, skip this clause
    ...
    jmp exit
  (skipped for brevity)

I'm using the 8080's ability to add two 16-bit registers together to add the negative of the number I want to test for to the register. Then, I logical-or the two bytes together to test if it's zero or not.

This ends up with a cost per item of nine bytes, which seems ridiculous. Surely using jump tables can improve on that. (Especially with 32-bit values --- the cost per item there is an agonising eighteen bytes!)

Jump tables, at a first glance

This is what I was expecting from the jump table:

  entry:
    lda i
    (load and jump to the right slot here)
  table:
2   dw address_of_2_clause
2   dw address_of_3_clause
2   dw address_of_4_clause
  address_of_2_clause:
    ...
    jmp exit
  address_of_3_clause:
    ...
    jmp exit
  address_of_4_clause:
    ...
  exit:

Provided the indices that I'm testing for are contiguous, which they are in this case (2, 3, 4), I can calculate the offset into the jump table with (index-minimum) * @bytesof intptr. This won't work for sparse indices, of course. I still need to use if-chains for those. Also, the code above needs to check for an out of bounds index, that is, one which isn't in the jump table, and branch to exit if it finds one. (Actually in real life case statements typically have a when else clause for this.)

The beauty of this is that it doesn't matter how wide my index is. Even for 32-bit values, the jump table remains an array of two-byte addresses, so the cost per item remains two. Two is a lot better than 18, or 9, or even 5.

Right?

You can't avoid the overhead

What I didn't realise, because the if-chain implementation didn't have it, is that there's actually two types of overhead to take into account.

What I have so far is per-item cost, which is the only one the if-chain implementation had.

What I'm not taking into account is the setup cost, which is the size of the code needed to actually work the jump table. This happens once per case statement.

What's the setup cost for an 8080 jump table? For a byte, it looks like this:

  entry:
    lda i        # load the index
2   sui 2        # the first item has the value 2
2   cpi 3+1      # there are three items
3   jc exit      # if the value is out of bounds, exit
1   mov l, a     # move the index to the low byte of HL
2   mvi h, 0     # load the high byte of HL with zero
1   dad h        # add HL to itself to double it
3   lxi d, table # load the address of the jump table
1   dad h        # add the doubled index to the jump table address
1   mov a, m     # load the low byte of the routine address
1   inx h        # increment HL
1   mov h, m     # load the high byte of the routine address into HL
1   mov l, a     # move the low byte into HL
1   pchl         # finally, jump to the routine

That's 20 bytes! (It's possible to reduce this by using helper routines, but the same logic applies even if the setup code is smaller.)

But I only need this once per case statement, right? And the per-item cost for the if-chain implementation is five bytes, but for this it's two bytes, so I'm saving three bytes per item. So, provided my case statement has more than 20/3 = 6 items, then I still come out ahead. Right?

It turns out that most case statements in the compiler are either non-contiguous or have fewer than six items.

Trying to have it both ways

After figuring this out I swore a lot, and then added some logic to fall back to the if-chain if there weren't enough items in the jump table to make it worthwhile. The result was still bigger than the original code.

To explain this, I need to talk about the implementation a bit. The Cowgol compiler is a one-pass statement-at-a-time compiler: it sees a statement, it generates code for it, it moves on to the next statement. It tries not to keep anything in memory (because there is usually hardly any). This means it has to emit machine code in the same order it sees it in the source file.

So, for our case statement, which looks like this:

case i is
  when 2: print("two!");
  when 3: print("three!");
  when 4: print("four!");
end case;

...then it has to produce the three print statements in that order. It can't buffer them to emit them later.

This means that when generating the code for the case statement, then if I want to generate a jump table I have to do this after the clauses have been compiled --- because the compiler doesn't know how many clauses there are until that point. This isn't a problem with an if-chain, because the comparison code gets emitted before each when statement. But for a jump table, then the compiler have to record which clauses have which values, so I can construct the jump table at the end.

The consequence of this is that even if the compiler doesn't want to generate a jump table, it still has to generate the comparison logic at the end. It doesn't know whether a jump table is worth it until all the when statements have been seen.

So, now my if-chain looks like this:

  entry:
    lda i
3   jmp start
  item_two:
    ...
    jmp exit
  item_three:
    ...
    jmp exit
  item_four:
    ...
3   jmp exit
  start:
    cpi 2
    je item_two
    cpi 3
    je item_three
    cpi 4
    je item_four
  exit:

It's the same code, in a different order, but there are two extra instructions. One is to jump from the beginning of the case statement to the actual logic, and the other is to jump from the last item to the end of the case statement --- in my original if-chain implementation, the last item could just fall off the end.

This is adding an extra six bytes of setup cost, even if I'm not actually using a jump table. That's worth one and a bit when statements.

Given that the compiler, which is the program I'm most interested in, mostly has very short case statements, the consequence is that trying to use jump tables at all causes worse code --- even if the program never contains a jump table! Combine this with the 200-300 extra bytes of compiler needed for the jump table logic and the result is a net loss.

It gets worse

I've been using the 8080 as an example processor for a reason, which is that it has fixed-size jump instructions. Let's consider the 6502 instead.

The same case statement on the 6502 looks just the same.

  entry:
    lda i        # load the value being checked into a
  item_2:
2   cmp #2       # is this a 2?
2   bne item_3   # if not, skip this clause
    ...          # execute the print("two!")
    jmp exit     # ...and skip to the end
  item_3:
2   cmp #3       # is this a 3?
2   bne item_4   # etc
    ...
    jmp exit
  item_4:
2   cmp #4
2   bne exit
    ...
  exit:

The difference is that the branch instructions are only two bytes long. The 6502 and the Z80 both have optimised short-distance jumps, which the 8080 doesn't: if the target address is within -128..127 bytes of the current address, then there's a two-byte branch instruction available.

This reduces the cost per item from five bytes on the 8080 to four bytes, provided the code in the when clause is short enough. This is nearly always the case. So even in an ideal world, the jump table's two bytes per item cost is competing with a four bytes per item cost that has no overhead.

Out of interest, here's what a jump table implementation would look like on the 6502:

  entry:
    lda i          # load the value being checked into a
1   sec            # set carry
2   sbc #2         # subtract minimum
2   cmp #3+1       # bounds check
2   bcc exit       # exit if out of bounds
1   tax            # copy A to X
3   lda tablehi, x # load high byte of routine address
1   pha            # push to stack
3   lda tablelo, x # load low byte of routine address
1   pha            # push to stack
1   rts            # pop address from stack and jump

That's 17 bytes. Including the six from above due to having the entry code after the jump table gives a total of 23, which is the equivalent of nearly six if-chain when statements.

At this point it's just not worth even trying any more.

What can be done

There are, in fact, some optimisations which can be done.

One thing is that because each when clause in an if-chain has its own comparison, I can optimise the comparison for the value. On the 8080, a 16-bit comparison looked like this:

3   lxi h, -2    # load -2 into HL
1   dad d        # HL := HL + DE
1   mov a, h     # move high byte of HL into A
1   or l         # or with low byte of HL
3   jne item_4   # if not zero, skip this clause

The value we're actually testing for is 2, which is small (having a high byte of zero). This code can be rewritten as:

2   mvi a, 2     # load 2 into A
1   sub e        # subtract the low byte of the index
1   or d         # logical or with the high byte
3   jne item_4   # if not zero, skip this clause

That's seven bytes, saving two bytes. It doesn't seem like much, but 16-bit case statements predominantly use small values, and it adds up quickly. Importantly, it's also easy to implement.

This kind of optimisation is particularly true for 32-bit comparisons:

2   mvi a, 2     # load 2 into A
1   sub e        # subtract the low byte of the index
1   or d         # logical or with the high byte
1   or c         # ...and with the high word in BC
1   or b
3   jne item_4   # if not zero, skip this clause

That's nine bytes, which is half the 18 bytes a 32-bit comparison would normally take.

The 6502 isn't amenable to this trick because it can't keep a 32-bit value in registers. Here, the index value has to be written to memory, and compared like this:

2   ldy #-4        # loop counter
  loop:
3   lda index+4, y # load a byte of the index
3   cmp const+4, y # compare with a byte of the constant
2   bne item_4     # if not the same, skip this clause
1   iny            # increment y
2   bne loop       # repeat loop

That 13 bytes, plus four bytes for the constant, making 17 bytes per item --- actually one byte fewer than on the 8080, but still nasty. Here, what I've done is this:

3   jsr _when4     # call a helper routine
4   dl 2           # the value being compared against
2   bne item_4     # if not the same, skip this clause

The helper routine pops the return address from the stack, compares it against the index (which is at a known memory location), advances the return address and returns with the flags set. This reduces the cost per item to nine bytes, at the expense of an annoyingly complex 40-byte helper routine. That is, at least, included once per program, so you only need five 32-bit when clauses throughout the program to make it worth having.

Final thoughts

So the moral is, believe in yourself and... no, wait. The moral is that what's obvious isn't necessarily true, and that simpler is frequently better than what's actually better.

I'm now wondering if there are other optimisations I've added which are counterproductive. Maybe I should start removing rules from the code generator and seeing if the generated code gets better...