11.07.08

Revising Parallax

Posted in Apple ][ at 1:16 pm by site admin

Just a quick note. I thought about what I wrote yesterday, and it turns out that the trampoline code cannot work because the 1-byte fixed offset in the JML instruction does not give enough range to reach all 84 code blocks. This could be coded around by masking the JML address field with the baseAddr value, but that would slow things down.

Here is how the out-of-line code should work.

blitter:
   PEA $1234   ; push BG0 word
   LDA (02),y
   PHA         ; push BG1 word
   JMP $FXXX   ; Handle BG0/BG1 masking
   PEA $ABCD
   ...

FXXX:
   LDX <baseAddr         ; Get the address of this line
   LDA (OFFSET),y
   AND >XX0000+OFFSET,x
   ORA >YY0000+OFFSET,x
   PHA
   TXA
   LDX <xsave            ; Restore the X register
   ADC #OFFSET           ; Compute the return address (carry must be clear)
   STA patch+1
patch:
   JMP $OFFSET           ; Return to the next instruction
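
As a sanity check on the handler above, here is a rough Python model of the two things it computes: the word that gets pushed and the return address that gets patched into the JMP. The function names and the sample values are made up for illustration, and the model assumes 16-bit registers and a clear carry flag going into the ADC.

```python
def merge_word(bg1_word, mask, data):
    """Model of LDA / AND / ORA: keep the BG1 bits selected by the mask,
    then overlay the BG0 data bits."""
    return ((bg1_word & mask) | data) & 0xFFFF


def return_address(base_addr, offset):
    """Model of TXA / ADC #OFFSET with carry clear: a 16-bit add that wraps."""
    return (base_addr + offset) & 0xFFFF


# Hypothetical values: BG1 shows through the low byte, BG0 covers the high byte.
pushed = merge_word(0x1234, 0x00FF, 0xAB00)
target = return_address(0x8000, 0x0123)
print(hex(pushed), hex(target))
```

If the carry happened to be set on entry, the real ADC would land one byte past the intended instruction, which is why the model states the clear-carry assumption explicitly.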

This scheme is especially nice because the mask and data are stored in their respective banks at the same address as the instruction where they would have been put anyway. Obviously this makes the tile drawing code more efficient as well, since it is already computing the address of the tile within the BG0 code buffer. If the tile is a BG0/BG1 mix, then it just has to store the data in a different bank.
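
The addressing can be sketched in a few lines of Python. This is only a model of how the long indexed operands (>XX0000+OFFSET,x) resolve; the bank numbers and addresses are placeholders I picked for the example, not the real layout.

```python
def long_indexed(bank, offset, x):
    """Effective address of an absolute-long,X operand, treated as a
    24-bit sum of the bank base, the fixed offset and the X register."""
    return ((bank << 16) + offset + x) & 0xFFFFFF


# Hypothetical values for illustration only.
MASK_BANK = 0x02      # stands in for bank XX
DATA_BANK = 0x03      # stands in for bank YY
base_addr = 0x8000    # address of this line in the BG0 code buffer
offset = 0x0123       # offset of the instruction within the line

code_addr = (base_addr + offset) & 0xFFFF
mask_addr = long_indexed(MASK_BANK, offset, base_addr)
data_addr = long_indexed(DATA_BANK, offset, base_addr)

# Only the bank byte differs; the low 16 bits match the code address.
print(hex(code_addr), hex(mask_addr), hex(data_addr))
```

So the tile drawing code can reuse the 16-bit address it already computed and change only the bank it writes to.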

Also, compared to the previous version, only 84 copies of this code fragment are needed, since it computes the correct return address on the fly. I’m planning to use the extra space in the bank to implement a cache of BG0/BG1 code fragments that are specialized for a single word. Their code is the optimal sequence:

   LDA (OFFSET),y
   AND #MASK
   ORA #DATA
   PHA
   JMP rtn
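
To make the patching concrete, here is a sketch of how such a fragment could be assembled into bytes. It assumes a 16-bit accumulator (so the immediates are two bytes each) and that the JMP is the 3-byte absolute form; encode_fragment and the sample operands are hypothetical, not taken from the actual tile code.

```python
def encode_fragment(offset, mask, data, rtn):
    """Assemble LDA (offset),y / AND #mask / ORA #data / PHA / JMP rtn
    using the standard 65816 opcodes, 16-bit immediates, little-endian."""
    def word(value):
        return bytes((value & 0xFF, (value >> 8) & 0xFF))

    return (
        bytes((0xB1, offset & 0xFF))  # LDA (dp),y
        + b"\x29" + word(mask)        # AND #imm16
        + b"\x09" + word(data)        # ORA #imm16
        + b"\x48"                     # PHA
        + b"\x4C" + word(rtn)         # JMP abs
    )


fragment = encode_fragment(0x02, 0x00FF, 0xAB00, 0x8123)
print(fragment.hex(" "), len(fragment))
```

Under these assumptions each fragment is 12 bytes, so the four operands are the only bytes the tile drawing code would need to touch.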

Obviously the offset, data, mask and return address are all filled in by the tile drawing code. There is enough room for about 400 of these cached code fragments, and each one saves 25 cycles. Since all of the cache entries are used before the generic code fragments, we are saving up to 400 * 25 = 10,000 cycles per frame. At 20 fps that represents almost 10% of the available CPU cycles.
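
The arithmetic is easy to replay. The entry count, per-entry saving and frame rate come from the paragraph above; the clock figure is my assumption (the nominal 2.8 MHz fast-mode clock), and the real cycle budget is lower once slow-side memory accesses are counted.

```python
CACHE_ENTRIES = 400            # from the text
CYCLES_SAVED_PER_ENTRY = 25    # from the text
FPS = 20                       # from the text
CPU_HZ = 2_800_000             # assumption: nominal fast-mode clock

saved_per_frame = CACHE_ENTRIES * CYCLES_SAVED_PER_ENTRY
saved_per_second = saved_per_frame * FPS
share = saved_per_second / CPU_HZ

print(saved_per_frame, saved_per_second, f"{share:.1%}")
```

Against the nominal clock the share works out to roughly 7%; a smaller effective cycle budget pushes it closer to the figure quoted above.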

There are a lot of other micro-optimizations I’m working on, too. For instance, the inner blitter loop patches and restores the BG0 code buffer as each scan line is processed. I’m planning to change this to use a series of unrolled loops to amortize the cost over multiple lines. I also have a trick to significantly speed up the odd-aligned blitter and amortize some of the other setup code over multiple scan lines. These changes could shave up to 60 cycles of overhead off the inner loop and result in a 5-10% boost to the maximum FPS.
