04.10.08
The Pursuit of Pixel Perfection
So the alliterations in these posts are probably wearing thin at this point, but please bear with me for a bit. During my biweekly flight to the West Coast this week (thank goodness I didn’t fly American Airlines!) I had some time to contemplate the limitations of GTE compared to the graphical prowess of SNES-level hardware. One of the biggest technical limitations of the current design is that the selection between BG0 and BG1 can only be done on a word basis. So, as a designer, you are left with the limitation that every group of 4 pixels must be totally in one of the two layers.
This limitation has always bothered me for two reasons: 1) I don’t like limitations, and 2) it makes the graphics look really ugly. I’ve always wanted to have pixel-level parallax granularity for the final release of GTE, and now I think I know how to pull it off. First, I should explain the reason behind the current limitation.
When GTE blits the playing field to the graphics screen, it does so with zero overdraw by executing a single code buffer that contains, for each word, either a PEA $1234 instruction for drawing BG0 data, or an LDA (00),y PHA pair that copies data from the BG1 data bank. Either choice takes up 3 bytes, and the code buffer is 84 words wide. This is the minimum width necessary for supporting scrolling with 16×16 tiles, since the graphics screen is 80 words wide (320 pixels) and an extra block is needed to draw both “edges” of the playing field.
So, 84 words times three bytes per instruction results in 252 bytes of code per line just to copy data. There is a JMP instruction at the end of the code that jumps back to the beginning of the line, so that brings the total to 255 bytes. In order to draw the correct portion of each line, GTE patches each code line at runtime to branch out at the appropriate time. Currently, we use a BRA instruction since it only requires a single load and store to patch. The BRA instruction barely has enough range to branch out of the code line. If the code were extended by just one byte, it would no longer work.
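The arithmetic above can be double-checked with a few lines of Python (used here purely for illustration; the engine itself is 65816 assembly). The constant and function names are made up for this sketch. The only hardware fact it leans on is that BRA takes a signed 8-bit displacement measured from the instruction that follows it. Where GTE actually places its branch targets is not stated above, so the sketch only reports what a patched BRA could reach.

```python
WORD_SLOTS = 84        # 80 visible words plus one extra 16-pixel block (4 words)
BYTES_PER_SLOT = 3     # PEA $nnnn, or LDA (dp),y followed by PHA
JMP_BYTES = 3          # the JMP that wraps back to the start of the line


def line_bytes(slots=WORD_SLOTS):
    """Total size in bytes of one code line, including the trailing JMP."""
    return slots * BYTES_PER_SLOT + JMP_BYTES


def bra_reach(offset):
    """Lowest and highest byte offsets a 2-byte BRA placed at `offset` can target."""
    next_pc = offset + 2                 # displacement is relative to the next instruction
    return next_pc - 128, next_pc + 127  # signed 8-bit displacement


size = line_bytes()
middle = (WORD_SLOTS // 2) * BYTES_PER_SLOT  # byte offset of the middle word slot
low, high = bra_reach(middle)
print(f"line is {size} bytes; a BRA at byte {middle} reaches bytes {low}..{high}")
```

A BRA patched into the middle slot tops out at the first byte past a 255-byte line. If the exit code sits directly before and after the line (an assumption on my part), that is consistent with the claim that a single extra byte would put it out of reach.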
In order to get the pixel perfect parallax I’m after, I’m willing to make some compromises in the implementation of the blitter, but there is one iron-clad rule: do not slow down the fast path! Whatever runs quickly now should run quickly in the new design, too. With that in mind, let’s proceed.
My plan is to break up the code buffer to be more block-oriented. The code will be split up among multiple banks and a total of 9 bytes will be allocated per word, although most bytes will be unused. In addition to the PEA and LDA/PHA instruction sequences, I need to introduce the following code sequence for the cases where a BG0 word is partially transparent.
    lda (00),y
    and #MASK
    ora #DATA
    pha
Good. 9 bytes, nothing complicated. This is just the standard “mask and draw” method of blitting. The real work comes in dealing with the previously fast instructions, which now must contend with 6 empty bytes. The simple solution is to add a branch instruction to skip to the next code fragment. This would work, but it is costly: the cycle count for a solid BG0 word jumps from 5 to 8 cycles. Not good!
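To make the byte budget concrete, here is a small Python tally of the three fragment types against the 9-byte slot. The instruction sizes assume a 16-bit accumulator (so the immediate operands are two bytes), and the dictionary keys and helper names are invented for this sketch. The only cycle figures used are the ones quoted above: 5 for the PEA, 8 once a branch is tacked on.

```python
SLOT_BYTES = 9  # bytes reserved per word in the proposed layout

# (instruction, size in bytes), assuming a 16-bit accumulator
FRAGMENTS = {
    "solid_bg0": [("pea $1234", 3)],
    "bg1": [("lda (00),y", 2), ("pha", 1)],
    "masked": [("lda (00),y", 2), ("and #MASK", 3), ("ora #DATA", 3), ("pha", 1)],
}


def fragment_bytes(kind):
    """Bytes of real code in one fragment."""
    return sum(size for _, size in FRAGMENTS[kind])


def padding(kind):
    """Unused bytes left in the 9-byte slot after the fragment."""
    return SLOT_BYTES - fragment_bytes(kind)


PEA_CYCLES = 5                          # solid BG0 word today
PEA_PLUS_BRANCH_CYCLES = 8              # same word with a per-word branch
BRANCH_CYCLES = PEA_PLUS_BRANCH_CYCLES - PEA_CYCLES

for kind in FRAGMENTS:
    print(f"{kind:10s} {fragment_bytes(kind)} bytes of code, {padding(kind)} bytes unused")
print(f"a per-word branch adds {BRANCH_CYCLES} cycles to every short fragment")
```

Only the masked fragment fills its slot; the two fast fragments each leave 6 bytes to be skipped, which is where the per-word branch penalty comes from.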
Instead of handling the alignment on a word basis, we’ll do it on a block basis. For 4×4 blocks this reduces to the same thing, but if you’re using 4×4 blocks, performance is probably not your most critical concern. If we consider a 16 pixel wide block, then we can pack 4 word fragments together and only add padding at the end if needed. For example:
    pea $1234
    lda (02),y
    pha
    lda (04),y
    and #$00FF
    ora #$7800
    pha
    pea $9ABC
    bra next
Not bad! We’ve only added an additional 3 cycles of overhead per 4 words, less than a cycle per word, which is more than compensated for by the fact that I make sure my direct page accesses are page-aligned.
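The packing rule can be sketched in Python as follows. This is a rough model under the stated assumptions (9 bytes reserved per word, one trailing BRA of 2 bytes and 3 cycles only when the block does not fill its reservation), and `pack_block` and its fragment names are hypothetical, not part of GTE.

```python
SLOT_BYTES = 9                      # bytes reserved per word
BRA_BYTES, BRA_CYCLES = 2, 3        # trailing branch that skips the padding
FRAGMENT_BYTES = {"pea": 3, "bg1": 3, "masked": 9}


def pack_block(kinds):
    """Return (bytes used, padding bytes, extra cycles) for one block of words.

    Fragments are packed back to back. A single BRA is appended only when
    the block does not already fill its reserved space; every fragment size
    is a multiple of 3, so any gap is at least 3 bytes and the BRA fits.
    """
    budget = SLOT_BYTES * len(kinds)
    used = sum(FRAGMENT_BYTES[k] for k in kinds)
    extra_cycles = 0
    if used < budget:
        used += BRA_BYTES
        extra_cycles = BRA_CYCLES
    return used, budget - used, extra_cycles


example = ["pea", "bg1", "masked", "pea"]   # the block shown above
used, pad, extra = pack_block(example)
print(f"{used} bytes of code, {pad} bytes skipped, "
      f"{extra / len(example):.2f} extra cycles per word")
```

For the example block this works out to 0.75 extra cycles per word, against 3 per word if every short fragment carried its own branch. A block made entirely of masked words fills its reservation exactly and needs no branch at all.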
Overall, I’m pretty happy with the design on paper and will take a shot at implementing it. There is some additional overhead, of course. The tile blitters have to do extra work in order to update some internal tables that identify the address of each block in a line. Dispatching to and from a code buffer that spans multiple banks may require some 24-bit address manipulation, but nothing too serious. Most importantly, the graphical quality of the engine is improved and the performance does not suffer. Using complex BG0 masks will slow things down, of course. But the execution time is directly tied to the complexity of the scene, which is about as good as one can hope for.