11.05.08
Revisiting Parallax
I previously wrote about changing the blitter to support pixel-perfect parallax scrolling by allowing for variable-length blitter code fragments. I justified this approach on the grounds that the overhead was reasonably amortized over each block and the improvement in quality of the output more than made up for any overhead.
Well, as is often the case when I venture to think, I was wrong.
The problem is not so much in the blitter code itself (although we will address that in a bit), rather it is the additional overhead imposed by increasing the storage requirements of the blitter code buffer. I lost the ability to patch using a single BRA instruction, the code buffer is split up among multiple banks so every offset calculation becomes more complicated, there is no guarantee of consecutive lines being in the same bank, etc, etc, etc. The can of worms was quite large.
Fortunately, after a bit of pondering, I have an improved solution!
The core constraint that makes the current set up so nice and fast is that each code fragment for transferring a word of graphics data to the SHR screen is exactly 3 bytes. What we want is a 3-byte code fragment that can take care of the masking between BG0 and BG1. Of course there is no such code sequence, so we must get a bit clever.
I had thought about breaking out of the blitter code to handle these complicated cases before, but I never came up with an acceptable solution. Any sort of JSR/JSL instruction couldn’t work because the stack is mapped to the graphics screen. When the processor pushed the return address on the stack, we would get graphical corruption. No, the only way out is an unconditional jump. But then there is no way to get back……or is there?
Here is the code fragment I have in mind. I’ll explain in detail afterward.
blitter: PEA $1234 ; push BG0 word LDA (02),y PHA ; push BG1 word JMP $FXXX ; Handle BG0/BG1 masking rtn: PEA $ABCD ... FXXX: LDA <baseAddr ; Get the base address for this line's code fragments STA >*+5 ; Patch the high word of the JML instruction JML $0000XX ; Jump to the proper code fragment 0000XX: LDA (04),y AND #MASK ORA #DATA PHA JML >rtn ; Return to the next instruction
Ok, time for some explaining. First, one has to realize that the blitter code buffer only takes up about 55,000 bytes of memory. So, there is almost 10K of memory free near the top of the bank. Our plan is to fill the area with a total of 84 trampoline functions — one for each possible offset. These function combine the base address of the current blitter scan line with a fixed offset in order to dispatch to a small code fragment in an arbitrary memory bank. There is one code fragment for every possible word in the blitter, so we exploit this 1:1 relationship to hard-code a JML that returns to the address immediately following the initial JMP instruction.
The trampoline codes only take up a total of 840 bytes, leaving nearly 9,000 bytes free. We can put that memory to good use by caching some of the BG0/BG1 masking code in the same bank as the code buffer, thus saving the trampoline code and a exchanging a JML for a JMP instruction. Ideally, we would like to use this cache dynamically so that if there were fewer than a couple of hundred BG0/BG1 maskig operations they would always use this shortcut. It is easy to fill this cache, but would require a fair amount of work to remember which cache locations are in use by each block so they can be freed once a new block is drawn.
All in all, I like this solution quite a bit. It preserves the 3-byte structure of the BG0 code buffer and only adds extra overhead for precisely those operations that require more work. The use of trampoline code adds about 20 cycles of overhead, but that is more than acceptable given the amount of overhead saved by keeping the implementation simple. This approach also generalizes well since the trampoline dispatch effectively removes any memory constraints. This opens the door for some of the more sophisticated layering methods I’ve talked about.
Finally, we gain the possibility of caching a subset of the BG0/BG1 code fragments, which lowers the overhead to only 6 cycles (only 2 JMP instructions). Perhaps the potential savings of about 10,000 cycles per frame is worth a few hundred cycles of overhead when drawing a new tile. It probably is.
This represents the final piece of puzzle for completing GTE to the level of quality that I always envisioned. Now, if I could just find some free time actually implement these ideas rather than just think about them.