11.12.08

A final (really!) architecture

Posted in Apple ][ at 9:50 am by site admin

I’ve had a bit of time to ponder over the past weekend and have come up with the definitive, absolutely final architecture for rendering in GTE. This whole train of thought began once I abandoned the line-by-line approach I’ve been using since the beginning in favor of a more batch-oriented one.

The core idea of the new optimizations is to move the calculations as far out of the blitter as possible. Anything that is not critical to actually moving data to the SHR screen should be done outside of this code (preferably in a lazy, unrolled fashion). Since I want to draw multiple lines between disabling and re-enabling interrupts to amortize the “context switch” overhead, it makes sense to hard-code the setup and avoid referencing any memory during the actual blitting.

Here is a summary of the changes:

  • Move the BG1 address table into Bank 01
  • Move the Animated Tile data into Bank 01
  • Set the memory to R1/W1 rather than R0/W1 in the inner loop
  • Move the register initialization into the blitter core
  • Set the register initial values only when the global state changes

The net effect of these changes is that the core blitter code used to look like this:

   jmp $0000
line_start:
   pea $1234
   pea $1234
   ...
   pea $1234      ; Repeat 84 times
   jmp line_start
   jmp $0000

where $0000 contained code that returned control to the inner blitter loop. The new code is set up to draw multiple lines without any excess overhead. The code looks like this.

   ldx #$1234
   ldy #$5678
   lda #$9ABC
   tcs
   jmp $0000     ; jump into the entry point
   jmp next      ; jump to the next line
line_start:
   pea $1234
   pea $1234
   ...
   pea $1234      ; Repeat 84 times
   jmp line_start
   jmp next

next:
   ldx #$1234
   ldy #$5678
   lda #$9ABC
   tcs
   ...

The beauty of this arrangement is that most of the hard-coded values loaded into X, Y, the accumulator, and the entry address do not change on every frame. The X register values are totally invariant; they never change. The Y register values change only if BG1 is scrolled vertically. The accumulator values change only if the screen is physically repositioned on the SHR graphics screen, and the entry point address changes only if BG0 is scrolled horizontally.
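To make the "patch only what changed" rule concrete, here is a minimal sketch in Python rather than 65816 assembly. The field names and state keys are hypothetical and chosen purely for illustration; only the dependency between each value and the state that invalidates it comes from the description above.

```python
# Hypothetical model of the lazy patching rule: each hard-coded operand is
# rewritten only when the piece of global state it depends on has changed.
# All names here are illustrative, not taken from the GTE source.

DEPENDS_ON = {
    "x_operand": None,                # invariant: never repatched
    "y_operand": "bg1_scroll_y",      # BG1 scrolled vertically
    "a_operand": "screen_origin",     # playfield repositioned on the SHR screen
    "entry_address": "bg0_scroll_x",  # BG0 scrolled horizontally
}


def fields_to_patch(previous, current):
    """Return the operands that must be rewritten before the next frame."""
    changed = {key for key in current if previous.get(key) != current[key]}
    return sorted(field for field, state in DEPENDS_ON.items() if state in changed)


before = {"bg0_scroll_x": 0, "bg1_scroll_y": 0, "screen_origin": 0x2000}
after = {"bg0_scroll_x": 4, "bg1_scroll_y": 0, "screen_origin": 0x2000}

# A purely horizontal BG0 scroll should touch only the entry addresses.
print(fields_to_patch(before, after))
```

In this model, a frame in which nothing scrolls patches nothing at all, which is the case the new layout is meant to make cheap.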

So now, the inner blitter loop need only patch out a single JMP instruction to return control to the main loop. I figure that we should be able to draw 8 lines at a time without impacting interrupt latency very much. Notice that the per-line overhead (code that is not moving data to the graphics screen) is only 20 cycles. The amortized cost of the inner loop will add another 15 to 20 cycles. The net result is a significant boost in the framerate.
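One way to arrive at the 20-cycle figure is to add up the non-drawing instructions in the listing. The sketch below is in Python and uses the standard 65816 native-mode timings for 16-bit registers. It assumes that the three register loads, the TCS, and all three JMPs (entry, wrap-around, and next) execute once per line, and it reads the extra 15 to 20 cycles as a per-line figure; both are my assumptions, not something the listing guarantees.

```python
# 65816 native-mode cycle counts with 16-bit accumulator and index registers.
CYCLES = {"ldx #imm": 3, "ldy #imm": 3, "lda #imm": 3, "tcs": 2, "jmp abs": 3}

# Assumption: each of these non-drawing instructions runs once per line
# (three immediate loads, one TCS, and the entry, wrap-around and next JMPs).
PER_LINE = ["ldx #imm", "ldy #imm", "lda #imm", "tcs", "jmp abs", "jmp abs", "jmp abs"]


def per_line_overhead():
    """Cycles spent per line on code that does not push pixel data."""
    return sum(CYCLES[op] for op in PER_LINE)


def total_overhead(loop_cycles):
    """Per-line overhead plus the amortized inner-loop cost (assumed per line)."""
    return per_line_overhead() + loop_cycles


print(per_line_overhead(), total_overhead(15), total_overhead(20))
```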

How much, you ask? Well, the SMB Demo runs at a resolution of 256×160. I had previously benchmarked the blitter at around 22 frames per second. My theoretical calculations give a maximum framerate of 30.01 frames per second. Now, we’ll obviously never hit that in practice, but I’m pretty sure that we’ll see a 10 to 15 percent boost in the framerate.
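As a quick sanity check on those numbers, the following Python sketch applies the estimated boost to the measured framerate and compares the result with the theoretical ceiling. The constants are the figures quoted above; the function name is just for illustration.

```python
MEASURED_FPS = 22.0          # previously benchmarked blitter framerate
THEORETICAL_MAX_FPS = 30.01  # calculated upper bound


def projected_fps(boost_percent):
    """Framerate after applying a percentage boost to the measured value."""
    return MEASURED_FPS * (1 + boost_percent / 100)


for boost in (10, 15):
    fps = projected_fps(boost)
    share = fps / THEORETICAL_MAX_FPS
    print(f"{boost}% boost: {fps:.1f} fps ({share:.0%} of the theoretical maximum)")
```

Even the optimistic end of the estimate stays well below the theoretical maximum, which is consistent with not expecting to hit that figure in practice.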

All of these changes, along with the optimizations that come from drawing the sprites in their entirety, rather than line-by-line, should provide much smoother gameplay.
