Download sources for this chapter


This is probably the least exciting chapter in the series. Feel free to skip it, unless you are into random numbers, in which case you can just skip to the final section about random numbers. The first section might be of some relevance as well, where we create a static allocator for RAM, but this tool is not going to be used much in the subsequent chapters. The middle section has almost nothing to do with Megadrive. I just plan for a demo effect and anticipate how many cycles I can spare for its implementation. The effect was never implemented, so feel free to skip this section as well.

I just finished watching these and I feel highly motivated. It’s a good thing to know you’re not the only psycho around:

Handling RAM and variables

It’s becoming apparent that we will need some variables in the near future. This can be as easy as pointing to an address from FF0000 onward, but I think such an approach will be troublesome later, when we have a lot of variables. Furthermore, we can’t allocate variables in the assembly source, because our binary will be mapped to ROM, so we can only put constants there.

I decided to write yet another tool. Let’s call it ‘stalloc’, for “static allocator”. The tool will parse a simple source file, like the following, and output a file containing assembler symbols assigned to their corresponding addresses:

; Global variables
-global { ; The minus sign means "bypass namespace prefix"
variable1 ; All variables are words
array[23] ; array elements are words

level1 { ; This is a namespace called "level1"
bossPosition[] = {0, 0} ; Variables and arrays can be initialized
bossHP = $100 ; Support for hex

level2 {
; ...
; ...

bonusLevel {
; ...
; ...

#mutex1 { ; This is a mutually exclusive group
bonusLevel ; Namespaces can be listed here
level* ; Why not? :)

The idea is that games and demos are usually composed of scenes, or levels where only one of them is active at any given time. So we can put all of them in the same mutually exclusive group, and they will occupy the same block of memory.

Namespaces that do not exist in an exclusion group will occupy their own block.

Also, for our convenience while looking at the memory contents, we’re going to align all blocks to 16 bytes (8 words).

The variables will be initialized from the source file to the value we want. We’ll just pass through the values to the assembler, so all values the assembler can recognize will be valid. The tool will output a macro that will initialize all variables to our desired values.

We’ll call the macro from our systemInit.x68, and then we’ll have variables to put our program state in, without caring where they actually are in memory.

Finally, it will be nice to enforce a minimum stack size, say 512 bytes, which will permit a simple recursion of depth 64. That should be enough.

So, let’s start. I’m doing this one in perl, which is my tool of choice when it comes to string manipulation.

And, here’s the tool’s output:

; ------------------------------------------------
; malloc.x68 - RAM allocation and initialization
; Automatically generated on 24/6/2016 16:41
; ------------------------------------------------
defc RAMStart = $FF0000

; Namespace global
defc hScroll = RAMStart + $0

; Variable initialization macro
macro initializeVariables
move.l #0, hScroll

Now, let’s use the ‘hScroll’ variable to hold our horizontal scroll value, instead of the register d1:

(in app.x68)

; appUpdate is called once per frame, when the active scan starts
; ------------------------------------------------------------------------
add.w #-2, hScroll


; appRender is called once per frame, when VBlank starts
; ------------------------------------------------------------------------
VDPPointToVRAM $F800
VDPWriteToData hScroll

Handle DMA bit not being reset when DMA copy ends

It’s a good thing I didn’t look into this right away, because during the previous day, I happened to catch a warning in the docs. It says that, due to a hardware “feature” (the nerve!), DMA transfers must be initiated from RAM. This means that you either start a transfer from code that is executed from RAM, or that the last write to the VDP control port is copied from a RAM address and not the ROM.

Now the easiest of course would be to try the second alternative. Since I don’t want to allocate a variable for this, I’m going to use the stack. Put the value I want to write to the stack, write it to the control port, and then pop it.

So, here’s how the last instruction ended up:

; The last write that initiates the DMA transfer must be from RAM (go figure)
move.w #(((destinationAddress >> 14) & 3) | $80), -(sp)
VDPWriteToControl (sp)
add.l #2, sp

The DMA bit is not cleared though. This may mean more emulator bugs. I’m just going to reset it manually, and set the machine to our default state:

; Reset #1 to the default value
VDPSetRegister 1, $4C ; R#01 : Display=1, VInterrupt=0, DMA=0, Vertical 30 Cell mode (PAL)

Also, let us wrap the entire DMA copy thing in a macro, as it seems to have matured enough:

; Fill VRAM with our cell data
VDPCopyRAMToVRAM GraphicsData, 0, $5000

Now ain’t this cleaner 🙂

Better graphics creation tool

From our previous experience while trying to create some graphics, it became obvious that we won’t get away by using bitmaps. Furthermore, I have checked some tile editors, but they are not exactly what you call “user friendly”. For better or worse, I don’t have artistically what it takes to fill the screen with beautiful graphics, and I certainly don’t think I can convince an artist to work with these tools.

I’m gravitating towards a new tool, which will take a carefully created bitmap (so that it reuses cells), and try to compress it, by identifying repeating cells.

As if this wasn’t hard enough to begin with, the tool must also identify mirrored cells (in both horizontal and vertical axes), and maybe it would also make sense to add some tolerance for one varying pixel or two, so we can trade image quality with compression rates.

I’m not going to make this tool right now, because it’s very complicated and I don’t currently have a use for it. I’m just leaving it as a note, in order to have the specs ready, when (if) I’m going to need a good amount of graphics.

And this takes care of our TO-DO list.

Planning a demo effect

I have been thinking about this for quite a while and it may look fantastic or horrible, but it’s worth a try.

This is a classic fake 3d effect I first saw in the arcades some time in late 80s. Sega’s “After Burner” had it in its intro. Here’s the mega drive version:

They apply 3D math to put the sprites in place, and also sort the sprites, so that the ones closer to the viewer cover the others.

My idea now, is to create a rotating sphere of spheres using this technique. But since that would be extremely lame as a demo effect (even for this hardware), I think I can make it a bit more interesting by adding a depth-of-field effect.

Basically, what we need in order to achieve this, is a series of ball sprites that are progressively blurred. So, the sprites that lie around our “camera” focal plane will be crisp, while the other sprites will be more and more blurred, as they move away from the focal plane.

Of course, this is going to look like shit, with just 1-bit transparency, so we must think of another way. Right now, my idea is to export a separate 8-bit alpha channel for each sprite. The alpha channel will be baked every frame to 1bpp by the cpu like this:

Every frame:
    Every pixel:
        Generate a random 8-bit number
        Compare the random number with the 8-bit alpha value for the pixel.
        If the random number is greater, make the pixel transparent
        Update the cell in VRAM

So, every sprite will appear slightly different in each frame. It will have a different dithering pattern, and the frequency of turning on and off semi-transparent pixels will be proportionate to their transparency.

It will obviously flicker as hell, but I think this is acceptable for this console generation. I remember the “Samurai Shodown” mega drive port used a flickery trick like this for player shadows. In that (awesome) game, the shadow from player 1 was visible in even frames and the shadow from player 2 in odd frames. The overall effect was that the shadows appeared to be semi-transparent, darkening the ground with a light flickering grey, but the region where the shadows overlapped appeared like a deforming pitch black shape, which made my heart skip a bit the first time I noticed it.

So, to make this effect we are going to need:

  • Some progressively blurred sprites
  • A random number generator
  • The best framerate we can get

Since we are extremely limited in terms of calculations per frame, and our plan involves per-pixel operations, I’m going to use extremely small sprites. Let’s start with 2×2.

Furthermore, we will take advantage of the cell mirroring the VDP does for free, so that we can update just one quarter of each sprite per frame. I think I can blur a 2×2 cell in GIMP and maintain the symmetry, by first blurring horizontally and then vertically. This way, the top-left cell will be symmetrical to the other 3 even after the blur.

We won’t need to do any pixel operations for the “focused”, crisp sprite. We can start with one blurred version and see how much of our frame time it consumes. Then we can add more blurred versions.

Some early estimations

Every blurred version will cost us 8×8 = 64 repetitions of the above pseudocode. If I remember correctly, we have about 20K “empty loop” repetitions during the active scan, so if we estimate this repetition to be 10x the duration of the empty loop (it’s a random number generation and a comparison versus an addition), we get to perform pixel operations on about 30 cells per frame.

Here is how 2×2 cell particles would appear on screen:


Which is ok-ish, but I’m not thrilled with the amount of pixels I can play with:


Since we estimate to have enough processing power, let’s try a larger sprite size. Say 4×4 cells. This is 32×32 pixels, and we only need to update the top-left 16×16 region because of the symmetry. Using the same estimation, we can afford about 8 such sprites per frame. I think we can get away with fewer, say 4 blurred versions, and save the rest of the cpu time for 3D rotation and sorting, which is not at all trivial.

And here is how the 4×4 sprites are supposed to look, in terms of size (blue blobs):


So, let’s hope our projections are correct (I think 10x is safe enough), and let’s create a symmetrical 4×4 sprite:


Aren’t I quite the artist! I added a hole in the center, which I think will make for a better depth-of-field effect.

Now that I look at it, maybe we should also update some transparent pixels on the “crisp” version of the sprite, just to add some antialiasing around the edges. That should make our render loop simpler as well.

Also let’s keep it as a note to add some palette rotation. The radial gradient just begs for it!

Luckily enough, our image came out to be symmetrical and divisible by 2 without any effort on our part. Now, we want to create the progressively blurred versions.

Since we are currently happy with the gradient (we don’t need any defined shapes in the sprite), I think I’ll skip blurring the color channels and just blur the alpha channel. For the gradient, the effect should look exactly the same.

And, after some tedious work of blurring and cropping, here are our bitmaps:


And now, for some random numbers…

Found this fast 8bit random number generator.

  • 3 additions
  • 3 xors
  • 1 shift

I ran some tests on the rng:


It spreads the random values uniformly, and it produces good noise, with no detectable patterns. The author states that it’s not good for cryptography, but it’s more than enough for our purpose.

So, here is the algorithm we need to implement:

// initialize (s1,s2,s3 are 3 seed bytes)
x = 0;
a = s1;
b = s2;
c = s3;

// calculate next random
a = (a^c^x);
b = (b+a);
c = (c+(b>>1)^a);

I’ll have to check again the C operator precedence (I’m not sure what comes first, addition or XOR), but before that, I realize this is a routine we’ll be calling too frequently.

Furthermore, the generator state is just 4 bytes, so we can fit it in our registers. I’m going to use 4 registers for now, even though we could fit the entire state in 1 32 bit register. Thankfully though, the 68K has plenty of registers and we are more interested in speed right now.

Another thing I want to achieve is make this a routine and not a macro, so it doesn’t get copied around multiple times, but, I don’t want to pay for pushing/popping the return address to/from the stack. The algorithm fits nicely in our registers and it would be a shame to spend 2 memory accesses for merely calling it!

Since the routine will not call any other routines, it’s safe to just put the return address to an address register (again, we’ve got plenty of them!). I think this is called a “pseudoroutine”. I may be wrong, but I like the name.

And because I may be creating more routines of this kind, I think I’m going to make a couple of macros, for calling/returning from “pseudoroutines”:

macro pseudoCall, addr
move.l #pseudoCall\@ , a6 ; a6 will hold the return address. We use '#' to take the label as a number and not what it points to in memory
jmp \addr
pseudoCall\@: ; \@ will create a unique id "_nnnnnn" (n=number), unique for every instance of the macro

macro pseudoRet
jmp a6 ; a6 better not have changed!

I’ve checked the C precedence. So, addition comes first and then XORing follows.

We are now almost ready to implement our random number generator in 68K assembly. If only we knew how to XOR two values…

… of course they had to name it EOR (Exlusive OR) …

Let’s start by assigning register names to the C variables. Following our “put stuff at the bottom” tradition, we’ll use d4, d5, d6 and d7. Also, we’re going to spread the calculations so that we have one operation per line:

; Seed random number generator, using a 32 bit seed stored in d0
; This needn't be a pseudoroutine, but just for symmetry with the other routine, we'll make it one
d4 = 0;
d5 = s1;
d6 = s2;
d7 = s3;

; Calculate next random number, it will be stored in d7
d5 ^= d7;
d5 ^= d4;
d6 += d5;

d4 <<= 9; ; Hack - shift d4 to the left, so we can use its 8 LSB for storing the intermediate value move.b d6, d4 ; - Tmp store d6 d4 >>= 1; ; - This will produce d6>>1 in the LSB

d7 +=.b d4; ; I wonder if I will be able to remember the symbolism later :)

d4 >>= 8; ; Hack - Revert d4 to what it was before

d7 ^= d5;

And… if we count bitwise operations to take as many cycles as additions, we are already past the 10x estimate. The compact C syntax can be misleading.

We could have made it a bit faster by using another register, say s3, to hold the intermediate value, but I’m already occupying half our data registers for the RNG and I don’t feel comfortable about it. For now I’m going to tell myself that the XORs cost less than the additions, and we’ll see later if that’s the case.

So, let’s convert that nonsense to 68K assembly. Bit shifts are restricted to a maximum of 8 bit positions per instruction, so I used another trick in place of the >>9. I shifted left 8 bits with instruction size of ‘w’ (word), so that our byte rests safely in the second least significant byte of the register. Then to calculate the >>1, I shift with instruction size of ‘b’ (byte), so that our stored byte remains intact. Good stuff.

; Seed random number generator, using a 32 bit seed stored in d0 (s0s1s2s3 high to low bytes)
; This needn't be a pseudoroutine, but just for symmetry with the following routine, we'll make it one
move.b d0, d4
lsr.l #8, d0
move.b d0, d5
lsr.l #8, d0
move.b d0, d6
lsr.l #8, d0
move.b d0, d7

; Calculate next random number, it will be stored in the low byte of d7
add.l #1, d4
eor.b d7, d5
eor.b d4, d5
add.b d5, d6

lsl.w #8, d4 ; Hack - shift d4 to the left, so we can use its 8 LSB for storing the intermediate value
move.b d6, d4 ; - Tmp store d6
lsr.b #1, d4 ; - This will produce d6>>1 in the LSB

add.b d4, d7 ; Yup! I remebered it!

lsr.w #8, d4 ; Hack - Revert d4 to what it was before

eor.b d5, d7


To test it, I produced 10 random numbers in both 68K assembly and the C implementation, using the same seed (all zeroes).

It may not seem like much, but this is actually the first algorithm I have ever implemented in an alien assembly, and I’m extremely happy my quick transcript from C worked right away!

Dawn is breaking.

Need… to… sleep…


Lost you just yet? I hope not. Keep on reading because the real action is about to start!



Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s