If you’ve read the story on my programming background, you’ll know that most of my career has been spent in higher-level, web-focused technologies. This means there were a lot of gaps in my knowledge when it came to lower-level concepts, like kernel-space and assembly.

Wanting to remedy this, I decided to start a personal project to learn the ins and outs of computers the fun way: by building an emulator for hardware that never existed.

Specifically this led to creating a “fantasy game console” — which I’ve named Retrograde. The idea initially was a console from 1998 that supported limited APIs for 3D graphics, audio, and game save storage. I also wanted to create a visual shell for the console, similar to the one found on the Sega Dreamcast.

#Designing specifications for the “hardware”

I wanted something that was a bit more advanced than a Nintendo 64 or PlayStation, but would still feel era-appropriate for 1998. This was the same year the Dreamcast came out1, so I used various specifications from that console as a starting point.

I decided I didn’t want to have this console be in 64-bit territory yet, meaning it would be a 32-bit system. Initially it was a MIPS-inspired RISC architecture2, but I later changed it to a custom CISC architecture as I found x86’s approach to stack management more intuitive.

The console offered a fixed-function pipeline for 3D rendering3 with hardware-accelerated T&L4. This allowed for resolutions of 640x480 or 800x600 — with both resolutions supporting 16-bit color and an 8-bit depth buffer. For audio, the console supported 16-bit stereo audio with a 44.1kHz sample rate.

Finally getting into storage, the console supported 16MB of system RAM, with 4MB of video memory. It also allowed for save data to be stored in 4KB blocks within a “memory card”5 that supported up to 128 blocks (512KB total). For read-only media, the console had an API for an RG-ROM drive, which in reality booted up a custom ROM format designed for the console.

#How do we tell a computer what to do?

Implementing the CPU in code seemed like the most logical place to start — after all, it’s the central processing unit. I started by defining the opcodes and instruction set, which was the first major set of learnings that came out of this project. I understood opcodes in theory, as I’ve worked with things like bytecode interpreters before, but proper machine code is a similar, yet different beast.

The bytecode interpreters I had used in the past leveraged a full byte for the opcode, with the following bytes being interpreted based on which opcode byte was read. Machine code doesn’t work like that. Instead, you have “instruction words” that are a fixed length and contain all the information needed to execute the instruction.

As I mentioned previously, I used MIPS as a starting point for my CPU architecture, which influenced the initial design of my instruction set. Each instruction word was 16 bits long with the lower 6 bits of each instruction being the opcode6. This determined the categorization and layout for the rest of the instruction.

There are three word layout classifications for instructions. The names of these are based on the division of bits in the word after the opcode (which is always the first 6 bits).

  • 4-4-2: Most common layout representing two 4-bit register operands and a 2-bit flag. The first register is always intended to be used as the destination or target register, with the second register operating as a source or condition register.
  • 4-6: Represents a 4-bit register operand and a 6-bit immediate value. Used by the shift operations.
  • 10: Represents a 10-bit immediate value. Used by the system call instruction.

Here's an example of a 4-4-2 instruction word layout:

oooooo tttt ssss ff
^ ------------------- opcode is the first 6 bits
       ^ ------------ target register is the next 4 bits
            ^ ------- source register is the next 4 bits
                 ^ -- flags are the last 2 bits
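
To make the layout concrete, here is a minimal sketch of packing and unpacking a 4-4-2 word in C. It assumes the diagram is read most-significant bit first (opcode in bits 15..10), which may be mirrored relative to the real encoding, and the names `Insn442`, `decode_442`, and `encode_442` are illustrative rather than taken from the Retrograde source.

```c
#include <stdint.h>

/* Fields of a 4-4-2 instruction word. Names are illustrative only. */
typedef struct {
  uint8_t opcode; /* 6 bits */
  uint8_t target; /* 4 bits: destination/target register */
  uint8_t source; /* 4 bits: source/condition register */
  uint8_t flags;  /* 2 bits */
} Insn442;

/* Assumption: the diagram is read most-significant bit first, so
   oooooo tttt ssss ff maps to bits 15..10, 9..6, 5..2 and 1..0. */
static Insn442 decode_442(uint16_t word) {
  Insn442 insn;
  insn.opcode = (uint8_t)((word >> 10) & 0x3F);
  insn.target = (uint8_t)((word >> 6) & 0x0F);
  insn.source = (uint8_t)((word >> 2) & 0x0F);
  insn.flags = (uint8_t)(word & 0x03);
  return insn;
}

/* Inverse of decode_442: packs the four fields back into one word,
   masking each field to its width. */
static uint16_t encode_442(uint8_t opcode, uint8_t target, uint8_t source,
                           uint8_t flags) {
  return (uint16_t)(((opcode & 0x3F) << 10) | ((target & 0x0F) << 6) |
                    ((source & 0x0F) << 2) | (flags & 0x03));
}
```

The 4-6 and 10 layouts would follow the same shift-and-mask pattern with different field widths.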

#Registers, SIMD, and the stack

There are 16 general-purpose registers that can be used to store 32-bit integers or floating-point numbers. The registers are numbered from 0 to 15, and can be accessed using the $ prefix followed by the register number. For example, register 0 is $0, register 1 is $1, …

There is also a vector processing unit that can operate on 128-bit SIMD7 registers. The SIMD registers are numbered from 0 to 3, and can be accessed using the $v prefix followed by the register number. For example, SIMD register 0 is $v0, SIMD register 1 is $v1, …

#The instruction set

If you’re interested, here’s the instruction set available when creating software for Retrograde using Retrograde Assembly8.

| Instruction | ASM Example | Description |
| --- | --- | --- |
| Move | mov $1, $2 | Copies the value of register $2 into register $1. Equivalent to $1 = $2. |
| Add | add $1, $2 | Adds the values of registers $1 and $2, then stores the result in register $1. Equivalent to $1 += $2. |
| Subtract | sub $1, $2 | Subtracts the value of register $2 from the value of register $1, then stores the result in register $1. Equivalent to $1 -= $2. |
| Multiply | mul $1, $2 | Multiplies the values of registers $1 and $2, then stores the result in register $1. Equivalent to $1 *= $2. |
| Divide | div $1, $2 | Divides the value of register $1 by the value of register $2, then stores the result in register $1. Equivalent to $1 /= $2. |
| Remainder | rem $1, $2 | Divides the value of register $1 by the value of register $2, then stores the remainder in register $1. Equivalent to $1 %= $2. |
| And | and $1, $2 | Performs a bitwise AND operation on the values of registers $1 and $2, then stores the result in register $1. Equivalent to $1 &= $2. |
| Or | or $1, $2 | Performs a bitwise OR operation on the values of registers $1 and $2, then stores the result in register $1. Equivalent to $1 \|= $2. |
| Xor | xor $1, $2 | Performs a bitwise XOR operation on the values of registers $1 and $2, then stores the result in register $1. Equivalent to $1 ^= $2. |
| Shift Left Logical | lshift $1, 16 | Shifts the value of register $1 left by 16 bits, then stores the result in register $1. The shifted bits are filled with zeros. Equivalent to $1 = $1 << 16. |
| Shift Right Logical | rshift $1, 16 | Shifts the value of register $1 right by 16 bits, then stores the result in register $1. The shifted bits are filled with zeros. Equivalent to $1 = $1 >> 16. |
| Push | push $1 | Pushes the value of register $1 onto the stack. Equivalent to MEM[$sp] = $1; $sp -= 4. |
| Pop | pop $1 | Pops the value off the stack and stores it in register $1. Equivalent to $sp += 4; $1 = MEM[$sp]. |
| Jump to Address by Register | jmp $1 | Unconditionally jumps to the address specified by the value of register $1. Equivalent to $pc = $1. |
| Jump to Address | jmp addr | Unconditionally jumps to the address (next 2 words). Equivalent to $pc = addr. |
| Jump If Zero | jz $1, $2 | Jumps to the address specified by the value of register $1 if the value of register $2 is zero. Equivalent to if ($2 == 0) $pc = $1. |
| Jump If Not Zero | jnz $1, $2 | Jumps to the address specified by the value of register $1 if the value of register $2 is not zero. Equivalent to if ($2 != 0) $pc = $1. |
| Call by Register | call $1 | Calls the function at the address specified by the value of register $1. Equivalent to $1(), or MEM[$sp] = $pc; $sp -= 4; $pc = $1. |
| Call by Address | call addr | Calls the function at the address (next 2 words). Equivalent to MEM[$sp] = $pc; $sp -= 4; $pc = addr. |
| Return | ret | Returns from the current function by jumping to the return address stored on the stack. Equivalent to $sp += 4; $pc = MEM[$sp]. |
| Load Immediate | load $1, imm | Loads the 32-bit immediate value (next 2 words) into register $1. Equivalent to $1 = imm. |
| Load from Address | load $1, addr | Loads a 32-bit value from memory at the address (next 2 words), then stores the result in register $1. Equivalent to $1 = MEM[addr]. |
| Load from Address in Register | load $1, $2 | Loads a 32-bit value from memory at the address specified by the value of register $2, then stores the result in register $1. Equivalent to $1 = MEM[$2]. |
| Store Address | store $1, addr | Stores the value of register $1 in memory at the address (next 2 words). Equivalent to MEM[addr] = $1. |
| Store at Address in Register | store $1, $2 | Stores the value of register $1 in memory at the address specified by the value of register $2. Equivalent to MEM[$2] = $1. |
| System Call | syscall 0x10 | Makes a system call with the specified system call number. Equivalent to sys.0x10(). |
| Halt | halt | Stops execution. This basically just stops the emulator or restarts it using the default ROM (shell) if one is available. |
| Add (Vectors) | vadd $v1, $v2 | Adds the four 32-bit floating-point values in SIMD registers $v1 and $v2, then stores the results in $v1. Equivalent to $v1 += $v2. |
| Subtract (Vectors) | vsub $v1, $v2 | Subtracts the four 32-bit floating-point values in SIMD register $v2 from the values in $v1, then stores the results in $v1. Equivalent to $v1 -= $v2. |
| Multiply (Vectors) | vmul $v1, $v2 | Multiplies the four 32-bit floating-point values in SIMD registers $v1 and $v2, then stores the results in $v1. Equivalent to $v1 *= $v2. |
| Divide (Vectors) | vdiv $v1, $v2 | Divides the four 32-bit floating-point values in SIMD register $v1 by the values in $v2, then stores the results in $v1. Equivalent to $v1 /= $v2. |
| Minimum (Vectors) | vmin $v1, $v2 | Compares the four 32-bit floating-point values in SIMD registers $v1 and $v2, then stores the minimum values in $v1. Equivalent to $v1 = min($v1, $v2). |
| Maximum (Vectors) | vmax $v1, $v2 | Compares the four 32-bit floating-point values in SIMD registers $v1 and $v2, then stores the maximum values in $v1. Equivalent to $v1 = max($v1, $v2). |
| Square Root (Vectors) | vsqrt $v1 | Calculates the square root of the four 32-bit floating-point values in SIMD register $v1, then stores the results in $v1. Equivalent to $v1 = sqrt($v1). |
| Reciprocal Square Root (Vectors) | vrcp $v1 | Calculates the reciprocal square root of the four 32-bit floating-point values in SIMD register $v1, then stores the results in $v1. Equivalent to $v1 = 1 / sqrt($v1). |
| Load (Vectors) | vload $v1, $2 | Loads a 128-bit SIMD value from memory at the address specified by the value of register $2, then stores the result in $v1. Equivalent to $v1 = MEM[$2]. |
| Store (Vectors) | vstore $v1, $2 | Stores the 128-bit SIMD value in SIMD register $v1 in memory at the address specified by the value of register $2. Equivalent to MEM[$2] = $v1. |

Instruction Flags

For standard arithmetic operations, the flag bits are used to distinguish between signed, unsigned, and floating-point operations. This means the arithmetic instructions above actually only represent 5 opcodes, with the flags determining the exact operation. A flag of 0b00 performs unsigned arithmetic, 0b01 performs signed arithmetic, and 0b10 performs floating-point arithmetic.

For jump instructions, the flag bits are used to determine the type of jump. A flag of 0b00 is an unconditional jump, 0b01 is a jump if zero, and 0b10 is a jump if not zero.

For load instructions, the flag bits are used to determine the source of the value to load. A flag of 0b00 loads a 32-bit value from memory at the address specified by the value of register $2, 0b01 loads a 32-bit immediate value (next 2 words), and 0b10 loads a 32-bit value from memory at the address (next 2 words).

For store instructions, the flag bits are used to determine the destination of the value to store. A flag of 0b00 stores the value of register $1 in memory at the address specified by the value of register $2, and 0b01 stores the value of register $1 in memory at the address (next 2 words).
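
As a rough illustration of how the arithmetic flags could select an operation, here is a minimal sketch of a divide handler in C. Divide is used because signed and unsigned division give different bit patterns for the same inputs. The function name `exec_div`, the enum names, and the choice to return 0 on integer division by zero are all assumptions for this example, not details from the actual emulator.

```c
#include <stdint.h>
#include <string.h>

/* Flag values from the text above; the enum names are illustrative. */
enum { FLAG_UNSIGNED = 0x0, FLAG_SIGNED = 0x1, FLAG_FLOAT = 0x2 };

/* Reinterpret raw register bits as a float (and back) via memcpy, which
   avoids strict-aliasing problems. */
static float bits_to_float(uint32_t bits) {
  float f;
  memcpy(&f, &bits, sizeof f);
  return f;
}

static uint32_t float_to_bits(float f) {
  uint32_t bits;
  memcpy(&bits, &f, sizeof bits);
  return bits;
}

/* Hypothetical handler for the divide opcode: `a` and `b` are the raw
   32-bit register contents, `flags` is the 2-bit flag field. */
static uint32_t exec_div(uint32_t a, uint32_t b, unsigned flags) {
  switch (flags) {
  case FLAG_SIGNED: {
    /* Converting to int32_t relies on two's complement representation. */
    int32_t sa = (int32_t)a, sb = (int32_t)b;
    if (sb == 0) return 0;                     /* assumption: no trap */
    if (sa == INT32_MIN && sb == -1) return a; /* avoid overflow in C */
    return (uint32_t)(sa / sb);
  }
  case FLAG_FLOAT:
    return float_to_bits(bits_to_float(a) / bits_to_float(b));
  default: /* FLAG_UNSIGNED */
    return b == 0 ? 0 : a / b;                 /* assumption: no trap */
  }
}
```

With the same register contents, 0xFFFFFFF8 divided by 2 yields 0xFFFFFFFC (-4) under the signed flag and 0x7FFFFFFC under the unsigned flag.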

#Fleshing out the hardware components

With our instructions defined, I could begin implementing the CPU in code. Every file in my project quickly became a skeuomorphic representation of the piece of hardware it was cosplaying. So cpu.h/c housed the CPU, memory.h/c was the memory, rgrom.h/c housed the “ROM” drive which would read our software from “disk”, and so on.

// Here's a snippet from cpu.h showing the layout of the CPU
// code component in the emulator.

/// Represents the CPU for the Retrograde console.
typedef struct {
  bool halted : 1; /// Is the CPU halted?
  u32 regs[16];    /// Our 16 general-purpose registers.
  float vregs[16]; /// Our 4 vector registers (flattened)
  u32 pc;          /// Program counter.
  u32 sp;          /// Stack pointer.
} Cpu;

Our entry point for the whole thing was emulator.c, which created a window using SDL2 and initialized the various components of the system. Once ready, it would just attempt to load a ROM binary from the filepath passed in as an argument and run it9.

The “GPU” of the console was really just OpenGL10 with a fixed set of shaders that simulated the look of games from the late 90s. This meant using vertex-shading instead of pixel-shading, and was more an exercise in accounting for the various effects that the software might want to use11.

#Mapping the memory

To properly appreciate how low-level programs work, I actually spawned a 2-week side project where I built a custom kernel for a Raspberry Pi. While this kernel isn’t useful for any kind of actual work, it was monumental in helping me understand how code interacts with hardware.

The biggest of these learnings was probably memory-mapped I/O (MMIO) and how the CPU interacts with the various components of the system. This was something I had to implement in my emulator to properly simulate the console’s hardware.

/// The following values define where each peripheral device is mapped in
/// memory. The console only has 16MB of addressable memory, so the MMIO
/// addresses start well beyond this range to avoid conflicts with the
/// main system memory addresses.

enum {
  MMIO_BASE = 0xC0000000, /// Base address MMIO (this would be a 3GB offset)

  PERIPHERAL_BASE = 0x1F000000, /// Where the peripherals MMIO starts.

  INPUT_DEVICE_BASE = 0x1F000000, /// Where the input device MMIO starts.
  INPUT_STATUS = INPUT_DEVICE_BASE + 0x00, /// Input device status register.
  INPUT_DATA = INPUT_DEVICE_BASE + 0x04,   /// Input device data register.

  DISK_DRIVE_BASE = 0x1F100000,          /// Where the disk drive MMIO starts.
  DISK_STATUS = DISK_DRIVE_BASE + 0x00,  /// Disk drive status register.
  DISK_COMMAND = DISK_DRIVE_BASE + 0x04, /// Disk drive command register.
  DISK_SECTOR = DISK_DRIVE_BASE + 0x08,  /// Disk drive sector register.
  DISK_BUFFER = DISK_DRIVE_BASE + 0x0C,  /// Disk drive buffer register.

  MEMCARD_BASE = 0x1F200000,             /// Where the memory card MMIO starts.
  MEMCARD_STATUS = MEMCARD_BASE + 0x00,  /// Memory card status register.
  MEMCARD_COMMAND = MEMCARD_BASE + 0x04, /// Memory card command register.
  MEMCARD_ADDRESS = MEMCARD_BASE + 0x08, /// Memory card address register.
  MEMCARD_DATA = MEMCARD_BASE + 0x0C,    /// Memory card data register.

  GPU_BASE = 0x1F300000,              /// Where the GPU MMIO starts.
  GPU_STATUS = GPU_BASE + 0x00,       /// GPU status register.
  GPU_COMMAND = GPU_BASE + 0x04,      /// GPU command register.
  GPU_PRIMITIVE = GPU_BASE + 0x08,    /// GPU primitive register.
  GPU_VERTEX_DATA = GPU_BASE + 0x0C,  /// GPU vertex data register.
  GPU_TEXTURE_DATA = GPU_BASE + 0x10, /// GPU texture data register.
};
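
To show what memory-mapped I/O means in practice, here is a minimal sketch of a bus that routes 32-bit reads and writes either to RAM or to a device register based on the address. The two addresses are copied from the enum above; the `Bus` struct, `bus_read32`, `bus_write32`, the little-endian byte order, and the behavior for unmapped addresses are assumptions made for this example.

```c
#include <stdint.h>

/* Two of the MMIO addresses from the enum above. */
#define INPUT_STATUS 0x1F000000u
#define INPUT_DATA 0x1F000004u

/* Hypothetical bus state: a block of RAM plus the input device registers. */
typedef struct {
  uint8_t *ram;
  uint32_t ram_size;
  uint32_t input_status;
  uint32_t input_data;
} Bus;

/* True when a 4-byte access at `addr` fits entirely inside RAM. */
static int in_ram(const Bus *bus, uint32_t addr) {
  return bus->ram_size >= 4 && addr <= bus->ram_size - 4;
}

/* Addresses inside RAM hit memory; anything else is treated as a device
   register. Little-endian byte order is an assumption. */
static uint32_t bus_read32(const Bus *bus, uint32_t addr) {
  if (in_ram(bus, addr)) {
    const uint8_t *p = bus->ram + addr;
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) | ((uint32_t)p[2] << 16) |
           ((uint32_t)p[3] << 24);
  }
  switch (addr) {
  case INPUT_STATUS: return bus->input_status;
  case INPUT_DATA: return bus->input_data;
  default: return 0; /* unmapped address: returning 0 is an assumption */
  }
}

/* Writes follow the same routing as reads. */
static void bus_write32(Bus *bus, uint32_t addr, uint32_t value) {
  if (in_ram(bus, addr)) {
    uint8_t *p = bus->ram + addr;
    p[0] = (uint8_t)value;
    p[1] = (uint8_t)(value >> 8);
    p[2] = (uint8_t)(value >> 16);
    p[3] = (uint8_t)(value >> 24);
    return;
  }
  if (addr == INPUT_STATUS) bus->input_status = value;
  /* Writes to other addresses are ignored in this sketch. */
}
```

From the program's point of view, a load from INPUT_DATA looks exactly like a load from RAM; only the bus knows that a device answered it.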

#Designing a custom binary format

Before long I had a lot of what I needed to run a program in my emulator. But now I needed a program! Step one was to define another binary layout, this time for something much larger. At this point I had a name for my console and I decided to do the most 90s thing possible and call it a “Retrograde eXecutable” or an .rgx file.

These executable files could not exceed 1GB in size and were stored in little-endian format. Each featured a header containing information about the application (which could be used by a visual shell or app launcher), followed by the actual code and assets for the program.

| Name | Bytes | Description |
| --- | --- | --- |
| Magic Header | 3 | ASCII “rgx” |
| Packing Flags | 1 | Flags for how the file is packed, includes things like “debug build” |
| App Name | 64 | ASCII name of the application |
| App Version | 16 | ASCII version of the application |
| App Author | 64 | ASCII name of the author |
| App Description | 256 | ASCII description of the application |
| Cover Image Data | 512x512x3 | RGB image data for the cover image |
| Entry Point | 4 | Entry point for the program |
| Code Size | 4 | Size of the code section |
| Code Section | N | Compiled instructions |
| Big Data Size | 4 | Size of the big data section |
| Big Data Section | N | Contains the asset data |
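
Reading the fixed part of this header comes down to adding up the field sizes. The sketch below assumes the fields are packed back to back in the order listed with no padding, which the format description does not state explicitly; `RgxInfo`, `rgx_parse`, and the `RGX_OFF_*` names are made up for this example.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Byte offsets derived from the field sizes in the table above, assuming
   the fields are packed in order with no padding. */
enum {
  RGX_OFF_MAGIC = 0,
  RGX_OFF_FLAGS = 3,
  RGX_OFF_NAME = 4,
  RGX_OFF_VERSION = RGX_OFF_NAME + 64,
  RGX_OFF_AUTHOR = RGX_OFF_VERSION + 16,
  RGX_OFF_DESCRIPTION = RGX_OFF_AUTHOR + 64,
  RGX_OFF_COVER = RGX_OFF_DESCRIPTION + 256,
  RGX_OFF_ENTRY = RGX_OFF_COVER + 512 * 512 * 3,
  RGX_OFF_CODE_SIZE = RGX_OFF_ENTRY + 4,
  RGX_OFF_CODE = RGX_OFF_CODE_SIZE + 4
};

/* A few of the header fields, enough to locate and launch the code. */
typedef struct {
  uint8_t flags;
  char name[65]; /* 64 bytes from the file plus a terminating NUL */
  uint32_t entry_point;
  uint32_t code_size;
} RgxInfo;

/* Reads a little-endian 32-bit value regardless of host byte order. */
static uint32_t read_u32_le(const uint8_t *p) {
  return (uint32_t)p[0] | ((uint32_t)p[1] << 8) | ((uint32_t)p[2] << 16) |
         ((uint32_t)p[3] << 24);
}

/* Returns 0 on success, -1 if the buffer is too short, the magic does not
   match, or the code section would run past the end of the buffer. */
static int rgx_parse(const uint8_t *data, size_t len, RgxInfo *out) {
  if (len < (size_t)RGX_OFF_CODE) return -1;
  if (memcmp(data + RGX_OFF_MAGIC, "rgx", 3) != 0) return -1;
  out->flags = data[RGX_OFF_FLAGS];
  memcpy(out->name, data + RGX_OFF_NAME, 64);
  out->name[64] = '\0';
  out->entry_point = read_u32_le(data + RGX_OFF_ENTRY);
  out->code_size = read_u32_le(data + RGX_OFF_CODE_SIZE);
  if (out->code_size > len - (size_t)RGX_OFF_CODE) return -1;
  return 0;
}
```

Under these assumptions the uncompressed cover image alone takes 786,432 bytes, so the code section starts at byte 786,844.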

#How do I build software for this console?

Writing machine code by hand is… well damn near impossible for anything sufficiently complex. I know back in the age of dinosaurs they would program binary using punch cards, but that was to do basic arithmetic and logic operations; not create a full piece of software, let alone a game with 3D and sound.

This is where language toolchains come in12. Toolchains are just a set of tools that work together to create software. Classically, these would be several tools that work together to process each stage of transformation from source code to machine code13.

Because I ultimately just want a textual version of the machine code with a handful of “nice-to-have” utilities14, our toolchain has an extremely simple lexer and parser that feeds directly into the assembler. Then the linker takes all the assembled files and combines them into a single binary file that can be loaded into the emulator.

While initially I had created the SDK using plain C and emulating the distributed toolchain of GCC15, I eventually moved to using Go to create a complete toolchain within a single CLI tool. This was far easier to write and use, allowing me to create a TOML file with project meta information and a folder full of .rgs files that would be compiled into an .rgx file.

#Creating a custom assembly language

Now that I had an instruction set and a CPU to run it, it was time to create an assembly language that could be used to generate machine code and produce executable files. Unoriginally, I called it Retrograde Assembly and gave it the file extension .rgs16.

And yes, I created a TextMate grammar for it along with a VSCode extension to provide syntax highlighting and autocompletion.

; comments start with a semicolon and consume the rest of the line

my_byte: .byte 0x12, 0x13           ; Declare two bytes with the value 0x12 and 0x13
my_short: .2byte 0x1234             ; Declare a 16-bit "word" with the value 0x1234
my_int: .4byte 0x12345678           ; Declare a 32-bit int with the value 0x12345678
my_float: .float 3.14               ; Declare a 32-bit float with the value 3.14
my_string: .ascii "Hello, World!"   ; Declare a string
my_string2: .asciz "Hello, World!"  ; Declare a null-terminated string
my_bytes: .space 16                 ; Allocate 16 bytes of space
my_embed: .incbin "file.bin"        ; Embed a binary file with the size prepended (4 bytes)

main:
    mov $1, $0        ; Clear $1 by moving 0 into it
    load $1, my_int   ; Load the value stored at `my_int` into $1
    add $1, $1        ; Add the value of `my_int` to itself
    push $1           ; Push the sum onto the stack
    load $2, my_func  ; Load the address of `my_func` into $2
    call $2           ; Call the function `my_func`
    pop $5            ; Pop our previous sum off the stack
    add $1, $5        ; Add our previous sum to the result of `my_func`
    halt              ; Stop execution

my_func:
    mul $1, $1        ; Multiply the value of $1 by itself
    ret               ; Return from the function
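
One small piece of an assembler for this syntax is recognizing register operands. The sketch below parses the `$N` and `$vN` forms described earlier (general-purpose registers 0 to 15, SIMD registers 0 to 3). It is written in C for illustration only; `RegOperand` and `parse_register` are hypothetical names, and named registers such as `$sp` are deliberately not handled.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stdlib.h>

/* A parsed register operand. Names are illustrative only. */
typedef struct {
  bool is_vector; /* true for $v0..$v3, false for $0..$15 */
  int index;
} RegOperand;

/* Parses "$0".."$15" and "$v0".."$v3". Returns false for anything else,
   including trailing characters such as "$1x". */
static bool parse_register(const char *text, RegOperand *out) {
  if (text[0] != '$') return false;
  const char *p = text + 1;
  bool is_vector = (*p == 'v');
  if (is_vector) p++;
  /* Require a digit so that strtol cannot accept a sign or whitespace. */
  if (!isdigit((unsigned char)*p)) return false;
  char *end;
  long n = strtol(p, &end, 10);
  if (*end != '\0') return false;
  if (n > (is_vector ? 3 : 15)) return false;
  out->is_vector = is_vector;
  out->index = (int)n;
  return true;
}
```

The resulting index fits directly into a 4-bit register field of an instruction word.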

#Writing assembly by hand sucks

Since I was already this far in, I decided that creating a custom compiler for C wouldn’t be too big a stretch.

Um… yeah. I was wrong.

Creating a lexer and parser isn’t the most complex thing in the world — unless it’s a language like C where a variety of parsing strategies fall apart17. You also have to account for things like type checking, symbol resolution… oh, and don’t forget about the preprocessor!

This led to me making a partial compiler for a subset of C. Basically just enough to write “Hello World” and a few other simple programs that could do some arithmetic, control flow, and function calls. It was slow and very clunky - but it worked!

#Where this project ended up

Like many of my projects, this one stalled when I was pulled away to make some money doing contract work. I was also realizing that, while this was a very cool concept (to me), the amount of effort needed to build any kind of real software for this console was going to be immense.

I was already starting on the Conjure programming language at the time and considered making Retrograde a special target that my language could compile to. But after some time away, I realized that this version of the project had served its purpose for me.

I had a different idea of where I wanted to take the project and the purpose I wanted it to fulfill. This led to the current version of the project that I am slowly working on.