If you’ve read the story on my programming background you’ll know that most of my career has been spent in higher-level, web-focused technologies. This left a lot of gaps in my knowledge when it came to lower-level concepts, like kernel space and assembly.
Wanting to remedy this, I decided to start a personal project to learn the ins and outs of computers the fun way: by building an emulator for hardware that never existed.
Specifically this led to creating a “fantasy game console” — which I’ve named Retrograde. The idea initially was a console from 1998 that supported limited APIs for 3D graphics, audio, and game save storage. I also wanted to create a visual shell for the console, similar to the one found on the Sega Dreamcast.
#Designing specifications for the “hardware”
I wanted something that was a bit more advanced than a Nintendo 64 or PlayStation, but would still feel era appropriate for the year of 1998. This was the same year the Dreamcast came out1, so I used various specifications from the console as a starting point.
I decided I didn’t want to have this console be in 64-bit territory yet, meaning it would be a 32-bit system. Initially it was a MIPS-inspired RISC architecture2, but I later changed it to a custom CISC architecture as I found x86’s approach to stack management more intuitive.
The console offered a fixed-function pipeline for 3D rendering3 with hardware-accelerated T&L4. This allowed for resolutions of 640x480 or 800x600 — with both resolutions supporting 16-bit color and an 8-bit depth buffer. For audio, the console supported 16-bit stereo audio with a 44.1kHz sample rate.
Finally getting into storage, the console supported 16MB of system RAM, with 4MB of video memory. It also allowed for save data to be stored in 4KB blocks within a “memory card”5 that supported up to 128 blocks (512KB total). For read-only media, the console had an API for an RG-ROM drive, which in reality booted up a custom ROM format designed for the console.
#How do we tell a computer what to do?
Implementing the CPU in code seemed like the most logical place to start — after all, it’s the central processing unit. I started by defining the opcodes and instruction set, which was the first major set of learnings that came out of this project. I understood opcodes in theory, as I’ve worked with things like bytecode interpreters before, but proper machine code is a similar, yet different beast.
The bytecode interpreters I had used in the past leveraged a full byte for the opcode, with the following bytes being interpreted based on which opcode byte was read. Machine code for a fixed-width architecture like MIPS doesn’t work like that. Instead, you have “instruction words” that are a fixed length and pack the opcode and its operands into bit fields.
As I mentioned previously, I used MIPS as a starting point for my CPU architecture, which influenced the initial design of my instruction set. Each instruction word was 16 bits long with the lower 6 bits of each instruction being the opcode6. This determined the categorization and layout for the rest of the instruction.
There are three word layout classifications for instructions. The names of these are based on the division of bits in the word after the opcode (which is always the first 6 bits).
- 4-4-2: Most common layout representing two 4-bit register operands and a 2-bit flag. The first register is always intended to be used as the destination or target register, with the second register operating as a source or condition register.
- 4-6: Represents a 4-bit register operand and a 6-bit immediate value. Used by the shift operations.
- 10: Represents a 10-bit immediate value. Used by the system call instruction.
Here's an example of a 4-4-2 instruction word layout:
oooooo tttt ssss ff
^      ^    ^    ^
|      |    |    +-- flags are the last 2 bits
|      |    +------- source register is the next 4 bits
|      +------------ target register is the next 4 bits
+------------------- opcode is the first 6 bits
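To make the three layouts concrete, here is a minimal decoding sketch. It is not the emulator's actual code: the `Decoded` and `Layout` types and the `decode_word` function are hypothetical names, and it assumes the opcode occupies the most significant 6 bits of the word, reading the diagram left to right. If the opcode really lives in the low 6 bits, the shifts would need to be mirrored.

```c
#include <assert.h>
#include <stdint.h>

/* Decoded fields of one 16-bit instruction word (illustrative names). */
typedef struct {
    uint8_t  opcode; /* 6 bits, present in every layout */
    uint8_t  target; /* 4 bits, 4-4-2 and 4-6 layouts */
    uint8_t  source; /* 4 bits, 4-4-2 layout only */
    uint8_t  flags;  /* 2 bits, 4-4-2 layout only */
    uint16_t imm;    /* 6 bits (4-6 layout) or 10 bits (10 layout) */
} Decoded;

typedef enum { LAYOUT_4_4_2, LAYOUT_4_6, LAYOUT_10 } Layout;

/* Assumption: bit 15 is the leftmost 'o' in the diagram. */
static Decoded decode_word(uint16_t word, Layout layout) {
    Decoded d = {0};
    d.opcode = (uint8_t)(word >> 10);
    switch (layout) {
    case LAYOUT_4_4_2:
        d.target = (word >> 6) & 0xF;
        d.source = (word >> 2) & 0xF;
        d.flags  = word & 0x3;
        break;
    case LAYOUT_4_6:
        d.target = (word >> 6) & 0xF;
        d.imm    = word & 0x3F;
        break;
    case LAYOUT_10:
        d.imm    = word & 0x3FF;
        break;
    default:
        assert(!"unknown layout");
        break;
    }
    return d;
}
```

In a real decoder the layout would be looked up from the opcode itself, since the opcode determines how the remaining 10 bits are split.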
#Registers, SIMD, and the stack
There are 16 general-purpose registers that can be used to store 32-bit integers or floating point numbers. The registers are numbered from 0 to 15, and can be accessed using the $ prefix followed by the register number. For example, register 0 is $0, register 1 is $1, …
There is also a vector processing unit that can operate on 128-bit SIMD7 registers. The SIMD registers are numbered from 0 to 3, and can be accessed using the $v prefix followed by the register number. For example, SIMD register 0 is $v0, SIMD register 1 is $v1, …
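Since a general-purpose register can hold either an integer or a float, an emulator has to reinterpret the same 32 bits depending on the instruction. The sketch below shows one portable way to do that in C. The helper names `reg_as_float` and `float_as_reg` are hypothetical, and it assumes the host uses 32-bit IEEE-754 floats.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Read a register's raw 32 bits as a float without violating aliasing rules. */
static float reg_as_float(uint32_t bits) {
    float f;
    assert(sizeof f == sizeof bits); /* assumes a 32-bit host float */
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* Store a float back into a register as its raw 32-bit pattern. */
static uint32_t float_as_reg(float f) {
    uint32_t bits;
    assert(sizeof f == sizeof bits);
    memcpy(&bits, &f, sizeof bits);
    return bits;
}
```

Using `memcpy` rather than a pointer cast keeps the conversion well-defined, and compilers typically reduce it to a single register move.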
#The instruction set
If you’re interested, here’s the instruction set available when creating software for Retrograde using Retrograde Assembly8.
Instruction | ASM Example | Description |
---|---|---|
Move | mov $1, $2 | Copies the value of register $2 into register $1. Equivalent to $1 = $2 . |
Add | add $1, $2 | Adds the values of registers $1 and $2, then stores the result in register $1. Equivalent to $1 += $2 . |
Subtract | sub $1, $2 | Subtracts the value of register $2 from the value of register $1, then stores the result in register $1. Equivalent to $1 -= $2 . |
Multiply | mul $1, $2 | Multiplies the values of registers $1 and $2, then stores the result in register $1. Equivalent to $1 *= $2 . |
Divide | div $1, $2 | Divides the value of register $1 by the value of register $2, then stores the result in register $1. Equivalent to $1 /= $2 . |
Remainder | rem $1, $2 | Divides the value of register $1 by the value of register $2, then stores the remainder in register $1. Equivalent to $1 %= $2 . |
And | and $1, $2 | Performs a bitwise AND operation on the values of registers $1 and $2, then stores the result in register $1. Equivalent to $1 &= $2 . |
Or | or $1, $2 | Performs a bitwise OR operation on the values of registers $1 and $2, then stores the result in register $1. Equivalent to $1 \|= $2 . |
Xor | xor $1, $2 | Performs a bitwise XOR operation on the values of registers $1 and $2, then stores the result in register $1. Equivalent to $1 ^= $2 . |
Shift Left Logical | lshift $1, 16 | Shifts the value of register $1 left by 16 bits, then stores the result in register $1. The shifted bits are filled with zeros. Equivalent to $1 = $1 << 16 . |
Shift Right Logical | rshift $1, 16 | Shifts the value of register $1 right by 16 bits, then stores the result in register $1. The shifted bits are filled with zeros. Equivalent to $1 = $1 >> 16 . |
Push | push $1 | Pushes the value of register $1 onto the stack. Equivalent to $sp -= 4; MEM[$sp] = $1 . |
Pop | pop $1 | Pops the value off the stack and stores it in register $1. Equivalent to $1 = MEM[$sp]; $sp += 4 . |
Jump to Address by Register | jmp $1 | Unconditionally jumps to the address specified by the value of register $1. Equivalent to $pc = $1 . |
Jump to Address | jmp addr | Unconditionally jumps to the address (next 2 words). Equivalent to $pc = addr . |
Jump If Zero | jz $1, $2 | Jumps to the address specified by the value of register $1 if the value of register $2 is zero. Equivalent to if ($2 == 0) $pc = $1 . |
Jump If Not Zero | jnz $1, $2 | Jumps to the address specified by the value of register $1 if the value of register $2 is not zero. Equivalent to if ($2 != 0) $pc = $1 . |
Call by Register | call $1 | Calls the function at the address specified by the value of register $1. Equivalent to $sp -= 4; MEM[$sp] = $pc; $pc = $1 . |
Call by Address | call addr | Calls the function at the address (next 2 words). Equivalent to $sp -= 4; MEM[$sp] = $pc; $pc = addr . |
Return | ret | Returns from the current function by jumping to the return address stored on the stack. $pc = MEM[$sp]; $sp += 4 |
Load Immediate | load $1, imm | Loads the 32-bit immediate value (next 2 words) into register $1. Equivalent to $1 = imm . |
Load from Address | load $1, addr | Loads a 32-bit value from memory at the address (next 2 words), then stores the result in register $1. Equivalent to $1 = MEM[addr] . |
Load from Address in Register | load $1, $2 | Loads a 32-bit value from memory at the address specified by the value of register $2, then stores the result in register $1. Equivalent to $1 = MEM[$2] . |
Store Address | store $1, addr | Stores the value of register $1 in memory at the address (next 2 words). Equivalent to MEM[addr] = $1 . |
Store at Address in Register | store $1, $2 | Stores the value of register $1 in memory at the address specified by the value of register $2. Equivalent to MEM[$2] = $1 . |
System Call | syscall 0x10 | Makes a system call with the specified system call number. Equivalent to sys.0x10() . |
Halt | halt | Stops execution. This basically just stops the emulator or restarts it using the default ROM (shell) if one is available. |
Add (Vectors) | vadd $v1, $v2 | Adds the four 32-bit floating-point values in SIMD registers $v1 and $v2, then stores the results in $v1. Equivalent to $v1 += $v2 . |
Subtract (Vectors) | vsub $v1, $v2 | Subtracts the four 32-bit floating-point values in SIMD register $v2 from the values in $v1, then stores the results in $v1. Equivalent to $v1 -= $v2 . |
Multiply (Vectors) | vmul $v1, $v2 | Multiplies the four 32-bit floating-point values in SIMD registers $v1 and $v2, then stores the results in $v1. Equivalent to $v1 *= $v2 . |
Divide (Vectors) | vdiv $v1, $v2 | Divides the four 32-bit floating-point values in SIMD register $v1 by the values in $v2, then stores the results in $v1. Equivalent to $v1 /= $v2 . |
Minimum (Vectors) | vmin $v1, $v2 | Compares the four 32-bit floating-point values in SIMD registers $v1 and $v2, then stores the minimum values in $v1. Equivalent to $v1 = min($v1, $v2) . |
Maximum (Vectors) | vmax $v1, $v2 | Compares the four 32-bit floating-point values in SIMD registers $v1 and $v2, then stores the maximum values in $v1. Equivalent to $v1 = max($v1, $v2) . |
Square Root (Vectors) | vsqrt $v1 | Calculates the square root of the four 32-bit floating-point values in SIMD register $v1, then stores the results in $v1. Equivalent to $v1 = sqrt($v1) . |
Reciprocal Square Root (Vectors) | vrcp $v1 | Calculates the reciprocal square root of the four 32-bit floating-point values in SIMD register $v1, then stores the results in $v1. Equivalent to $v1 = 1 / sqrt($v1) . |
Load (Vectors) | vload $v1, $2 | Loads a 128-bit SIMD value from memory at the address specified by the value of register $2, then stores the result in $v1. Equivalent to $v1 = MEM[$2] . |
Store (Vectors) | vstore $v1, $2 | Stores the 128-bit SIMD value in SIMD register $v1 in memory at the address specified by the value of register $2. Equivalent to MEM[$2] = $v1 . |
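The stack instructions are easiest to follow as code. This is a minimal sketch rather than the emulator's implementation: the `Machine` struct, the 64KB stand-in for RAM, and the `op_*` helpers are all hypothetical. It assumes a descending stack where a push decrements $sp before writing, so that pop and ret read back the most recently pushed value. Values are stored in host byte order for brevity.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define RAM_SIZE (64u * 1024u) /* small stand-in for the console's 16MB */

typedef struct {
    uint32_t pc, sp;
    uint8_t  ram[RAM_SIZE];
} Machine;

static void write32(Machine *m, uint32_t addr, uint32_t value) {
    assert(addr <= RAM_SIZE - 4);
    memcpy(&m->ram[addr], &value, 4);
}

static uint32_t read32(const Machine *m, uint32_t addr) {
    uint32_t value;
    assert(addr <= RAM_SIZE - 4);
    memcpy(&value, &m->ram[addr], 4);
    return value;
}

/* push: make room first, then store. */
static void op_push(Machine *m, uint32_t value) {
    m->sp -= 4;
    write32(m, m->sp, value);
}

/* pop: read the top of the stack, then release the slot. */
static uint32_t op_pop(Machine *m) {
    uint32_t value = read32(m, m->sp);
    m->sp += 4;
    return value;
}

/* call: save the return address ($pc), then jump. */
static void op_call(Machine *m, uint32_t target) {
    op_push(m, m->pc);
    m->pc = target;
}

/* ret: jump back to the saved return address. */
static void op_ret(Machine *m) {
    m->pc = op_pop(m);
}
```

With $sp initialized to the top of RAM, a push followed by a call and a ret leaves the pushed value on top of the stack again, which is what the `main`/`my_func` pattern in assembly relies on.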
Instruction Flags
For standard arithmetic operations, the flag bits are used to distinguish between signed, unsigned, and floating point operations. This means the arithmetic instructions above (add, sub, mul, div, and rem) actually only represent 5 opcodes, with the flags determining the exact operation. A flag of 0b00 performs unsigned arithmetic, 0b01 performs signed arithmetic, and 0b10 performs floating point arithmetic.
For jump instructions, the flag bits are used to determine the type of jump. A flag of 0b00 is an unconditional jump, 0b01 is a jump if zero, and 0b10 is a jump if not zero.
For load instructions, the flag bits are used to determine the source of the value to load. A flag of 0b00 loads a 32-bit value from memory at the address specified by the value of register $2, 0b01 loads a 32-bit immediate value (next 2 words), and 0b10 loads a 32-bit value from memory at the address (next 2 words).
For store instructions, the flag bits are used to determine the destination of the value to store. A flag of 0b00 stores the value of register $1 in memory at the address specified by the value of register $2, and 0b01 stores the value of register $1 in memory at the address (next 2 words).
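Division is a good place to see why the arithmetic flags matter, because the same bit pattern gives different results under each interpretation. The following is a hedged sketch: `exec_div` and the `FLAG_*` names are hypothetical, and the post does not say how the console handles division by zero or the reserved flag value 0b11, so the sketch simply asserts on those cases.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum { FLAG_UNSIGNED = 0, FLAG_SIGNED = 1, FLAG_FLOAT = 2 };

/* One opcode, three behaviours selected by the 2-bit flag field. */
static uint32_t exec_div(uint32_t a, uint32_t b, unsigned flags) {
    switch (flags & 0x3) {
    case FLAG_UNSIGNED:
        assert(b != 0); /* behaviour on zero is unspecified in the post */
        return a / b;
    case FLAG_SIGNED: {
        int32_t sa = (int32_t)a, sb = (int32_t)b;
        assert(sb != 0 && !(sa == INT32_MIN && sb == -1));
        return (uint32_t)(sa / sb);
    }
    case FLAG_FLOAT: {
        float fa, fb, fr;
        uint32_t result;
        memcpy(&fa, &a, 4); /* reinterpret register bits as floats */
        memcpy(&fb, &b, 4);
        fr = fa / fb;
        memcpy(&result, &fr, 4);
        return result;
    }
    default:
        assert(!"reserved flag value 0b11");
        return 0;
    }
}
```

For example, the bits 0xFFFFFFF8 divided by 2 give 0x7FFFFFFC when treated as unsigned, but 0xFFFFFFFC (that is, -4) when treated as signed.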
#Fleshing out the hardware components
With the instructions defined, I could begin implementing the CPU in code. Every file in my project quickly became a skeuomorphic representation of the piece of hardware it was cosplaying. So cpu.h/c housed the CPU, memory.h/c was the memory, rgrom.h/c housed the “ROM” drive which would read our software from “disk”, and so on.
// Here's a snippet from cpu.h showing the layout of the CPU
// code component in the emulator.
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t u32; // Assumed alias; its definition isn't shown in this snippet.

/// Represents the CPU for the Retrograde console.
typedef struct {
    bool halted : 1;  /// Is the CPU halted?
    u32 regs[16];     /// Our 16 general-purpose registers.
    float vregs[16];  /// Our 4 vector registers (flattened: 4 registers x 4 lanes).
    u32 pc;           /// Program counter.
    u32 sp;           /// Stack pointer.
} Cpu;
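Because the four 128-bit vector registers are flattened into a single array of 16 floats, a vector instruction has to map a register index to a group of four lanes. A rough sketch of that indexing, using a lane-wise add as the example, might look like this. `VectorUnit`, `vreg`, and `op_vadd` are hypothetical names, and the register-major ordering (register index times 4 plus lane) is an assumption.

```c
#include <assert.h>

/* Same flattened layout as the vregs field in the Cpu struct. */
typedef struct {
    float vregs[16];
} VectorUnit;

/* Pointer to the first of the four lanes of SIMD register $v<index>. */
static float *vreg(VectorUnit *vu, unsigned index) {
    assert(index < 4);
    return &vu->vregs[index * 4];
}

/* vadd $vd, $vs : lane-wise $vd += $vs */
static void op_vadd(VectorUnit *vu, unsigned vd, unsigned vs) {
    float *dst = vreg(vu, vd);
    const float *src = vreg(vu, vs);
    for (int lane = 0; lane < 4; lane++) {
        dst[lane] += src[lane];
    }
}
```

The other vector instructions (vsub, vmul, vmin, and so on) would follow the same loop shape with a different per-lane operation.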
Our entry point for the whole thing was emulator.c, which created a window using SDL2 and initialized the various components of the system. Once ready, it would just attempt to load a ROM binary from the filepath passed in as an argument and run it9.
The “GPU” of the console was really just OpenGL10 with a fixed set of shaders that simulated the look of games from the late 90s. This meant using vertex-shading instead of pixel-shading, and was more an exercise in accounting for the various effects that the software might want to use11.
#Mapping the memory
To properly appreciate how low-level programs work, I actually spawned a 2-week side project where I built a custom kernel for a Raspberry Pi. While this kernel isn’t useful for any kind of actual work, it was monumental in helping me understand how code interacts with hardware.
The biggest of these learnings was probably memory-mapped I/O (MMIO) and how the CPU interacts with the various components of the system. This was something I had to implement in my emulator to properly simulate the console’s hardware.
/// The following values define where each peripheral device is mapped in
/// memory. The console only has 16MB of addressable memory, so the MMIO
/// addresses start well beyond this range to avoid conflicts with the
/// main system memory addresses.
enum {
    MMIO_BASE = 0xC0000000,        /// Base address MMIO (this would be a 3GB offset)
    PERIPHERAL_BASE = 0x1F000000,  /// Where the peripherals MMIO starts.

    INPUT_DEVICE_BASE = 0x1F000000,            /// Where the input device MMIO starts.
    INPUT_STATUS = INPUT_DEVICE_BASE + 0x00,   /// Input device status register.
    INPUT_DATA = INPUT_DEVICE_BASE + 0x04,     /// Input device data register.

    DISK_DRIVE_BASE = 0x1F100000,              /// Where the disk drive MMIO starts.
    DISK_STATUS = DISK_DRIVE_BASE + 0x00,      /// Disk drive status register.
    DISK_COMMAND = DISK_DRIVE_BASE + 0x04,     /// Disk drive command register.
    DISK_SECTOR = DISK_DRIVE_BASE + 0x08,      /// Disk drive sector register.
    DISK_BUFFER = DISK_DRIVE_BASE + 0x0C,      /// Disk drive buffer register.

    MEMCARD_BASE = 0x1F200000,                 /// Where the memory card MMIO starts.
    MEMCARD_STATUS = MEMCARD_BASE + 0x00,      /// Memory card status register.
    MEMCARD_COMMAND = MEMCARD_BASE + 0x04,     /// Memory card command register.
    MEMCARD_ADDRESS = MEMCARD_BASE + 0x08,     /// Memory card address register.
    MEMCARD_DATA = MEMCARD_BASE + 0x0C,        /// Memory card data register.

    GPU_BASE = 0x1F300000,                     /// Where the GPU MMIO starts.
    GPU_STATUS = GPU_BASE + 0x00,              /// GPU status register.
    GPU_COMMAND = GPU_BASE + 0x04,             /// GPU command register.
    GPU_PRIMITIVE = GPU_BASE + 0x08,           /// GPU primitive register.
    GPU_VERTEX_DATA = GPU_BASE + 0x0C,         /// GPU vertex data register.
    GPU_TEXTURE_DATA = GPU_BASE + 0x10,        /// GPU texture data register.
};
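With a map like this, every load or store the CPU performs first has to be routed to either RAM or a peripheral. The sketch below illustrates that routing step only. The `Device` enum and `route_address` function are hypothetical, and it assumes each peripheral owns a 1MB window, which is inferred from the base addresses being 0x100000 apart rather than stated anywhere in the post.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define RAM_BYTES (16u * 1024u * 1024u) /* 16MB of system RAM */

typedef enum {
    DEV_RAM, DEV_INPUT, DEV_DISK, DEV_MEMCARD, DEV_GPU, DEV_UNMAPPED
} Device;

/* Decide which component an address belongs to, and compute the
 * offset relative to that component's base address. */
static Device route_address(uint32_t addr, uint32_t *offset) {
    assert(offset != NULL);
    if (addr < RAM_BYTES) {
        *offset = addr;
        return DEV_RAM;
    }
    *offset = addr & 0x000FFFFFu;   /* offset inside the assumed 1MB window */
    switch (addr & 0xFFF00000u) {   /* upper bits select the window */
    case 0x1F000000u: return DEV_INPUT;
    case 0x1F100000u: return DEV_DISK;
    case 0x1F200000u: return DEV_MEMCARD;
    case 0x1F300000u: return DEV_GPU;
    default:          return DEV_UNMAPPED;
    }
}
```

A memory read would then switch on the returned device: RAM reads index into the RAM buffer, while a read at offset 0x04 of the disk window would be handed to the disk drive's command register logic.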
#Designing a custom binary format
Before long I had a lot of what I needed to run a program in my emulator. But now I needed a program! Step one was to define another binary layout, this time for something much larger. At this point I had a name for my console and I decided to do the most 90s thing possible and call it a “Retrograde eXecutable” or an .rgx file.
These executable files could not exceed 1GB in size and were stored in little-endian format. Each one featured a header that contained information about the application (which could be used by a visual shell or app launcher), followed by the actual code and assets for the program.
Name | Bytes | Description |
---|---|---|
Magic Header | 3 | ASCII “rgx” |
Packing Flags | 1 | Flags for how the file is packed, includes things like “debug build” |
App Name | 64 | ASCII name of the application |
App Version | 16 | ASCII version of the application |
App Author | 64 | ASCII name of the author |
App Description | 256 | ASCII description of the application |
Cover Image Data | 512x512x3 | RGB image data for the cover image |
Entry Point | 4 | Entry point for the program |
Code Size | 4 | Size of the code section |
Code Section | N | Compiled instructions |
Big Data Size | 4 | Size of the big data section |
Big Data Section | N | Contains the asset data |
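The table translates fairly directly into byte offsets. As a rough sketch, assuming the fields are packed back-to-back in table order with no padding (the post does not confirm this), the offsets and a little-endian reader could be written as follows. The `RGX_OFF_*` constants, `read_le32`, and `rgx_has_magic` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Byte offsets derived from the field sizes in the table (assumed packed). */
enum {
    RGX_OFF_MAGIC       = 0,
    RGX_OFF_FLAGS       = RGX_OFF_MAGIC + 3,
    RGX_OFF_NAME        = RGX_OFF_FLAGS + 1,
    RGX_OFF_VERSION     = RGX_OFF_NAME + 64,
    RGX_OFF_AUTHOR      = RGX_OFF_VERSION + 16,
    RGX_OFF_DESCRIPTION = RGX_OFF_AUTHOR + 64,
    RGX_OFF_COVER       = RGX_OFF_DESCRIPTION + 256,
    RGX_OFF_ENTRY       = RGX_OFF_COVER + 512 * 512 * 3,
    RGX_OFF_CODE_SIZE   = RGX_OFF_ENTRY + 4,
    RGX_OFF_CODE        = RGX_OFF_CODE_SIZE + 4
};

/* Read a 32-bit little-endian value regardless of host byte order. */
static uint32_t read_le32(const uint8_t *p) {
    assert(p != NULL);
    return (uint32_t)p[0]
         | (uint32_t)p[1] << 8
         | (uint32_t)p[2] << 16
         | (uint32_t)p[3] << 24;
}

/* Check that a buffer starts with the ASCII magic "rgx". */
static int rgx_has_magic(const uint8_t *file, size_t len) {
    return len >= 3 && memcmp(file, "rgx", 3) == 0;
}
```

Under these assumptions the cover image dominates the header at 786,432 bytes, so the entry point sits at offset 786,836 and the code section begins at 786,844.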
#How do I build software for this console?
Writing machine code by hand is… well damn near impossible for anything sufficiently complex. I know back in the age of dinosaurs they would program binary using punch cards, but that was to do basic arithmetic and logic operations; not create a full piece of software, let alone a game with 3D and sound.
This is where language toolchains come in12. Toolchains are just a set of tools that work together to create software. Classically, these would be several tools that work together to process each stage of transformation from source code to machine code13.
Because I ultimately just want a textual version of the machine code with a handful of “nice-to-have” utilities14, our toolchain has an extremely simple lexer and parser that feeds directly into the assembler. Then the linker takes all the assembled files and combines them into a single binary file that can be loaded into the emulator.
While initially I had created the SDK using plain C and emulating the distributed toolchain of GCC15, I eventually moved to using Go to create a complete toolchain within a single CLI tool. This was far easier to write and use, allowing me to create a TOML file with project meta information and a folder full of .rgs files that would be compiled into an .rgx file.
#Creating a custom assembly language
Now that I had an instruction set and a CPU to run it, it was time to create an assembly language that could be used to generate our machine code and produce executable files. Unoriginally, I called it Retrograde Assembly and gave it the file extension .rgs
16.
And yes, I created a TextMate grammar for it along with a VSCode extension to provide syntax highlighting and autocompletion.
; comments start with a semicolon and consume the rest of the line
my_byte: .byte 0x12, 0x13 ; Declare two bytes with the value 0x12 and 0x13
my_short: .2byte 0x1234 ; Declare a 16-bit "word" with the value 0x1234
my_int: .4byte 0x12345678 ; Declare a 32-bit int with the value 0x12345678
my_float: .float 3.14 ; Declare a 32-bit float with the value 3.14
my_string: .ascii "Hello, World!" ; Declare a string
my_string2: .asciz "Hello, World!" ; Declare a null-terminated string
my_bytes: .space 16 ; Allocate 16 bytes of space
my_embed: .incbin "file.bin" ; Embed a binary file with the size prepended (4 bytes)
main:
mov $1, $0 ; Clear $1 by moving 0 into it
load $1, my_int ; Load the address of `my_int` into $1
load $1, $1 ; Load the value stored at that address into $1
add $1, $1 ; Add the value of `my_int` to itself
push $1 ; Push the sum onto the stack
load $2, my_func ; Load the address of `my_func` into $2
call $2 ; Call the function `my_func`
pop $5 ; Pop our previous sum off the stack
add $1, $5 ; Add our previous sum to the result of `my_func`
halt ; Stop execution
my_func:
mul $1, $1 ; Multiply the value of $1 by itself
ret ; Return from the function
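To give a feel for what the assembler does with a line such as `add $1, $2`, here is a minimal encoding sketch. It is illustrative only: the opcode numbers are made up (the post does not list the real assignments), `encode_442` and `encode_imm32` are hypothetical helpers, the opcode is assumed to sit in the top 6 bits of the word, and the low-word-first ordering of 32-bit immediates is a guess based on the little-endian file format.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical opcode numbers, chosen only for this example. */
enum { OP_MOV = 0x01, OP_ADD = 0x02 };

/* Pack a 4-4-2 instruction: oooooo tttt ssss ff */
static uint16_t encode_442(unsigned opcode, unsigned target,
                           unsigned source, unsigned flags) {
    assert(opcode < 64 && target < 16 && source < 16 && flags < 4);
    return (uint16_t)(opcode << 10 | target << 6 | source << 2 | flags);
}

/* Split a 32-bit immediate or address into the "next 2 words"
 * that follow an instruction (assumed order: low word first). */
static void encode_imm32(uint32_t imm, uint16_t out[2]) {
    out[0] = (uint16_t)(imm & 0xFFFFu);
    out[1] = (uint16_t)(imm >> 16);
}
```

With these made-up opcodes, `add $1, $2` with an unsigned flag would assemble to the single word 0x0848, while `load $1, my_int` would be one instruction word followed by the two words produced by `encode_imm32` for the label's resolved address.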
#Writing assembly by hand sucks
Since I was already this far in, I decided that creating a custom compiler for C wouldn’t be too big a stretch.
Um… yeah. I was wrong.
Creating a lexer and parser isn’t the most complex thing in the world — unless it’s a language like C where a variety of parsing strategies fall apart17. You also have to account for things like type checking, symbol resolution… oh, and don’t forget about the preprocessor!
This led to me making a partial compiler for a subset of C. Basically just enough to write “Hello World” and a few other simple programs that could do some arithmetic, control flow, and function calls. It was slow and very clunky, but it worked!
#Where this project ended up
Like many of my projects, I was pulled away in order to make some money doing contract work. I was realizing that, while a very cool concept (to me), the amount of effort to build any kind of real software for this console was going to be immense.
I was already starting on the Conjure programming language at the time and considered making Retrograde a special target that my language could compile to. But after some time away, I realized that this version of the project had served its purpose for me.
I had a different idea of where I wanted to take the project and the purpose I wanted it to fulfill. This led to the current version of the project that I am slowly working on.