News:

As a consequence of the forum being updated and repaired, the chatbox has been lost.
However, you can still come say hi on our Discord server!

Main Menu

Camelot's C compiler

Started by tarpman, 10, January, 2021, 10:55:23 PM

Previous topic - Next topic

0 Members and 1 Guest are viewing this topic.

tarpman

Has anyone done any research into identifying or reproducing Camelot's C compiler?

I've been looking at agbcc, which I guess is supposed to be a reproduction of (or at least very close to) Nintendo's actual toolchain (apparently Cygnus GNUPro). The output is similar enough that I'm willing to believe Camelot's compiler is also a gcc, but there are some major differences too. These same patterns occur in Camelot's other games as well. I have not looked at many non-Camelot games, but the few I checked all look much more like agbcc.

I. Calls clobber r4

Camelot's code actually departs from the ARM standard by making r4 a call-clobbered (caller-save) register. This actually makes me think they modified the compiler themselves (or engaged Cygnus to do it for them) as I haven't found any commercial compiler offering this ability. Getting gcc/agbcc to do this is a trivial change in the Thumb config.

--- a/gcc/thumb.h
+++ b/gcc/thumb.h
@@ -405,7 +405,7 @@
#define CALL_USED_REGISTERS    \
{                              \
   1,1,1,1,                     \
-  0,0,0,0,                     \
+  1,0,0,0,                     \
   0,0,0,1,                     \
   1,1,1,1,1                    \
}


II. Register allocation order

I think it's fairly clear from reading the code that the low registers are allocated in reverse: r3, r2, r1, r0. This actually looks suspiciously similar to what agbcc's ARM compiler does. Prioritizing ip (r12) looks similar too, but it's not a straightforward comparison because r12 is a high register.

/* The order in which register should be allocated.  It is good to use ip
   since no saving is required (though calls clobber it) and it never contains
   function parameters.  It is quite good to use lr since other calls may
   clobber it anyway.  Allocate r0 through r3 in reverse order since r3 is
   least likely to contain a function parameter; in addition results are
   returned in r0.
   */
#define REG_ALLOC_ORDER      \
{                                   \
     3,  2,  1,  0, 12, 14,  4,  5, \
     6,  7,  8, 10,  9, 11, 13, 15, \
    16, 17, 18, 19, 20, 21, 22, 23, \
    24, 25, 26     \
}


agbcc's Thumb compiler does not define REG_ALLOC_ORDER at all, so the registers are just allocated in order.

III. Instruction scheduling

This one is harder to explain, or might be evidence that the compiler is not gcc/agbcc based.

agbcc's Thumb compiler does not support delay slots or function units for instruction scheduling. However, Camelot's compiler clearly has something of the sort. Here's a simple example, GS1 at 0x08004458:

ldr r1, =0x03001CB4
ldr r3, =0x41C64E6D
ldr r2, [r1]
add r0, r2, #0
mul r0, r3
ldr r3, =12345
add r0, r0, r3
str r0, [r1]
lsl r0, r0, #8
lsr r0, r0, #16
bx lr


My attempt at the corresponding C code, and agbcc's output (Compiler Explorer -- edited to inline the constants):

extern volatile unsigned rng_state;

unsigned short random(void) {
unsigned new_state = rng_state * 0x41C64E6D + 12345;
rng_state = new_state;
return new_state >> 8;
}


ldr r2, =0x03001CB4
ldr r1, [r2]
ldr r0, =0x41C64E6D
mul r0, r0, r1
ldr r1, =12345
add r0, r0, r1
str r0, [r2]
lsl r0, r0, #8
lsr r0, r0, #16
bx lr


The register allocation is reversed like I mentioned above (r2 → r1, r1 → r2, r0 → r3). Ignoring that, the interesting thing is the order of the first three instructions. You see that Camelot's compiler pipelined a second LDR while waiting for the first, but agbcc just lets the stall happen. This is a tiny example, but the same pattern exists throughout. Here's GS1 at 0x0808B25C:

push {r5, r6, lr}
ldr r2, =0x02000240
mov r3, #224
mov r12, r2
lsl r3, r3, #1
ldr r4, =0x0809E270
add r3, r12
mov r2, #0
ldrsh r0, [r3, r2]
ldmia r4!, {r2}


It consistently tries to insert at least one instruction (and I think it prefers two) after any load, before using the result. I don't think it's aware of memory regions -- it looks like it uses the same delay no matter whether the load is from IRAM, ERAM, or ROM.

IV. Strange literal pools

The above all make sense as performance improvements. This last one has me a bit stumped.

Usually a literal pool holds the literals in the order they're used. In some cases, though, a few literals are stored at the beginning of the pool instead. These early ones are also strange because sometimes they hold small numbers which could be encoded as immediates or synthesized -- sometimes even zero.

Here's an example from GS1 at 0x0800C1BA:

add r2, r6, #0
ldr r3, [pc, #4]
add r2, #84
strb r3, [r2, #0]
b 0x0800C230
.word 0
.word 0x0FFF


There's no need for a literal here -- mov r3, #0 is perfectly valid Thumb. The zero is also out of order: the second literal is referenced earlier, at 0x0800C172.

Using literal values this way has to be bad for performance. In this specific case the pool is nearby, but in most other cases it falls well outside of prefetch range (8 opcodes according to endrift). I assume, then, that these are compiled as placeholders, and are filled in by something else in the toolchain. The linker doesn't usually fill in values inside functions as far as I know, but if they modified the compiler, they could have modified the linker too...



Has anyone else done research in this area? Please post links!

Salanewt

Hey, it's very nice to see some research happening in this area!

I can't say I'm an expert on any of this sadly, so I can't really answer too many questions but I feel like GS2 is subtly different from the first for a lot of this. Have you looked into it at all?
Oh yeah baby, £ me harder.

Fusion is just a cheap tactic to make weak Adepts stronger.

Yoshi's Lighthouse is a hacking website in progress. Why not check it out if you like Yoshi or the Mario & Luigi games?

tarpman

#2
Quote from: Salanewt on 11, January, 2021, 06:44:45 PM
I feel like GS2 is subtly different from the first for a lot of this. Have you looked into it at all?

That's a very good question, and no, I've only skimmed GS2 at a very high level, haven't started disassembling it systematically yet.

Guessing from the indirect call tables (which are another mystery of their own), it looks like GS1's 0x08004458 is at 0x08014878 in GS2 (stub at 0x080000F8 for both). The code is almost, but not quite identical:

ldr r1, =0x030011BC
ldr r3, =0x41C64E6D
ldr r2, [r1]
add r0, r2, #0
mul r0, r3
mov r3, #0xC0
lsl r3, r3, #6
add r3, #0x39
add r0, r0, r3
str r0, [r1]
lsl r0, r0, #8
lsr r0, r0, #16
bx lr


That's interesting indeed, the second constant (the increment) is still 0x3039 (12345), but now it's synthesized instead of loaded. That looks like a performance improvement to me.

I couldn't find my second example in GS2 - the GS1 code I pasted is from the function that determines the battle background based on the current area, if you know where that is.

So, a different comparison. Here's the beginning of AgbMain, in GS1 it's at 0x08002E00:

push {r5, lr}
ldr r2, =0x40000B0
ldr r3, =0xC5FF
ldrh 1, [r2, #10]
and r3, r1
strh r3, [r2, #10]
ldr r3, =0x7FFF
ldrh r1, [r2, #10]
and r3, r1
strh r3, [r2, #10]
ldrh r3, [r2, #10]
sub sp, #4
ldr r2, =0x4014
ldr r3, =0x04000204
strh r2, [r3]
mov r0, sp
mov r5, #0
mov r1, #0xC0
str r5, [r0]
ldr r3, =0x040000D4
lsl r1, r1, #18
ldr r2, =0x85001E00
stmia r3!, {r0, r1, r2}
sub r3, #12


And the same code in GS2, at 0x0801319C:

push {r5, lr}
mov r2, #0x80
lsl r2, r2, #19
add r2, #0xB0
ldrh r1, [r2, #10]
mov r3, #0xC5
lsl r3, r3, #8
add r3, #0xFF
and r3, r1
strh r3, [r2, #10]
mov r3, #0xFE
ldrh r1, [r2, #10]
lsl r3, r3, #7
add r3, #0xFF
and r3, r1
strh r3, [r2, #10]
sub sp, #4
ldrh r3, [r2, #10]
mov r2, #0x80
ldr r3, =0x04000204
lsl r2, r2, #7
add r2, #20
strh r2, [r3]
mov r0, sp
mov r3, #0x80
mov r5, #0
lsl r3, r3, #19
mov r1, #0xC0
str r5, [r0]
add r3, #0xD4
lsl r1, r1, #18
ldr r2, =0x85001E00
stmia r3!, {r0, r1, r2}
sub r3, #12


The actual values are the same: (0x80<<19)+0xB0 is 0x040000B0, (0xC5<<8)+0xFF is 0xC5FF, and so on. But GS2 is definitely more aggressive about synthesizing the constants! I like it.

That function also has an example of one of the "strange literals" I mentioned. GS1, the value 0x0140 is at 0x08002E90 and the ldr is at 0x08002E54. GS2, the value is at 0x08013244 and the ldr is at 0x08013204. Both of them, the literal is at the beginning of the pool even though its ldr is later in the sequence.

Looks like all the other patterns I noticed are still there: r4 is still caller-save, r0-r3 are still allocated in reverse, loads are still delayed by ~2 instructions.

Any other interesting differences?

Daddy Poi's Oily Gorillas

#3
There are still some functions that push/pop r4... such as from 08016E0C in GS2. (Relating to Save Game mechanics)

And also same for the audio section. (081C0000) ... We know Camelot uses the Sappy Engine... The code here looks kinda weird at times, though.... so it could have been compiled separately. (Kinda like how DLLs are.)
Golden Sun Docs: Broken Seal - The Lost Age - Dark Dawn | Mario Sports Docs: Mario Golf & Mario Tennis | Misc. Docs
Refer to Yoshi's Lighthouse for any M&L hacking needs...

Sometimes I like to compare apples to oranges. (Figuratively) ... They are both fruits, but which one would you eat more? (If taken literally, I'd probably choose apples.)
Maybe it is over-analyzing, but it doesn't mean the information is useless.


The only GS Discord servers with significance are:
Golden Sun Hacking Community
GS Speedrunning
/r/Golden Sun
GS United Nations
Temple of Kraden

Can you believe how small the Golden Sun Community is?

2+2=5 Don't believe me? Those are rounded decimal numbers. Take that, flat earth theorists! :)

tarpman

Quote from: Daddy Poi's Oily Gorillas on 11, January, 2021, 10:30:25 PM
There are still some functions that push/pop r4... such as from 08016E0C in GS2. (Relating to Save Game mechanics)

And also same for the audio section. (081C0000) ... We know Camelot uses the Sappy Engine... The code here looks kinda weird at times, though.... so it could have been compiled separately. (Kinda like how DLLs are.)

Yeah - thanks for bringing that up. The audio code doesn't exhibit any of the Camelot patterns, so I suspect you're right about it being built separately. Some of it could be handwritten assembly, but I don't think all of it is. I understand the sound library is third-party and shared with other games, so I wonder if it might have been provided as pre-built libraries instead of source code... I haven't been into the save/load code yet, so can't speak to that part.