Text compression

approxies · 30, October, 2014, 01:35:53 PM

Hello, I'm digging into different text compressions and found one in Golden Sun for GBA is quite interesting.
Sure, I know about Labmaster's doc and tool, but unpacking routine was just rewrited atop on asm code, and I'm trying to get the idea of compression.
Here's graph of 1 character decode. That's all pretty complicated but the part I don't understand is this one.
Hard to explain, but I'll try.
The function receives previously decoded char code. Each of this chars has some kind of path in recursive tree (which I wanted to discuss here), so actual source text bitstream just decides either to parse tree one round or emit char.
Before each tree there's codes of characters, which can be put after this previously decoded char. So one tree parse returns offset-back to appropriate character. Source text bitstream just decides if you need to parse one more time and increase this offset-back.
Now for path in tree: if you go right (1), you get out of parsing, if you get left(0), you go into recursive tree, so you'll have to meet more 1s to get out of parsing and so on. At the end of parsing you have to count how much 1's did you get and that will be offset-back.
Source bitstream can also skip some bits from the beginning of this tree path, so you can set a root in any place of this path, thus changing result offset-back.

For me all this path in recursive tree storage is overcomplicated and unoptimal (not saying, I don't understand it at all). But I'm sure there should be some simple explanation how this scheme works.
Maybe it's modification of well known algorithms or someone has seen this kind of tree in previous?

Atrius · 30, October, 2014, 07:01:16 PM

It's been a few years, and I can't seem to find my notes on it right now. Most of that sounds familiar, but I think there might be something off with that explanation, maybe I'm remembering wrong or misunderstanding you, but I think there was more than the number of ones to the offset of the returned character...

Anyway, I know basing the compression of one character based on the previous one is fairly common in text compression, but the whole recursive tree thing had me stumped too when I was working on it.

approxies · 30, October, 2014, 11:14:39 PM

Quote from: Atrius on 30, October, 2014, 07:01:16 PM
Anyway, I know basing the compression of one character based on the previous one is fairly common in text compression, but the whole recursive tree thing had me stumped too when I was working on it.

Compression based on previous character is pretty much understandable: there are characters which goes after appropriate characters with more possibilities, thus making it possible to encode them in a less bit quantity.
But all that tree thing is pretty strange. Did you finally got algorithm understanding?

approxies · 12, January, 2015, 12:52:29 PM

Hello, Mr. Atrius, sorry for necroposting.
All that time I was trying to wrap my head around compression, asked a few of my colleagues and created asked at couple other places, but no luck.
The compression scheme is still very interesting for me, but it appears I don't understand a core idea of compression.

I know it'll take you some efforts to remember details, so is it possible that you kindly share a piece of code of your golden sun tool, which is responsible for text compression?

I'm just writing sometimes some romhacking tutorials and not interested in GS hacking itself, so I promise to use that code only for educational purporses.

Atrius · 12, January, 2015, 05:51:20 PM

I really don't have a good grasp on the text compression in Golden Sun at all. As I recall my compression algorithm essentially brute forced it's way through the recursive trees since I never managed to get a full understanding of how exactly it all worked.

approxies · 13, January, 2015, 12:16:01 AM

Quote from: Atrius on 12, January, 2015, 05:51:20 PM
As I recall my compression algorithm essentially brute forced it's way through the recursive trees

Oh, I see now the situation. Thank you anyway for response. I hope to figure that out sometime.

Daddy Poi's Oily Gorillas · 13, January, 2015, 02:58:12 AM

QuoteI know it'll take you some efforts to remember details, so is it possible that you kindly share a piece of code of your golden sun tool, which is responsible for text compression?

I'm just writing sometimes some romhacking tutorials and not interested in GS hacking itself, so I promise to use that code only for educational purporses.

That makes it sound like you didn't know the GS Editor was open-sourced. Well, it is... :) (Only his last version, though.)

I don't really understand it myself, either... and what I did see a long time ago... in his code for text decompression... made it look like he copied what he saw in the assembly.

approxies · 13, January, 2015, 08:55:35 AM

Quote from: Fox on 13, January, 2015, 02:58:12 AM
That makes it sound like you didn't know the GS Editor was open-sourced. Well, it is... :) (Only his last version, though.)

Thanks for letting me know! I really didn't know this.

approxies · 24, January, 2015, 01:38:11 AM

Quote from: approxies on 30, October, 2014, 01:35:53 PM
Here's graph of 1 character decode. That's all pretty complicated but the part I don't understand is this one.
Hard to explain, but I'll try.
For me all this path in recursive tree storage is overcomplicated and unoptimal (not saying, I don't understand it at all). But I'm sure there should be some simple explanation how this scheme works.
Maybe it's modification of well known algorithms or someone has seen this kind of tree in previous?

I've figured all out once I've constructed tree and tried to crawl in it by hands. In general, iteration bitstream is a binary tree, serialized in pre-order traversal. Source stream is a plain path in that tree. Whole compression scheme is a Huffman with individual tree for each char.
Thanks to everybody, who was involved.

Atrius · 30, January, 2015, 08:07:22 PM

I had a feeling it was some form of Huffman compression, but couldn't make sense of how the trees were working. Thank you, I'm glad someone was able to figure it out.

Daddy Poi's Oily Gorillas · 30, January, 2015, 09:40:53 PM

Quotewith individual tree for each char.

Quotebut couldn't make sense of how the trees were working.

So... Does this mean that Atrius's compression format needs fixing? Or is it still okay? (I haven't actually looked into that.)

Do we even have a list of char combinations that don't work, and/or char combinations that bring the best compression/etc.?

Atrius · 30, January, 2015, 09:54:29 PM

My compression algorithm is essentially using a brute force method, meaning that it keeps trying different combinations until it finds one that works. It will most probably get the best compression ratio since it starts with small combinations and works its way up, and since it tries every combination until it overflows it should find any that actually work. There's nothing wrong with it in that regard, but it can be optimized to run faster.

Arguably, the most important thing properly understanding the format would enable is rewriting the Huffman trees to allow the use of more character combinations. As it is now, each tree only contains the other characters that are needed for the game's current dialogue. This makes it impossible to follow certain characters with certain other characters.

Daddy Poi's Oily Gorillas · 30, January, 2015, 10:04:44 PM

Yeah, you said something about brute-force in an earlier post, I just wasn't sure if it took into account every possibility.

I get the feeling the order of the characters are probably not random, and that there's a chance all strings are compared before the trees are created, but who knows? (Perhaps we can compare them with other versions of Golden Sun? Like GS1 to GS2?) If they're the same, then I guess I can count this out.

I also wonder if there were any systems where like: A debug compile may not compress data, but a Release compile might, but who knows... And this is only theoretical. (I actually doubt that it would be useful, though.)

Atrius · 30, January, 2015, 10:19:42 PM

I can't theorize about their compile settings since all we have to go off of is the release version, but I can say most definitely that the trees are not random. At the very least they're ordered specifically to achieve a good compression ratio (this is the primary purpose of Huffman trees). I can't say for certain if EVERY string in the game was analyzed to see what characters follow other characters most often, but it wouldn't surprise me. It wouldn't have been terribly computationally expensive to do compared to other things going on during a compile. I've actually written my own Huffman compression algorithm before, I remember using huge chat logs to generate the Huffman tables. It took longer to read the chat logs from a hard drive than it did to count how often certain characters follow other characters in them. Assuming a 100Mb disk read speed (A perfectly reasonable speed even for slower drives these days) it averaged to counting over 10 Million English words (~5 characters per word) each second.

Daddy Poi's Oily Gorillas · 30, January, 2015, 10:45:03 PM

QuoteI can't theorize about their compile settings since all we have to go off of is the release version

I suppose one way of artificially (fake) supporting the theory is seeing how much space is taken up in the ROM when everything is uncompressed.

QuoteIt took longer to read the chat logs from a hard drive than it did to count how often certain characters follow other characters in them.

Do you mean "read" as in loading the data to a byte array via an I/O command?

Locations of the two pointers used for decompressing...
0803842C = GS1
08060C30 = GS2

GS1:
First pointer points to 0x4E011048.
Second pointer points to 0x00C80057.

GS2:
First pointer points to 0x4C048057.
Second pointer points to 0x00E3007B.

So maybe there's a chance I'm right? :)

Atrius · 30, January, 2015, 10:54:10 PM

"read" as in the I/O operations to retrieve the data from the hard drive to RAM, yes.

As for the trees being different per game, I can guarantee 100% they had to be changed when translating the games from Japanese to English and there's no reason they shouldn't have been recalculated based on the dialogue in the game. It would have been computationally inexpensive in the grand scheme of things, probably necessary to achieve the format Golden Sun's engine uses for the Huffman trees, and the only way to get the most efficient compression ratio possible.

QuoteI suppose one way of artificially (fake) supporting the theory is seeing how much space is taken up in the ROM when everything is uncompressed.

Remember the text editor GStoolkit? It did exactly that with the text. I don't recall off the top of my head exactly how much the size differed, but I do recall that the ROM had to be expanded (Increased in file size) to fit all of the text uncompressed.

Daddy Poi's Oily Gorillas · 30, January, 2015, 11:06:33 PM

So it is as I thought. :)

GS1:
08037464 (0xED0 bytes)
08038334 (0xF8 bytes)
0803842C

GS2:
0805F914 (0x1138 bytes)
08060A4C (0x1E4 bytes)
08060C30

Anyway, it looks like the tables are different lengths between the games! Interesting, isn't it?

--

@Text editor GStoolkit: Don't forget I was also talking about other compressed data as well... See map data.. (Assuming Text Compression wouldn't be the only difference between Debug and Release.) If everything still fits in 32 MB (Or whatever size a debug emulator would be limited to.) Then all is good, I guess. When a Release version is made, you'd want the smallest catridge to save cost in the end, so...

approxies · 02, August, 2015, 02:35:11 PM

Sorry for necroposting, but I've written a text compressor/decompressor, which works OK both with GS and GS2.
Unfortunately, for some reason it compresses 5 bytes per whole script more. Which is a problem, as there is a 'alphabet' code right after text block in ROM.

As I wanted to write really efficient tool, could someone give a hand here? I will surely expalain everything I know.

Daddy Poi's Oily Gorillas · 02, August, 2015, 11:34:19 PM

Sounds nice. What do you have so far?

I managed to create a decompression routine in C# back around April/May(?) Although I do have one version of my decompression on a pastebin, this is a newer version as it stands in the project right now. (The newer one decompresses all strings instead of individual ones.)
The box below doesn't show everything, since you still need to load the rom in a byte array for the argument called src... (Using ReadAllBytes/WriteAllBytes should do fine.), and make some functions in a class I called Bits, etc... (I'm sure you can tell it is a WIP since GS2-specific addresses are in there with no compatibility to GS as long as it stays this way.)

Code Select

        byte[] decompText(byte[] src) { //, int srcInd) { // int srcPos) {
            byte[] des = new byte[0x800000]; int desEntry = 0, desPos = 0xC300;
            for (int srcI = 0; srcI < 12461; srcI++) {
                Bits.setInt32(des, desEntry, desPos - 0xC300); desEntry += 4;
                int srcInd = srcI;
                int textTree = Bits.getInt32(src, 0xA9F54 + ((srcInd >> 8) << 3)) - 0x08000000;
                int textLenAddr = Bits.getInt32(src, 0xA9F54 + ((srcInd >> 8) << 3) + 4) - 0x08000000;
                srcInd &= 0xFF;
                while (srcInd-- != 0) {
                    int cLen;
                    do {
                        cLen = src[textLenAddr++];
                        textTree += cLen;
                    } while (cLen == 0xFF);
                }
                int initChar = 0;
                int chardata = Bits.getInt32(src, 0x60C30) - 0x08000000;
                int charpntrs = Bits.getInt32(src, 0x60C34) - 0x08000000;
                textTree <<= 3;
                do {
                    int charTree = (chardata + Bits.getInt16(src, charpntrs + (initChar << 1))) << 3;
                    int charSlot = charTree - 12;
                    while (((src[charTree >> 3] >> (charTree++ & 7)) & 1) == 0) {
                        if (((src[textTree >> 3] >> (textTree++ & 7)) & 1) == 1) {
                            int depth = 0;
                            while (depth >= 0) {
                                while (((src[charTree >> 3] >> (charTree++ & 7)) & 1) == 0) {
                                    depth++;
                                }
                                charSlot -= 12;
                                depth--;
                            }
                        }
                    }
                    initChar = (Bits.getInt16(src, charSlot >> 3) >> (charSlot & 7)) & 0xFFF;
                    des[desPos++] = (byte)initChar;
                    //do {
                    // n=getNextCharacter(argument0)
                    // if n!=0 {
                    //  if (n<32 || n>ord('~')) {
                    //   if argument2
                    //   { str+='['+string(n)+']'; }
                    //   if argument3 && (n==1 || n==3) && (p<17 || p>20) && p!=26 && p!=29
                    //   { n=0 }
                    //  } else { str+=chr(n); }
                    // }
                    // p=n
                    //} until n=0
                } while (initChar != 0);
            }
            return des;
        }

As for compressing itself, it's a bit more confusing if you are generating the char tables as well. (Which should be done anyway, for best compression of this compression format.)
(After looking into some huffman compression guides back in the day, those "next chars" in the char tables really are in a particular order... if you compare it with the huffman tree listing frequency values that go from least to greatest, it should be easy enough to tell.)

My idea for compressing was:
-Have a huge table... go through all the letters and lncrement the frequency values in the table... as well as list the char combos used in another table... Both to help with looking up these frequencies (maybe), but more importantly to keep the order we want when two may have the same frequency.) (Since I think that is second priority in the tree list.)
-Somehow generate a tree using that data
-Navigate your tree and make the char tables to get inserted into the ROM
-Compress the text. (Might want to first create tables for the bit codes/bit amounts... of next char... so that the decompressed text can work as indexes to those values in those tables.)

Or at least, THAT is what I was thinking, but I never completed thee compression format itself.

-- Basically, I coded maybe the first bit of it, but then somewhere around there stopped, and haven't worked on it since. - Here is the way it stands, in its incomplete state. (Posting for information purposes.)
A little [warning], but it is quite messy./There may even be a chance I mis-coded something.
[spoiler]short[] freq = new short[0x10000]; //We need to know how often each character combination occurs to determine best bit amounts.
short[] clen = new short[0x100]; //How many chars each char has associated with it.
short[] clst = new short[0x10000]; //Char list in the order they appear in the text.
public void comptext(byte[] src) {
//Generate char tables!
//Scan text to generate frequency table.
short char1 = 0, char2 = 0;
//short[] freq = new short[0x10000]; //We need to know how often each character combination occurs to determine best bit amounts.
//short[] clen = new short[0x100]; //How many chars each char has associated with it.
//short[] clst = new short[0x10000]; //Char list in the order they appear in the text.
int srcEntry = 0;
while (Bits.getInt32(src, srcEntry) != 0) { //Set up frequency table and char list (in order displayed in text.)
int srcPos = 0xC300 + Bits.getInt32(src, srcEntry);
do {
char2 = src[srcPos++];
if (freq[char1 * 0x100 + char2]++ == 0) {
clst[char1 * 0x100 + clen[char1]++] = char2; //clen[char1]++;// += 1;
}
//freq[char1 * 0x100 + char2] += 1;
char1 = char2;
} while (char1 != 0); //Change to while < textArray?-- No, go through offset list instead.
srcEntry += 4;
}
//Make list that points from least frequent char to most frequent char.
int[] cltg = new int[0x10000]; //short[] cltg = new short[0x10000]; //Chars least to greatest - Swap to 0x200(?)(Esp. if contains node info) bytes and use same memory for all chars. (Since don't need them simutaneously.)
//for (int charScan = 0; charScan < 0x100; charScan++) {
for (int j = 0; j < 0x0100; j++) {
if (clen[j] <= 1) { continue; }
int wgtMin = 1, wgt1 = 0xffff, wgt2 = 0xffff;//, wfInd = 0;
wgt1 = getWgt(j, ref wgtMin);
wgt2 = getWgt(j, ref wgtMin);
int ApdInd = 0, nodeInd = 0, nodeFreq = 0;
//Auto: wgt1 <= wgt2 ; wgts ? nodeFreqs ; NodeFreq <= NodeFreq2
cltg[ApdInd++] = wgt1;
cltg[ApdInd++] = wgt2;
wgt1 = getWgt(j, ref wgtMin);
wgt2 = getWgt(j, ref wgtMin);
//nodeInd = 0;
nodeFreq = cltg[nodeInd++] + cltg[nodeInd++];
int nodeFreq2 = 0xffff; // cltg[nodeInd++] + cltg[nodeInd++];
for (int loops = 0; loops < clen[j] - 3; loops++) { //-1=default + -2=already put in 2 freqs.
if (wgt2 <= nodeFreq) {
cltg[ApdInd++] = wgt1;
cltg[ApdInd++] = wgt2;
wgt1 = getWgt(j, ref wgtMin);
wgt2 = getWgt(j, ref wgtMin);
if (nodeFreq2 == 0xFFFF) {
nodeFreq2 = cltg[nodeInd++] + cltg[nodeInd++];
}
} else if (wgt1 <= nodeFreq) {
cltg[ApdInd++] = wgt1;
cltg[ApdInd++] = nodeFreq;
wgt1 = wgt2; wgt2 = getWgt(j, ref wgtMin);
if (nodeFreq2 == 0xFFFF) {
nodeFreq = cltg[nodeInd++] + cltg[nodeInd++];
} else {
nodeFreq = nodeFreq2;
nodeFreq2 = cltg[nodeInd++] + cltg[nodeInd++];
}
} else if (nodeFreq2 < wgt1) {
cltg[ApdInd++] = nodeFreq;
cltg[ApdInd++] = nodeFreq2;
//if (nodeFreq2 != 0xFFFF) { nodeFreq = nodeFreq2; }
nodeFreq = cltg[nodeInd++] + cltg[nodeInd++];
if (nodeInd < ApdInd) {
nodeFreq2 = cltg[nodeInd++] + cltg[nodeInd++];
} else {
nodeFreq2 = 0xFFFF;
}
//nodeFreq2 = cltg[nodeInd++] + cltg[nodeInd++];
} else if (nodeFreq < wgt1) { //(nodeFreq < wgt1)
cltg[ApdInd++] = nodeFreq;
cltg[ApdInd++] = wgt1;
if (nodeFreq2 == 0xFFFF) {
nodeFreq = cltg[nodeInd++] + cltg[nodeInd++];
} else {
nodeFreq = nodeFreq2;
nodeFreq2 = cltg[nodeInd++] + cltg[nodeInd++];
}
wgt1 = wgt2; wgt2 = getWgt(j, ref wgtMin);
}
//if (nodeFreq2 == 0xFFFF) {
// nodeFreq2 = cltg[nodeInd++] + cltg[nodeInd++];
//}
} //end for clen
//if (nodeFreq2 != 0xFFFF) { nodeFreq = nodeFreq2; }
}
//int first = 0, last = 0, prev = 0, next = 0, i = 0;
//int len = 1;
//while (i != last) { //Note: When a number points to itself, it is the largest?
// int curfreq = freq[charScan * 0x100 + clst[charScan * 0x100 + i]];
// if (curfreq < freq[charScan * 0x100 + clst[charScan * 0x100 + prev]]) {

// } else if (curfreq < freq[charScan * 0x100 + clst[charScan * 0x100 + next]]) {

// }
//}
//}

//Node table with weight/freq values and pointer info....

//Make Char tables to be put in ROM.

//Compress text.
}
private int getWgt(int initChar, ref int wgtMin) {
int theWgt = 0xFFFF, wfInd = 0;
for (int i = 0; i < clen[initChar]; i++) {
int aWgt = freq[initChar * 0x100 + clst[initChar * 0x100 + i]];
if (wgtMin <= aWgt) {
if (aWgt < theWgt) {
theWgt = aWgt;
wfInd = initChar * 0x100 + clst[initChar * 0x100 + i];
}
}
}
wgtMin = theWgt;
freq[wfInd] = 0; //Temp?
return theWgt;
}[/spoiler]

approxies · 03, August, 2015, 02:23:48 PM

Due to compression logic complexity, I've written my tool in Haskell. You can find it's core here
Up to the line 159 goes decompression code, after line 168 goes compression. The code is commented all right, but general idea for compression is:
1.Build histogram of all chars, which goes after char 0 and build huffman tree out of it with classic merging nodes procedure
2.Build histogram of all chars, which goes after char 1 and build huffman tree out of it with classic merging nodes procedure
...
X. Build histogram of all chars which goes after char FF and build huffman tree out of it with classic merging nodes procedure
Now we have 0x100 huffman trees for each character in script - compress each char by corresponding Huffman tree.
Then do the serialization and dump binaries for pasting these blocks in ROM.

Generally, I can write a fully functional tool in couple of hours: it will be possible to decompress any message or whole script into binary or ASCII; or compress script into binaries ready to be inserted in ROM.
The problem stays the same: compressed text blocks are 5 bytes larger than original compressed block. Trees and pointer blocks are same in size. Sometimes tree varies from original state, but overall compression efficiency should stay the same. The fact that it don't: sometimes my messages are 1 byte shorter, but +5 times they are 1 byte longer. So the main problem is how can I track down why game's compression excels mine.

News:

Text compression

approxies

Atrius (He/Him)

approxies

approxies

Atrius (He/Him)

approxies

Daddy Poi's Oily Gorillas

approxies

approxies

Atrius (He/Him)

Daddy Poi's Oily Gorillas

Atrius (He/Him)

Daddy Poi's Oily Gorillas

Atrius (He/Him)

Daddy Poi's Oily Gorillas

Atrius (He/Him)

Daddy Poi's Oily Gorillas

approxies

Daddy Poi's Oily Gorillas

approxies