News:

The forum has been updated to SMF (2.1.3)!
Please be patient as we work to polish up the place and update features as we can.

Main Menu

Text compression

Started by approxies, 30, October, 2014, 01:35:53 PM

Previous topic - Next topic

0 Members and 1 Guest are viewing this topic.

approxies

Hello, I'm digging into different text compressions and found one in Golden Sun for GBA is quite interesting.
Sure, I know about Labmaster's doc and tool, but unpacking routine was just rewrited atop on asm code, and I'm trying to get the idea of compression.
Here's graph of 1 character decode. That's all pretty complicated but the part I don't understand is this one.
Hard to explain, but I'll try.
The function receives previously decoded char code. Each of this chars has some kind of path in recursive tree (which I wanted to discuss here), so actual source text bitstream just decides either to parse tree one round or emit char.
Before each tree there's codes of characters, which can be put after this previously decoded char. So one tree parse returns offset-back to appropriate character. Source text bitstream just decides if you need to parse one more time and increase this offset-back.
Now for path in tree: if you go right (1), you get out of parsing, if you get left(0), you go into recursive tree, so you'll have to meet more 1s to get out of parsing and so on. At the end of parsing you have to count how much 1's did you get  and that will be offset-back.
Source bitstream can also skip some bits from the beginning of this tree path, so you can set a root in any place of this path, thus changing result offset-back.

For me all this path in recursive tree storage is overcomplicated and unoptimal (not saying, I don't understand it at all). But I'm sure there should be some simple explanation how this scheme works.
Maybe it's modification of well known algorithms or someone has seen this kind of tree in previous?

Atrius (He/Him)

It's been a few years, and I can't seem to find my notes on it right now.  Most of that sounds familiar, but I think there might be something off with that explanation, maybe I'm remembering wrong or misunderstanding you, but I think there was more than the number of ones to the offset of the returned character...

Anyway, I know basing the compression of one character based on the previous one is fairly common in text compression, but the whole recursive tree thing had me stumped too when I was working on it.
[sprite=220,4,0]I'm shaking my head in general disapproval of everything[/sprite]

approxies

Quote from: Atrius on 30, October, 2014, 07:01:16 PM
Anyway, I know basing the compression of one character based on the previous one is fairly common in text compression, but the whole recursive tree thing had me stumped too when I was working on it.
Compression based on previous character is pretty much understandable: there are characters which goes after appropriate characters with more possibilities, thus making it possible to encode them in a less bit quantity.
But all that tree thing is pretty strange. Did you finally got algorithm understanding?

approxies

#3
Hello, Mr. Atrius, sorry for necroposting.
All that time I was trying to wrap my head around compression, asked a few of my colleagues and created asked at couple other places, but no luck.
The compression scheme is still very interesting for me, but it appears I don't understand a core idea of compression.

I know it'll take you some efforts to remember details, so is it possible that you kindly share a piece of code of your golden sun tool, which is responsible for text compression?

I'm just writing sometimes some romhacking tutorials and not interested in GS hacking itself, so I promise to use that code only for educational purporses.

Atrius (He/Him)

I really don't have a good grasp on the text compression in Golden Sun at all.  As I recall my compression algorithm essentially brute forced it's way through the recursive trees since I never managed to get a full understanding of how exactly it all worked.
[sprite=220,4,0]I'm shaking my head in general disapproval of everything[/sprite]

approxies

Quote from: Atrius on 12, January, 2015, 05:51:20 PM
As I recall my compression algorithm essentially brute forced it's way through the recursive trees

Oh, I see now the situation. Thank you anyway for response. I hope to figure that out sometime.

Daddy Poi's Oily Gorillas

#6
QuoteI know it'll take you some efforts to remember details, so is it possible that you kindly share a piece of code of your golden sun tool, which is responsible for text compression?

I'm just writing sometimes some romhacking tutorials and not interested in GS hacking itself, so I promise to use that code only for educational purporses.
That makes it sound like you didn't know the GS Editor was open-sourced. Well, it is... :) (Only his last version, though.)

I don't really understand it myself, either... and what I did see a long time ago... in his code for text decompression... made it look like he copied what he saw in the assembly.
Golden Sun Docs: Broken Seal - The Lost Age - Dark Dawn | Mario Sports Docs: Mario Golf & Mario Tennis | Misc. Docs
Refer to Yoshi's Lighthouse for any M&L hacking needs...

Sometimes I like to compare apples to oranges. (Figuratively) ... They are both fruits, but which one would you eat more? (If taken literally, I'd probably choose apples.)
Maybe it is over-analyzing, but it doesn't mean the information is useless.


The only GS Discord servers with significance are:
Golden Sun Hacking Community
GS Speedrunning
/r/Golden Sun
GS United Nations
Temple of Kraden

Can you believe how small the Golden Sun Community is?

2+2=5 Don't believe me? Those are rounded decimal numbers. Take that, flat earth theorists! :)

approxies

Quote from: Fox on 13, January, 2015, 02:58:12 AM
That makes it sound like you didn't know the GS Editor was open-sourced. Well, it is... :) (Only his last version, though.)
Thanks for letting me know! I really didn't know this.

approxies

Quote from: approxies on 30, October, 2014, 01:35:53 PM
Here's graph of 1 character decode. That's all pretty complicated but the part I don't understand is this one.
Hard to explain, but I'll try.
For me all this path in recursive tree storage is overcomplicated and unoptimal (not saying, I don't understand it at all). But I'm sure there should be some simple explanation how this scheme works.
Maybe it's modification of well known algorithms or someone has seen this kind of tree in previous?

I've figured all out once I've constructed tree and tried to crawl in it by hands. In general, iteration bitstream is a binary tree, serialized in pre-order traversal. Source stream is a plain path in that tree. Whole compression scheme is a Huffman with individual tree for each char.
Thanks to everybody, who was involved.

Atrius (He/Him)

I had a feeling it was some form of Huffman compression, but couldn't make sense of how the trees were working.  Thank you, I'm glad someone was able to figure it out.
[sprite=220,4,0]I'm shaking my head in general disapproval of everything[/sprite]

Daddy Poi's Oily Gorillas

Quotewith individual tree for each char.

Quotebut couldn't make sense of how the trees were working.
So... Does this mean that Atrius's compression format needs fixing? Or is it still okay? (I haven't actually looked into that.)

Do we even have a list of char combinations that don't work, and/or char combinations that bring the best compression/etc.?
Golden Sun Docs: Broken Seal - The Lost Age - Dark Dawn | Mario Sports Docs: Mario Golf & Mario Tennis | Misc. Docs
Refer to Yoshi's Lighthouse for any M&L hacking needs...

Sometimes I like to compare apples to oranges. (Figuratively) ... They are both fruits, but which one would you eat more? (If taken literally, I'd probably choose apples.)
Maybe it is over-analyzing, but it doesn't mean the information is useless.


The only GS Discord servers with significance are:
Golden Sun Hacking Community
GS Speedrunning
/r/Golden Sun
GS United Nations
Temple of Kraden

Can you believe how small the Golden Sun Community is?

2+2=5 Don't believe me? Those are rounded decimal numbers. Take that, flat earth theorists! :)

Atrius (He/Him)

#11
My compression algorithm is essentially using a brute force method, meaning that it keeps trying different combinations until it finds one that works.  It will most probably get the best compression ratio since it starts with small combinations and works its way up, and since it tries every combination until it overflows it should find any that actually work.  There's nothing wrong with it in that regard, but it can be optimized to run faster.

Arguably, the most important thing properly understanding the format would enable is rewriting the Huffman trees to allow the use of more character combinations.  As it is now, each tree only contains the other characters that are needed for the game's current dialogue.  This makes it impossible to follow certain characters with certain other characters.
[sprite=220,4,0]I'm shaking my head in general disapproval of everything[/sprite]

Daddy Poi's Oily Gorillas

#12
Yeah, you said something about brute-force in an earlier post, I just wasn't sure if it took into account every possibility.

I get the feeling the order of the characters are probably not random, and that there's a chance all strings are compared before the trees are created, but who knows? (Perhaps we can compare them with other versions of Golden Sun? Like GS1 to GS2?) If they're the same, then I guess I can count this out.

I also wonder if there were any systems where like: A debug compile may not compress data, but a Release compile might, but who knows... And this is only theoretical. (I actually doubt that it would be useful, though.)
Golden Sun Docs: Broken Seal - The Lost Age - Dark Dawn | Mario Sports Docs: Mario Golf & Mario Tennis | Misc. Docs
Refer to Yoshi's Lighthouse for any M&L hacking needs...

Sometimes I like to compare apples to oranges. (Figuratively) ... They are both fruits, but which one would you eat more? (If taken literally, I'd probably choose apples.)
Maybe it is over-analyzing, but it doesn't mean the information is useless.


The only GS Discord servers with significance are:
Golden Sun Hacking Community
GS Speedrunning
/r/Golden Sun
GS United Nations
Temple of Kraden

Can you believe how small the Golden Sun Community is?

2+2=5 Don't believe me? Those are rounded decimal numbers. Take that, flat earth theorists! :)

Atrius (He/Him)

#13
I can't theorize about their compile settings since all we have to go off of is the release version, but I can say most definitely that the trees are not random.  At the very least they're ordered specifically to achieve a good compression ratio (this is the primary purpose of Huffman trees).  I can't say for certain if EVERY string in the game was analyzed to see what characters follow other characters most often, but it wouldn't surprise me.  It wouldn't have been terribly computationally expensive to do compared to other things going on during a compile.  I've actually written my own Huffman compression algorithm before, I remember using huge chat logs to generate the Huffman tables.  It took longer to read the chat logs from a hard drive than it did to count how often certain characters follow other characters in them.  Assuming a 100Mb disk read speed (A perfectly reasonable speed even for slower drives these days) it averaged to counting over 10 Million English words (~5 characters per word) each second.
[sprite=220,4,0]I'm shaking my head in general disapproval of everything[/sprite]

Daddy Poi's Oily Gorillas

QuoteI can't theorize about their compile settings since all we have to go off of is the release version
I suppose one way of artificially (fake) supporting the theory is seeing how much space is taken up in the ROM when everything is uncompressed.

QuoteIt took longer to read the chat logs from a hard drive than it did to count how often certain characters follow other characters in them.
Do you mean "read" as in loading the data to a byte array via an I/O command?

Locations of the two pointers used for decompressing...
0803842C = GS1
08060C30 = GS2

GS1:
First pointer points to 0x4E011048.
Second pointer points to 0x00C80057.

GS2:
First pointer points to 0x4C048057.
Second pointer points to 0x00E3007B.

So maybe there's a chance I'm right? :)
Golden Sun Docs: Broken Seal - The Lost Age - Dark Dawn | Mario Sports Docs: Mario Golf & Mario Tennis | Misc. Docs
Refer to Yoshi's Lighthouse for any M&L hacking needs...

Sometimes I like to compare apples to oranges. (Figuratively) ... They are both fruits, but which one would you eat more? (If taken literally, I'd probably choose apples.)
Maybe it is over-analyzing, but it doesn't mean the information is useless.


The only GS Discord servers with significance are:
Golden Sun Hacking Community
GS Speedrunning
/r/Golden Sun
GS United Nations
Temple of Kraden

Can you believe how small the Golden Sun Community is?

2+2=5 Don't believe me? Those are rounded decimal numbers. Take that, flat earth theorists! :)

Atrius (He/Him)

#15
"read" as in the I/O operations to retrieve the data from the hard drive to RAM, yes.

As for the trees being different per game, I can guarantee 100% they had to be changed when translating the games from Japanese to English and there's no reason they shouldn't have been recalculated based on the dialogue in the game.  It would have been computationally inexpensive in the grand scheme of things, probably necessary to achieve the format Golden Sun's engine uses for the Huffman trees, and the only way to get the most efficient compression ratio possible.


QuoteI suppose one way of artificially (fake) supporting the theory is seeing how much space is taken up in the ROM when everything is uncompressed.
Remember the text editor GStoolkit?  It did exactly that with the text.  I don't recall off the top of my head exactly how much the size differed, but I do recall that the ROM had to be expanded (Increased in file size) to fit all of the text uncompressed.
[sprite=220,4,0]I'm shaking my head in general disapproval of everything[/sprite]

Daddy Poi's Oily Gorillas

#16
So it is as I thought. :)

GS1:
08037464 (0xED0 bytes)
08038334 (0xF8 bytes)
0803842C

GS2:
0805F914 (0x1138 bytes)
08060A4C (0x1E4 bytes)
08060C30

Anyway, it looks like the tables are different lengths between the games!  Interesting, isn't it?


--

@Text editor GStoolkit: Don't forget I was also talking about other compressed data as well... See map data.. (Assuming Text Compression wouldn't be the only difference between Debug and Release.) If everything still fits in 32 MB (Or whatever size a debug emulator would be limited to.) Then all is good, I guess. When a Release version is made, you'd want the smallest catridge to save cost in the end, so...
Golden Sun Docs: Broken Seal - The Lost Age - Dark Dawn | Mario Sports Docs: Mario Golf & Mario Tennis | Misc. Docs
Refer to Yoshi's Lighthouse for any M&L hacking needs...

Sometimes I like to compare apples to oranges. (Figuratively) ... They are both fruits, but which one would you eat more? (If taken literally, I'd probably choose apples.)
Maybe it is over-analyzing, but it doesn't mean the information is useless.


The only GS Discord servers with significance are:
Golden Sun Hacking Community
GS Speedrunning
/r/Golden Sun
GS United Nations
Temple of Kraden

Can you believe how small the Golden Sun Community is?

2+2=5 Don't believe me? Those are rounded decimal numbers. Take that, flat earth theorists! :)

approxies

Sorry for necroposting, but I've written a text compressor/decompressor, which works OK both with GS and GS2.
Unfortunately, for some reason it compresses 5 bytes per whole script more. Which is a problem, as there is a 'alphabet' code right after text block in ROM.

As I wanted to write really efficient tool, could someone give a hand here? I will surely expalain everything I know.

Daddy Poi's Oily Gorillas

#18
Sounds nice. What do you have so far?

I managed to create a decompression routine in C# back around April/May(?) Although I do have one version of my decompression on a pastebin, this is a newer version as it stands in the project right now. (The newer one decompresses all strings instead of individual ones.)
The box below doesn't show everything, since you still need to load the rom in a byte array for the argument called src... (Using ReadAllBytes/WriteAllBytes should do fine.), and make some functions in a class I called Bits, etc... (I'm sure you can tell it is a WIP since GS2-specific addresses are in there with no compatibility to GS as long as it stays this way.)
        byte[] decompText(byte[] src) { //, int srcInd) { // int srcPos) {
            byte[] des = new byte[0x800000]; int desEntry = 0, desPos = 0xC300;
            for (int srcI = 0; srcI < 12461; srcI++) {
                Bits.setInt32(des, desEntry, desPos - 0xC300); desEntry += 4;
                int srcInd = srcI;
                int textTree = Bits.getInt32(src, 0xA9F54 + ((srcInd >> 8) << 3)) - 0x08000000;
                int textLenAddr = Bits.getInt32(src, 0xA9F54 + ((srcInd >> 8) << 3) + 4) - 0x08000000;
                srcInd &= 0xFF;
                while (srcInd-- != 0) {
                    int cLen;
                    do {
                        cLen = src[textLenAddr++];
                        textTree += cLen;
                    } while (cLen == 0xFF);
                }
                int initChar = 0;
                int chardata = Bits.getInt32(src, 0x60C30) - 0x08000000;
                int charpntrs = Bits.getInt32(src, 0x60C34) - 0x08000000;
                textTree <<= 3;
                do {
                    int charTree = (chardata + Bits.getInt16(src, charpntrs + (initChar << 1))) << 3;
                    int charSlot = charTree - 12;
                    while (((src[charTree >> 3] >> (charTree++ & 7)) & 1) == 0) {
                        if (((src[textTree >> 3] >> (textTree++ & 7)) & 1) == 1) {
                            int depth = 0;
                            while (depth >= 0) {
                                while (((src[charTree >> 3] >> (charTree++ & 7)) & 1) == 0) {
                                    depth++;
                                }
                                charSlot -= 12;
                                depth--;
                            }
                        }
                    }
                    initChar = (Bits.getInt16(src, charSlot >> 3) >> (charSlot & 7)) & 0xFFF;
                    des[desPos++] = (byte)initChar;
                    //do {
                    // n=getNextCharacter(argument0)
                    // if n!=0 {
                    //  if (n<32 || n>ord('~')) {
                    //   if argument2
                    //   { str+='['+string(n)+']'; }
                    //   if argument3 && (n==1 || n==3) && (p<17 || p>20) && p!=26 && p!=29
                    //   { n=0 }
                    //  } else { str+=chr(n); }
                    // }
                    // p=n
                    //} until n=0
                } while (initChar != 0);
            }
            return des;
        }



As for compressing itself, it's a bit more confusing if you are generating the char tables as well. (Which should be done anyway, for best compression of this compression format.)
(After looking into some huffman compression guides back in the day, those "next chars" in the char tables really are in a particular order... if you compare it with the huffman tree listing frequency values that go from least to greatest, it should be easy enough to tell.)

My idea for compressing was:
-Have a huge table... go through all the letters and lncrement the frequency values in the table... as well as list the char combos used in another table... Both to help with looking up these frequencies (maybe), but more importantly to keep the order we want when two may have the same frequency.) (Since I think that is second priority in the tree list.)
-Somehow generate a tree using that data
-Navigate your tree and make the char tables to get inserted into the ROM
-Compress the text. (Might want to first create tables for the bit codes/bit amounts... of next char... so that the decompressed text can work as indexes to those values in those tables.)

Or at least, THAT is what I was thinking, but I never completed thee compression format itself.

-- Basically, I coded maybe the first bit of it, but then somewhere around there stopped, and haven't worked on it since. - Here is the way it stands, in its incomplete state. (Posting for information purposes.)
A little [warning], but it is quite messy./There may even be a chance I mis-coded something.
[spoiler]short[] freq = new short[0x10000]; //We need to know how often each character combination occurs to determine best bit amounts.
        short[] clen = new short[0x100]; //How many chars each char has associated with it.
        short[] clst = new short[0x10000]; //Char list in the order they appear in the text.
        public void comptext(byte[] src) {
            //Generate char tables!
            //Scan text to generate frequency table.
            short char1 = 0, char2 = 0;
            //short[] freq = new short[0x10000]; //We need to know how often each character combination occurs to determine best bit amounts.
            //short[] clen = new short[0x100]; //How many chars each char has associated with it.
            //short[] clst = new short[0x10000]; //Char list in the order they appear in the text.
            int srcEntry = 0;
            while (Bits.getInt32(src, srcEntry) != 0) { //Set up frequency table and char list (in order displayed in text.)
                int srcPos = 0xC300 + Bits.getInt32(src, srcEntry);
                do {
                    char2 = src[srcPos++];
                    if (freq[char1 * 0x100 + char2]++ == 0) {
                        clst[char1 * 0x100 + clen[char1]++] = char2; //clen[char1]++;// += 1;
                    }
                    //freq[char1 * 0x100 + char2] += 1;
                    char1 = char2;
                } while (char1 != 0); //Change to while < textArray?-- No, go through offset list instead.
                srcEntry += 4;
            }
            //Make list that points from least frequent char to most frequent char.
            int[] cltg = new int[0x10000]; //short[] cltg = new short[0x10000]; //Chars least to greatest - Swap to 0x200(?)(Esp. if contains node info) bytes and use same memory for all chars. (Since don't need them simutaneously.)
            //for (int charScan = 0; charScan < 0x100; charScan++) {
                for (int j = 0; j < 0x0100; j++) {
                    if (clen[j] <= 1) { continue; }
                    int wgtMin = 1, wgt1 = 0xffff, wgt2 = 0xffff;//, wfInd = 0;
                    wgt1 = getWgt(j, ref wgtMin);
                    wgt2 = getWgt(j, ref wgtMin);
                    int ApdInd = 0, nodeInd = 0, nodeFreq = 0;
                    //Auto:  wgt1 <= wgt2 ; wgts ? nodeFreqs ; NodeFreq <= NodeFreq2
                    cltg[ApdInd++] = wgt1;
                    cltg[ApdInd++] = wgt2;
                    wgt1 = getWgt(j, ref wgtMin);
                    wgt2 = getWgt(j, ref wgtMin);
                    //nodeInd = 0;
                    nodeFreq = cltg[nodeInd++] + cltg[nodeInd++];
                    int nodeFreq2 = 0xffff; // cltg[nodeInd++] + cltg[nodeInd++];
                    for (int loops = 0; loops < clen[j] - 3; loops++) { //-1=default + -2=already put in 2 freqs.
                    if (wgt2 <= nodeFreq) {
                        cltg[ApdInd++] = wgt1;
                        cltg[ApdInd++] = wgt2;
                        wgt1 = getWgt(j, ref wgtMin);
                        wgt2 = getWgt(j, ref wgtMin);
                        if (nodeFreq2 == 0xFFFF) {
                            nodeFreq2 = cltg[nodeInd++] + cltg[nodeInd++];
                        }
                    } else if (wgt1 <= nodeFreq) {
                        cltg[ApdInd++] = wgt1;
                        cltg[ApdInd++] = nodeFreq;
                        wgt1 = wgt2; wgt2 = getWgt(j, ref wgtMin);
                        if (nodeFreq2 == 0xFFFF) {
                            nodeFreq = cltg[nodeInd++] + cltg[nodeInd++];
                        } else {
                            nodeFreq = nodeFreq2;
                            nodeFreq2 = cltg[nodeInd++] + cltg[nodeInd++];
                        }
                    } else if (nodeFreq2 < wgt1) {
                        cltg[ApdInd++] = nodeFreq;
                        cltg[ApdInd++] = nodeFreq2;
                        //if (nodeFreq2 != 0xFFFF) { nodeFreq = nodeFreq2; }
                        nodeFreq = cltg[nodeInd++] + cltg[nodeInd++];
                        if (nodeInd < ApdInd) {
                            nodeFreq2 = cltg[nodeInd++] + cltg[nodeInd++];
                        } else {
                            nodeFreq2 = 0xFFFF;
                        }
                        //nodeFreq2 = cltg[nodeInd++] + cltg[nodeInd++];
                    } else if (nodeFreq < wgt1) { //(nodeFreq < wgt1)
                        cltg[ApdInd++] = nodeFreq;
                        cltg[ApdInd++] = wgt1;
                        if (nodeFreq2 == 0xFFFF) {
                            nodeFreq = cltg[nodeInd++] + cltg[nodeInd++];
                        } else {
                            nodeFreq = nodeFreq2;
                            nodeFreq2 = cltg[nodeInd++] + cltg[nodeInd++];
                        }
                        wgt1 = wgt2; wgt2 = getWgt(j, ref wgtMin);
                    }
                    //if (nodeFreq2 == 0xFFFF) {
                    //    nodeFreq2 = cltg[nodeInd++] + cltg[nodeInd++];
                    //}
                    } //end for clen
                    //if (nodeFreq2 != 0xFFFF) { nodeFreq = nodeFreq2; }
                }
                //int first = 0, last = 0, prev = 0, next = 0, i = 0;
                //int len = 1;
                //while (i != last) { //Note: When a number points to itself, it is the largest?
                //    int curfreq = freq[charScan * 0x100 + clst[charScan * 0x100 + i]];
                //    if (curfreq < freq[charScan * 0x100 + clst[charScan * 0x100 + prev]]) {

                //    } else if (curfreq < freq[charScan * 0x100 + clst[charScan * 0x100 + next]]) {

                //    }
                //}
            //}

            //Node table with weight/freq values and pointer info....

            //Make Char tables to be put in ROM.

            //Compress text.
        }
        private int getWgt(int initChar, ref int wgtMin) {
            int theWgt = 0xFFFF, wfInd = 0;
            for (int i = 0; i < clen[initChar]; i++) {
                int aWgt = freq[initChar * 0x100 + clst[initChar * 0x100 + i]];
                if (wgtMin <= aWgt) {
                    if (aWgt < theWgt) {
                        theWgt = aWgt;
                        wfInd = initChar * 0x100 + clst[initChar * 0x100 + i];
                    }
                }
            }
            wgtMin = theWgt;
            freq[wfInd] = 0; //Temp?
            return theWgt;
        }[/spoiler]
Golden Sun Docs: Broken Seal - The Lost Age - Dark Dawn | Mario Sports Docs: Mario Golf & Mario Tennis | Misc. Docs
Refer to Yoshi's Lighthouse for any M&L hacking needs...

Sometimes I like to compare apples to oranges. (Figuratively) ... They are both fruits, but which one would you eat more? (If taken literally, I'd probably choose apples.)
Maybe it is over-analyzing, but it doesn't mean the information is useless.


The only GS Discord servers with significance are:
Golden Sun Hacking Community
GS Speedrunning
/r/Golden Sun
GS United Nations
Temple of Kraden

Can you believe how small the Golden Sun Community is?

2+2=5 Don't believe me? Those are rounded decimal numbers. Take that, flat earth theorists! :)

approxies

Due to compression logic complexity, I've written my tool in Haskell. You can find it's core here
Up to the line 159 goes decompression code, after line 168 goes compression. The code is commented all right, but general idea for compression is:
1.Build histogram of all chars, which goes after char 0 and build huffman tree out of it with classic merging nodes procedure
2.Build histogram of all chars, which goes after char 1 and build huffman tree out of it with classic merging nodes procedure
...
X. Build histogram of all chars which goes after char FF and build huffman tree out of it with classic merging nodes procedure
Now we have 0x100 huffman trees for each character in script - compress each char by corresponding Huffman tree.
Then do the serialization and dump binaries for pasting these blocks in ROM.

Generally, I can write a fully functional tool in couple of hours: it will be possible to decompress any message or whole script into binary or ASCII; or compress script into binaries ready to be inserted in ROM.
The problem stays the same: compressed text blocks are 5 bytes larger than original compressed block. Trees and pointer blocks are same in size. Sometimes tree varies from original state, but overall compression efficiency should stay the same. The fact that it don't: sometimes my messages are 1 byte shorter, but +5 times they are 1 byte longer. So the main problem is how can I track down why game's compression excels mine.

Daddy Poi's Oily Gorillas

#20
Okay. I just want to say that I did finish the compression in C# enough to make my own compression possible... and seems to match with GS's original compression... (I might still make some more minor changes to it, though...) (See mine here http://pastebin.com/WNTPhv2t , and it also puts the data in a ROM. The end-goal is to have the new compressed data put in the same spot as the original data, though.... Which requires a code scanner to detect all the offsets and modify them accordingly... Not that difficult.) (My src data is supposed to be formatted the same way GSToolkit puts the uncompressed data into the ROM, there is a possibility in my changing that in the future, though. At least for compatibility to add more lines of text... when I get there, maybe.. - Speaking of that, I would also like to try an experiment that finds all text indexes in the code... but not sure how well that will go.)
It takes maybe around 0.02~0.03 seconds to compress all text, which is about ~10 times faster than my decompression routine. (Which I plan to optimize later by generating some tables from the char data before decompressing.)

Anyway, I did look at your Haskell program, but unfortunately, I'm pretty new with it, so I might need to do some research... Did you ever find the problem though? (Probably not, since your github wasn't updated?) And how long does it take your program to compress?


---
For my editor, I can see the possibility of having checks to detecting if ROM was modified by:
- Atrius's editor (If master file table doesn't start with 08000000)
  - Since all the text would get decompressed on loading the ROM in the editor, I wouldn't need to fix that area, but I would need to scan for edited sprites, and the Innate Psynergies.
- GSToolkit. (Do checks on the code it inserts.)
  - So we can load the text as already uncompressed... and patch the code to re-allow compression.
- I am not sure, but I might review your Haskell program some time to see if I could make my editor compatible with ROMs modified with it. (My guess is that I wouldn't need to do anything.)
Golden Sun Docs: Broken Seal - The Lost Age - Dark Dawn | Mario Sports Docs: Mario Golf & Mario Tennis | Misc. Docs
Refer to Yoshi's Lighthouse for any M&L hacking needs...

Sometimes I like to compare apples to oranges. (Figuratively) ... They are both fruits, but which one would you eat more? (If taken literally, I'd probably choose apples.)
Maybe it is over-analyzing, but it doesn't mean the information is useless.


The only GS Discord servers with significance are:
Golden Sun Hacking Community
GS Speedrunning
/r/Golden Sun
GS United Nations
Temple of Kraden

Can you believe how small the Golden Sun Community is?

2+2=5 Don't believe me? Those are rounded decimal numbers. Take that, flat earth theorists! :)

approxies

Quote from: Fox on 11, August, 2015, 06:34:20 AM

Anyway, I did look at your Haskell program, but unfortunately, I'm pretty new with it, so I might need to do some research... Did you ever find the problem though? (Probably not, since your github wasn't updated?) And how long does it take your program to compress?


I have found the issue - my compression works better now for a dosen of bytes, than original. The problem was in a way I've put nodes and leafs with the same weights: there are 4 ways to organize trees with the same efficiency. The trick was that each messages is stored with 8-bits alignment (as pointers are in bytes, of course). So, some of ways for tree build is more effective for exact this script, and some are not. I just found the way, which gave me even better result, than Atlus had.

Now I'm writing interface code for compression, I don't have much time, so hope to finish it this week.

For speed, Haskell is pretty slow: on my 1.6 GHz it takes about 3-4 seconds to compress the script, it also takes much more, than your code to decompress. That all due to code nature: it doesn't work with byte arrays or something: it parses binary to higher algebraic abstractions for programmer's conveniency and then takes a bunch of time to manipulate these items. It's very visual, easy to debug and error-proof, but pretty slow. I don't plan to optimize it - it will hurt code clarity.

Daddy Poi's Oily Gorillas

#22
Quoteon my 1.6 GHz
Mine is:
Processor: Intel(R) Core (TM) i5-4200U CPU @ 1.60GHz  2.30 GHz

I also rewritten my decompression routine... It's around ~0.1 secs now. (Rounding figure... it may be somewhat lower.) -As mentioned previously, the difference is that this time I scanned the char tables to generate bitcode data.

(By the way, with Visual Studio, there are two Build Modes.... Release Build works faster than Debug Builds.)

In the original GS2 game, the largeest bitcode is 16 bits... since my newer code has a limit on how long bitcodes can be, I might keep both versions of the decompression for now... Maybe
        public byte[] decompText(byte[] src) { //, int srcInd) { // int srcPos) {
            DateTime c = DateTime.Now;
            int[] bitcode = new int[0x10000];
            byte[] bitLen = new byte[0x10000];
            short[] bitChar = new short[0x10000];
            //Scan char data to generate data for faster decompression than old method.
            int asmpchar = Bits.getInt32(src, 0x38578) - 0x8000000;
            int asmptext = Bits.getInt32(src, 0x385DC) - 0x8000000;
            int chardata = Bits.getInt32(src, asmpchar) - 0x08000000;
            int charpntrs = Bits.getInt32(src, asmpchar + 4) - 0x08000000;
            for (int char1 = 0; char1 < 0x100; char1++) {
                if (charpntrs == asmpchar) { break; }
                if (Bits.getInt16(src, charpntrs) == 0x8000) { charpntrs += 2; continue; }
                int charTree = (chardata + Bits.getInt16(src, charpntrs)) << 3; charpntrs += 2;
                int charSlot = charTree - 12;
                byte bits = 0; int bitC = 0; int entry = (char1 << 8);
                do {
                    while (((src[charTree >> 3] >> (charTree++ & 7)) & 1) == 0) { bits++; }
                    bitChar[entry] = (short)((Bits.getInt16(src, charSlot >> 3) >> (charSlot & 7)) & 0xFFF); charSlot -= 12;
                    bitLen[entry] = bits; if (bits >= 24) { return decompTextOld(src); }
                    bitcode[entry] = bitC;
                    while (((bitC >> (bits - 1)) & 1) == 1) { bits -= 1; bitC ^= 1 << bits; }
                    bitC |= 1 << (bits - 1);
                    entry += 1;
                } while (bits > 0);
            }
            //Console.WriteLine(DateTime.Now - c);
            //c = DateTime.Now;
            int textTree = 0, textLenAddr = 0;
            byte[] des = new byte[0x800000]; int desEntry = 0, desPos = 0xC300;
            for (int srcI = 0; srcI < 12461; srcI++) {
                Bits.setInt32(des, desEntry, desPos - 0xC300); desEntry += 4;
                int srcInd = srcI;
                if ((srcInd & 0xFF) == 0) {
                    textTree = Bits.getInt32(src, asmptext + ((srcInd >> 8) << 3)) - 0x08000000;
                    textLenAddr = Bits.getInt32(src, asmptext + ((srcInd >> 8) << 3) + 4) - 0x08000000;
                } else {
                    int cLen;
                    do {
                        cLen = src[textLenAddr++];
                        textTree += cLen;
                    } while (cLen == 0xFF);
                }
                int initChar = 0, bitnum = 0, val = 0, textTree2 = textTree;
                do {
                    while (bitnum < 24) { val |= (int)src[textTree2++] << bitnum; bitnum += 8; }
                    int entry = initChar << 8;
                    while ((val & ((1 << bitLen[entry]) - 1)) != bitcode[entry]) { entry++; }
                    initChar = bitChar[entry]; val >>= bitLen[entry]; bitnum -= bitLen[entry];
                    des[desPos++] = (byte)initChar;
                    //if (desPos >= 0x10000) { break; }
                } while (initChar != 0);
            }
            Console.WriteLine(DateTime.Now - c + " (Text Decompression)");
            return des;
        }



---
Wasn't really planning to link this at first, but I do have an early version that works... (My open-source program is here... But you can go to bin>Release to find the program. Once it's up, and ROM is loaded, then click Editors button.)
https://drive.google.com/file/d/0B-uU7QfJXcL9Q0FweEhxQTNmOGc/view?usp=sharing (Don't think it includes the rewritten decompression routine shown above, though.)
It's only a "Bug fix" because I gave someone a previous version... and then I fixed the minor bugs afterward.. (The bugs were not in the decompression or compression, though.)
Note that the reason the form takes a second or two to load is not b/c of decompression... It's because of the listbox. -- Otherwise, I likely would have worked on that Find feature sooner... and Listbox items don't update unless you close/open window.)
Basically once you hit "Edit", the text will be compressed to the ROM @ 0x08FA0000. But you still must save from the main form to write it back to your ROM file.
Golden Sun Docs: Broken Seal - The Lost Age - Dark Dawn | Mario Sports Docs: Mario Golf & Mario Tennis | Misc. Docs
Refer to Yoshi's Lighthouse for any M&L hacking needs...

Sometimes I like to compare apples to oranges. (Figuratively) ... They are both fruits, but which one would you eat more? (If taken literally, I'd probably choose apples.)
Maybe it is over-analyzing, but it doesn't mean the information is useless.


The only GS Discord servers with significance are:
Golden Sun Hacking Community
GS Speedrunning
/r/Golden Sun
GS United Nations
Temple of Kraden

Can you believe how small the Golden Sun Community is?

2+2=5 Don't believe me? Those are rounded decimal numbers. Take that, flat earth theorists! :)

approxies

Wow, that looks impressive.
Be warned, that I get unhandled exception when I push Editors.
Check details here.

Daddy Poi's Oily Gorillas

#24
Thanks.
*Tests that particular version through the EXE itself.* - I don't seem to be getting any errors?
Since it is an index out of range, I wonder which ROM you may be using? (Esp. with the foreign text in that link. Russian or something?- Don't know that language.)
GS2 (UE) is the only one with text supported in this version.
Looking back at my code, it looks like I may have map viewer compatibility for other versions. (Including GS1, and the Golf game.) Hope you like the instant map loading speed. :)
I go for English first, obviously. (Haven't really checked for this particular data, but I do know some GS2 ROMs may have some of their data in the same place.) 

---
Anyway, the plan for the editor some day is to support all versions of GS1, GS2, and GS3.. including translations if it's a mostly simple process... and perhaps the same for the Tennis and Golf games...? But these are my hopes, whether I get to them is the bigger question.

---
Looks like I might have done a booby.... If you open a ROM that is incompatible with map loading... (i.e. Tennis game, I think.) then the editor will have an error on loading it again... I guess go to C:\Users\<username>\appdata\Local ... and delete the editor directory (gsmagic) when that happens. - The only thing I store there is the ROM last loaded for convenience. (Well there, because that's where Visual Studio stores the settings.)


---[spoiler]
Thoughts about planning Editor file in ROM and/or as external file... Formatting idea / Work-In-Progress. (Thoughts/Ideas as in... none of this has been implemented yet. But I'd like to come up with an approach that doesn't take a lot of space. With the possibility of editor-related text being an exception, but not sure.)
Header:
ARM instruction (Because instruction at 08000000 may point here.... and this may be to point back to wherever code execution would go...)
Magic Number? (To verify that the pointed location was for the Editor File. Might go before the Arm Instruction??)
Sub-file header:
Identifier/magic number & may include size of this sub-file(?) (32-bit) = Whatever this is determines how/when file is read/used. (All these identifiers from each sub-file may be read before main form pops up.  - I thought this idea would help with re-ordering tables as well. - Oh, and this might just be a number, and not in ASCII. With a possiblity of certain ranges having flags for different stuff/etc.)
Size of table in ROM. (32-bit if 7 bits go unused... otherwise 25-bit, but could be determined by Identifier; Like if size in measured by number of 32-bits in some cases (i.e. code), max 23-bit could do.) = Size so that inserting new tables is simple enough... and when you check the next file, you'll add this to the address. ; Depending on what comes to mind, not sure if this could be combined with Identifier in some cases.
Data stuff. (What is here may depend on the Identifier value. - Some Identifiers may not need any data here.)

Identifier Ideas:
Code = All code gets scanned when things get repointed...
Unsupported = Basically data that is skipped over...
Free = When tables are resized, this helps to know how much free space is available.
Text Editor = May have two data values for main tables of Char and Text strings. (Depends on approach.)
Graphic = Well, there are some uncompressed graphics, at least.
Custom = There may be several custom identifiers... and the data is to list the datatypes/bit-lengths of the data and stuff...
Possible idea for Custom Types: (Not sure if any of these will share the same identifier or not.)
- Plain complex data tables using a constant length per entry. (Think items, abilities, enemies... enemy groups, etc... data)
- Object tables (Where the start of the list is a pointer list... Pretty much implying variable length entries. Sort of like the Master File Table(?) I don't think there are any others in that type of format, though?)
- When pointers need updating for repointering stuff?
- Simple stuff like when there is one data type used? / Or one kind of bitlength (aka: each entry could hold one piece of data that is either 8-bit, 16-bit, or 32-bit.)... (I mean, so simple that we don't need any data in the data portion of this editor subfile? Dunno, because labeling...)
Etc.


Most of the above should be taken with a grain of salt, I guess.... For all I know, a different strategy could be taken into account... i.e. I could look into storing only table size data, and everything else be done in the editor code.... with the expectation that the only thing needed is table resizing, and no compatibility with adding additional tables/changing their order... No idea. Never hurts to start simple, though.
[/spoiler]
Golden Sun Docs: Broken Seal - The Lost Age - Dark Dawn | Mario Sports Docs: Mario Golf & Mario Tennis | Misc. Docs
Refer to Yoshi's Lighthouse for any M&L hacking needs...

Sometimes I like to compare apples to oranges. (Figuratively) ... They are both fruits, but which one would you eat more? (If taken literally, I'd probably choose apples.)
Maybe it is over-analyzing, but it doesn't mean the information is useless.


The only GS Discord servers with significance are:
Golden Sun Hacking Community
GS Speedrunning
/r/Golden Sun
GS United Nations
Temple of Kraden

Can you believe how small the Golden Sun Community is?

2+2=5 Don't believe me? Those are rounded decimal numbers. Take that, flat earth theorists! :)

approxies

Ah, yes indeed I've tried GS1 version, sorry for misinformation.
Anyhow, I've finished my tool, that's mainly for translation purposes.
It supports GS1 and GS2, decoding and encoding. It can output as text or binary, but encoding is only for binary.*.bat file also inserts the data to ROM and warns, if data doesn't fit in place.