Results 1 to 10 of 10

Thread: Help with LZSS algorithm... (scummvm LZSS)

  1. #1
    Member
    Join Date
    Oct 2019
    Location
    Barcelona
    Posts
    6
    Thanks
    1
    Thanked 0 Times in 0 Posts

    Unhappy Help with LZSS algorithm... (scummvm LZSS)

    Hi,

    I don't know if this forum is the correct to ask for help, but I hope that someone can give me light to this.
    I'm trying to translate a game (Star Trek Judgment Rites, Enhanced CD, from English to Spanish), it is very old, from 1993, from Interplay, but the files are compressed with LZSS (or similar) algorithm.

    I have a fake compression algorithm, which increases the size of the file, but I need the original compression to avoid the files to be bigger than 0xFFFF.
    I tried to make the compression algorithm, tested another bunch of codes for LZSS found in inet, but without success.

    This is the decompression algorithm for the files (it can be improved, of course, but this is not what I'm searching for now). I need the compression algorithm if possible:

    Code:
    void _uncompresslzss(uint32 CompSize, uint32 UnCompSize) {
        int N = 0x1000; // This is 4096.
        int THRESHOLD = 3, i = 0, j = 0;
        byte *HisBuf = new byte[N];
        byte b = 0, Length = 0, flagbyte = 0;
        unsigned short int offsetlen;
        uint32 outstreampos = 0;
        uint32 bufpos = 0;
        bool end = false;
        unsigned int BytesRead = 0, Offset = 0;
    
        memset(HisBuf, 0, N);
    
        BytesRead = 0;
    
        while (!end) {
            flagbyte = (byte)compdata[BytesRead];
    
            if (BytesRead == CompSize) end = true;
            BytesRead++;
    
            if (!end) {
                for (i = 0; i < 8; i++) {
                    if ((flagbyte & (1 << i)) == 0) {
                        offsetlen = (byte)compdata[BytesRead] + ((byte)compdata[BytesRead + 1] << 8);
    
                        if (BytesRead == CompSize) end = true;
                        BytesRead += 2;
    
                        if (!end) {
                            Length = (offsetlen & 0xF) + THRESHOLD;
                            Offset = bufpos - ((offsetlen >> 4) & (N - 1));
    
                            for (j = 0; j < Length; j++) {
                                b = HisBuf[(Offset + j) & (N - 1)];
                                uncompdata[outstreampos++] = b;
                                HisBuf[bufpos] = b;
                                bufpos = (bufpos + 1) & (N - 1);
                            }
                        }
                    }
                    else {
                        b = (byte)compdata[BytesRead];
    
                        if (BytesRead == CompSize) end = true;
                        BytesRead++;
    
                        if (!end) {
                            uncompdata[outstreampos++] = b;
                            HisBuf[bufpos] = b;
                            bufpos = (bufpos + 1) & (N - 1);
                        }
                    }
                }
            }
        } 
    
        if (outstreampos != UnCompSize)
            printf("WARNING: file might not have been extracted correctly! ");
    
    }
    I attached two sample files (2x compressed + 2x uncompressed), one is bigger than the other, which is very small:

    I hope someone can help me, as far as I've seen there is not any tool for the compression algorithm (at least I have not found it).

    Any help will be appreciated.

    Thanks a lot.
    Attached Files Attached Files

  2. #2
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,977
    Thanks
    296
    Thanked 1,304 Times in 740 Posts
    Not sure if I already encountered it, but here I modified encoder from this: https://encode.su/threads/3140-LZ98?...ll=1#post60866
    Size seems to be similar, but not an exact match.
    Attached Files Attached Files

  3. Thanks:

    L@Zar0 (18th October 2019)

  4. #3
    Member
    Join Date
    Oct 2019
    Location
    Barcelona
    Posts
    6
    Thanks
    1
    Thanked 0 Times in 0 Posts
    Quote Originally Posted by Shelwien View Post
    Not sure if I already encountered it, but here I modified encoder from this: https://encode.su/threads/3140-LZ98?...ll=1#post60866
    Size seems to be similar, but not an exact match.
    Shelwien,

    Thank you very much!!!

    I have tested the compressor you've modified in both games that uses it, one is Star Trek 25th Anniversary and the other Star Trek Judgment Rites. The two tests I have done, had worked flawlessly. I don't know if I will have any problems with any other file compressed, but I have very high expectations.

    I hope you will accept that I give credits to you for the encoding algorithm in the Readmes (altough in spanish) for your help.

    I will post if I have any problem in the future, but I think that if the algorithm is working with a bitmap file of 64008 bytes and with FEDS.RDF file (more complex, with opcodes for texts and actions), the rest of the files will work also.

    I'm very grateful to you.

    Regards.

  5. #4
    Member
    Join Date
    Oct 2019
    Location
    Barcelona
    Posts
    6
    Thanks
    1
    Thanked 0 Times in 0 Posts
    Well,

    it seems I have called victory so soon. I attach a new file with I'm working (translated) for: samplefile2.zip. Uncompressed and compressed format (with lzss_rdf_v0).

    It seems that the algorithm of lzss_rdf_v0 compress and uncompress well the file. Even the algorithm in the first post which supposedly worked with the game and I'm using is working well when decompressing. I assume that this algorithm may work with different compressed files if the main params are the same like window size, N param, etc...
    But the game is having problems somehow interpreting the data. I think that the compression must be adjusted somehow.

    I include also, originalsamples.zip, with the original (not translated) files that came with the game. It is very odd, but if I use the fake compressor, they work well with the game.

    Somehow I've seen that the 0x1F (0001 1111) is used normally when a bunch of the same bytes are correlative. In lzss_rdf_v0 tool, this changes a bit with 0xnF . This is represented in offset 0x8A of both compressed files (the original of the game and the compressed with lzss_rdf_v0).

    I will try to check it and hope to find a solution, but I'm very bad when programming bitwise operations.
    Attached Files Attached Files

  6. #5
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,977
    Thanks
    296
    Thanked 1,304 Times in 740 Posts
    > But the game is having problems somehow interpreting the data.
    > I think that the compression must be adjusted somehow.

    There're some choices for encoder implementations:
    1. allowing overlapped matches (dist<len)
    2. allowing wrapped matches (ones that start at the end of window buffer)
    3. whether window buffer is updated during match copying or after

    Based on your decoder its "yes" for all 3, but it only means anything
    if reverse-engineered decoder works exactly as game code.

    Normally I'd suspect hashtable-based matchfinder, which is the usual cause
    for different encoder output, but here it seems that longer distances
    are preferred in compressed files, which is kinda incompatible with hashchain matchfinders.

    > I include also, originalsamples.zip, with the original (not translated) files that came with the game.
    > It is very odd, but if I use the fake compressor, they work well with the game.

    Is it possible that translation breaks internal structure,
    or translated file is longer and doesn't fit in some buffer?

    > Somehow I've seen that the 0x1F (0001 1111) is used normally when a bunch of the same bytes are correlative.

    Some formats (eg. LZ4) have special values which enable different behavior
    (eg. read extra bytes for length or distance).
    https://github.com/lz4/lz4/blob/dev/lib/lz4.c#L1895

    > I will try to check it and hope to find a solution, but I'm very bad when programming bitwise operations.

    Here I added some match logging:
    Code:
    008F: 0032/005D/7  0032/005D/7 004A/0045/7 005C/0033/7 0072/001D/7
    ^     ^^^^^^^^^^^                                      ^    ^    ^-- match length
    |     \---------- pos/d/l for actual decoded match     \    \------- match distance
    |                                                       ------------ abs. position
    \---- position in unpacked file
    But I don't understand the difference (ie how encoded match is chosen from available options)
    Could be a "lazy" matchfinder, or some hashtable or anything.
    Attached Files Attached Files

  7. #6
    Member
    Join Date
    Oct 2019
    Location
    Barcelona
    Posts
    6
    Thanks
    1
    Thanked 0 Times in 0 Posts
    Thanks again, I will check your results and let you know. I want to answer some of your points tough:


    * I seen the QuickBMS tool mentioned in the thread you posted. In fact, I had checked the algorithm with EI, EJ params, but I though it was a very complicated method. It did not work for me. It worked with trees I think, and the want you passed me works with an array, as the decoder one.


    * About the translated file, maybe I have expressed bad previously what I want to say.
    I thought also what you've said, maybe I had a problem with the translated file and its offsets. The file with new translated texts is obviously bigger, maybe 5000-6000kb. If I use the "fake" compressor, so, it is even bigger than the uncompressed one, and use it in game, it works well, I can see the texts from the beginning. If I use the encoder v0 which we are trying to decipher here, It encodes/decodes perfectly, even with my decoder algorithm of the first post, but the game misses some lines I can see at the beginning. Strangely, it seems to decode somehow the file, if it was not correct, the game crashes.

    Look the offset 0x75A0 in the *.NEW file (which is called from the offset 0xA64 in the file). From here I put some translated lines, which are called inside the code (with updated pointers), aligned to 16bits for clarify the texts. Maybe too much zeroes between lines make some garbagging in the encoder/decoder.
    Some other lines, the ones that uses the original offsets to the texts seems to work well. I think that from some point, the executable decodes the data differently as expected.


    * Lastly,
    There is this code:
    1F 00 1F 00 F0 1F 00 1F 00 1F 00 1F 00

    which is all the bunch of zeroes that there are in the original file.
    1F = 0000 0001 1111 (so Length = 15, Offset = pos (of the match in History Buffer/Window?) - 1.
    But it seems that this position is always the same, it seems to find the mach absolutely and not relatively, may be?


    There are things that I do not understand from this compression, like the window size.
    Is it moveable? It refreashes (resets the data) and begins again -maybe the window resets when full-? Reuses the previous data/dictionary?

    I understand minimally the LZSS algorithm, and I think that the matches are of 16bytes, can it be?

    I'm really a newbie for this!!

    PD: If you need some other file, I can pass it to you. There are smaller files which can help better to solve this issue.


    =========== EDIT ==============
    I have attached a small file which you can see very easy the changes, from the original file and the one with lzss_rdf_v0 compressor.

    If I use the (r >= ml) -the commented-, the result is more similar than (r > ml), -except for one match offset near the end-.

    if (r > ml) {
    ml = r;
    mj = j;
    }
    /*
    -->> if (r >= ml) {
    ml = r;
    mj = j;
    }
    */
    Attached Files Attached Files
    Last edited by L@Zar0; 18th October 2019 at 23:44. Reason: Update with file

  8. #7
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,977
    Thanks
    296
    Thanked 1,304 Times in 740 Posts
    > encodes/decodes perfectly, even with my decoder algorithm of the first post,
    > but the game misses some lines I can see at the beginning.

    All kinds of subtle differences are possible if your decoder from first post
    doesn't work exactly like the one in the game.

    > There is this code:
    > 1F 00 1F 00 F0 1F 00 1F 00 1F 00 1F 00

    maxlen is 18, so it uses a series of (-1,18) matches to encode a long string of zeroes.

    > If I use the (r >= ml) -the commented-, the result is more similar than (r > ml),

    on this file yes, but on your first sample there're more matches chosen with longest distance.
    In any case, I don't think this is the problem, unless len=0xF has some special meaning
    (like maybe its decoded not as 15+3 but as 256 or something).

    > PD: If you need some other file, I can pass it to you.
    > There are smaller files which can help better to solve this issue.

    Either capture the output of game's decoder, or give me the executable that contains the decoder.

  9. #8
    Member
    Join Date
    Oct 2019
    Location
    Barcelona
    Posts
    6
    Thanks
    1
    Thanked 0 Times in 0 Posts
    The code of the first post is implemented as is. It's not an executable with the decoder. It's public. I've adapted the algorithm that is in the sources of SCUMMVM to my tool por uncompress the files I need to translate, although the original packed file of the game has a lot of subfiles:

    https://github.com/scummvm/scummvm/b...rtrek/lzss.cpp

    As far as I've seen, "offsetlen & 0xF" is only to mark the length. So, the max byte length of an offset, calculated with "Offset = bufpos - ((offsetlen >> 4) & (N - 1));", is 18 bytes, I assume (this is 15 + THRESHOLD, which is 3).

    I have done some changes in the lzss_rdf_v0 like:

    Code:
            for (i = 0; i < 8; i++) {
    
                if (inpptr >= inp + inpsize) break;
    
                l = inp + inpsize - inpptr;
                if (l > 15 + 3) l = 15 + 3;
    
                search_pos = 0;
    
                ml = 0; // max match len
                mj = 0; // max match pos
    
                for (search_pos = 0; search_pos < pos; search_pos++) {
                    //for( r=0; r<l; r++ ) if( inpptr[r]!=inp[j+r-0xFEE] ) break;
                    for (r = 0; r < l; r++)
                        if (inpptr[r] != inp[search_pos + r]) {
                            break;
                        }
                            
                    if (r > ml) {
                        ml = r;
                        mj = search_pos;
                    }
    
                    if (r >= THRESHOLD) break;
                }
    
                if (ml < 3) {
                    // literal
                    c = *inpptr++;
                    win[(pos++) & winmask] = c;
    
                    *outptr++ = c;
    
                    flag |= 1 << i; // set literal flag
                    } else {
                    // match
    
                    // dddddddd   //l = (d & 15) + 3;
                    // DDDDllll   //d = c + (d>>4)*256;
    
                    // ddddllll
                    // DDDDDDDD
    
    
                    //        offsetlen = (byte)compdata[BytesRead] + ((byte)compdata[BytesRead + 1] << 8);
                    //        Length = (offsetlen & 0xF) + THRESHOLD;
                    //        Offset = bufpos - ((offsetlen >> 4) & (N - 1));
    
                    mj = search_pos - mj + 1;
    
                    *outptr++ = (mj << 4) + (ml - 3);
                    *outptr++ = byte(mj >> 4);
    
                    for (j = 0; j < ml; j++) {
                        c = *inpptr++;
                        win[(pos++) & winmask] = c;
                    }
                }
            } // for
    
            *pflag = flag;
            }
    It helps a bit with the "1F 00"s, but I can not calculate the offsets that are near the end. I'm trying to calculate them somehow.

  10. #9
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,977
    Thanks
    296
    Thanked 1,304 Times in 740 Posts
    > I've adapted the algorithm that is in the sources of SCUMMVM

    So is it scummvm that has problems with decoding of modified resources?

    If so, then my encoder itself shouldn't be a problem.
    What about header then? "FED8_COMP.IW" had a header with 16-bit unplen/—Āomplen fields, do you update these?

    Also, I found this (decompiled source attached):https://filetrip.net/nds-downloads/u...-0-f23640.html
    Attached Files Attached Files

  11. #10
    Member
    Join Date
    Oct 2019
    Location
    Barcelona
    Posts
    6
    Thanks
    1
    Thanked 0 Times in 0 Posts
    Yes, that is what I've checked this morning, the header of the compressed files. The files patched with the new strings for the game use FILESIZE_UNCOMP_DATA + COMP_RAW_DATA.

    The game uses only the header of the uncompressed data, but I was using the one with compressed data. So, this was my mistake, I used an incorrect filesize, that's why some text where not used, because the uncompression was cutted. I've changed it and now is working perfectly.

    I apologize for my mistake.

    Thanks a lot for your help.

    I will check anyway the information you've provided. It can be interesting to discover the algorithm that matches the original compressed data.

    Regards.

Similar Threads

  1. LZSS Help
    By Z3R0X in forum Data Compression
    Replies: 0
    Last Post: 25th March 2018, 16:26
  2. CUDA LZSS
    By Piotr Tarsa in forum Data Compression
    Replies: 20
    Last Post: 25th July 2014, 22:41
  3. LZSS + Huffman?
    By comp1 in forum Data Compression
    Replies: 2
    Last Post: 31st March 2014, 17:04
  4. Optimized LZSS compressor
    By encode in forum Data Compression
    Replies: 11
    Last Post: 13th February 2014, 22:51
  5. LZSS v0.01 is here!
    By encode in forum Data Compression
    Replies: 67
    Last Post: 28th March 2012, 10:10

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •