
Thread: Cracking an old MS-DOS game's compressed file.

  1. #1
    Member
    Join Date
    Jul 2018
    Location
    Netherlands
    Posts
    37
    Thanks
    22
    Thanked 0 Times in 0 Posts

    Cracking an old MS-DOS game's compressed file.

    Hi,

    I've been trying to reverse engineer the file formats of an old MS-DOS game called Cartooners (by Electronic Arts, 1988). The installer comes with the game's files stored in *.PEA files and an installation script stored in a *.IEA file. Data in both file types is compressed, and both use the same compression technique. I know this because when the relevant data from the *.IEA file is copied to a *.PEA file and a suitable header (which I have completely reverse engineered) is added, the installer decompresses the data to a file.

    I have been trying to understand how the data is compressed, with little success. So far I have only found that:
    -When interpreting the compressed data as a string of nine-bit chunks, I can extract fragments that are clearly uncompressed literals.
    -The frequency of these literals quickly diminishes, and wherever (presumably) compressed data appears, it stands in for a literal fragment that has already occurred.
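
    A minimal sketch of the nine-bit reading described above. It assumes the bits of each byte are consumed lowest bit first and that each chunk is filled from its lowest bit upward (the same order the decoders posted later in the thread use). The function name and parameters are illustrative only. Reading at a fixed width is just a first approximation, since later replies establish that the width grows beyond nine bits.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Split a byte stream into fixed-width chunks. Bits are taken LSB-first from
// each byte and stored into the chunk from bit 0 upward (assumed bit order).
std::vector<uint32_t> unpack_codes(const std::vector<uint8_t>& data,
                                   std::size_t skip, unsigned width) {
    assert(width >= 1 && width <= 16);
    std::vector<uint32_t> codes;
    uint32_t acc = 0;     // bits collected so far for the current chunk
    unsigned filled = 0;  // number of valid bits in acc
    for (std::size_t i = skip; i < data.size(); ++i) {
        for (unsigned bit = 0; bit < 8; ++bit) {
            acc |= static_cast<uint32_t>((data[i] >> bit) & 1u) << filled;
            if (++filled == width) {
                codes.push_back(acc);
                acc = 0;
                filled = 0;
            }
        }
    }
    return codes;  // leftover bits that do not fill a whole chunk are dropped
}
```

    Calling unpack_codes(bytes, 10, 9) would skip the 10 header bytes and return the nine-bit chunks; values below 0x100 are the readable literals.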

    The compressed data (asterisks represent data assumed to be compressed). The first 10 bytes in the IEA file are ignored; they appear to be part of a header.
    Code:
    *
    **BLINK = 8*ask$=" "*Drive2$**N******if (DosVersion < 200)*color 7,0*ls*say"Inst
    allat**not *mplete!***Carto**s requi**DOS*.0 * high***"P**e*eb**you*sy*em**a ***
    **and try again*exi*1*e*****c**t* 0*,*ÉÍ***********»*L*eNum=*
    A direct link to the actual compressed file:
    https://drive.google.com/open?id=1T3...-xLWOIuBL7qqH-


    The decompressed counterpart to the above data:
    Code:
    
    
    
    
    
    BLINK = 8
    ask$=" "
    Drive2$ = "N"
    
    
    
    
    
    
    
    
    
    
    
    
    
    
    if (DosVersion < 200)
    color 7,0
    cls
    say"Installation not complete!"
    say"Cartooners requires DOS 2.0 or higher"
    say"Please reboot your system on a later DOS and try again"
    exit 1
    endif
    
    
    
    
    
    
    
    
    
    
    
    
    cls
    atsay 0,0, "ÉÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍ»"
    LineNum=1
    :NextLine
    atsay LineNum,0, "º"
    I've added a link to a program written in VB.Net that can dump IEA/PEA files as displayed above. The .zip file also contains a documents folder with both the compressed and decompressed IEA file and a description of an IEA and PEA file's header. The program can also copy the IEA file's data to a PEA file, which the game's installer can then decompress and write to a file. I assume that the installer normally extracts the IEA file to memory, but extracting it from there is too much hassle.

    Link:
    https://drive.google.com/open?id=162...9_ULkau56hQloR

    Can anyone help me figure out how the data is compressed? Hopefully my explanations make sense. I can provide other files on request.
    Last edited by Peter Swinkels; 18th July 2018 at 12:37. Reason: Included an extra link. Missing word.

  2. #2
    Member
    Join Date
    Aug 2016
    Location
    USA
    Posts
    58
    Thanks
    15
    Thanked 21 Times in 16 Posts
    Looks like LZ77, where the ** you have specify an (offset, length) pair into the preceding data. For example, in "Installat**not", the ** likely refers to the string "ion " from the earlier "DosVersion < 200". Since I see lone *, it's probably the case that the high bit of any byte distinguishes a literal from an (offset, length) reference. It's also possible that the (length, offset) pair is packed into just 7 bits, so longer matches need multiple bytes. No clue without looking at the actual bytes, but this seems like a vanilla replacement scheme.

  3. Thanks:

    Peter Swinkels (17th July 2018)

  4. #3
    Member
    Join Date
    Jul 2018
    Location
    Netherlands
    Posts
    37
    Thanks
    22
    Thanked 0 Times in 0 Posts
    Quote Originally Posted by Stefan Atev View Post
    Looks like LZ77, where the ** you have specify an (offset, length) pair into the preceding data. For example, in "Installat**not", the ** likely refers to the string "ion " from the earlier "DosVersion < 200". Since I see lone *, it's probably the case that the high bit of any byte distinguishes a literal from an (offset, length) reference. It's also possible that the (length, offset) pair is packed into just 7 bits, so longer matches need multiple bytes. No clue without looking at the actual bytes, but this seems like a vanilla replacement scheme.
    Hi,

    Thanks for the reply. Yes, it probably is something like that, but I can't make sense of what presumably are offsets. Filtering and shifting bits didn't help me at all. What I can find on the internet about compression is either too technical for me or doesn't appear to apply to this file. The readable text only appears when reading nine bits at a time instead of the usual eight. Here is a direct link to the actual file: https://drive.google.com/open?id=1T3...-xLWOIuBL7qqH-

    Peter

  5. #4
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,977
    Thanks
    296
    Thanked 1,305 Times in 741 Posts
    With its 512-symbol (9-bit) alphabet, it could also be LZW - could be assigning new codes to new {c1;c2} pairs or something.
    Attached Files

  6. Thanks:

    Peter Swinkels (17th July 2018)

  7. #5
    Member
    Join Date
    Jul 2018
    Location
    Netherlands
    Posts
    37
    Thanks
    22
    Thanked 0 Times in 0 Posts
    Quote Originally Posted by Shelwien View Post
    With its 512-symbol (9-bit) alphabet, it could also be LZW - could be assigning new codes to new {c1;c2} pairs or something.
    Hi Shelwien, thanks for your reply. I see your program assumes nine bits per "byte" throughout the Cartoon.iea file. I should've mentioned that that can't be true, because (9994 - 10) / 9 doesn't result in a whole number. (9994 - 10 is the file's size minus the header.)
    Last edited by Peter Swinkels; 17th July 2018 at 16:50. Reason: typo

  8. #6
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,977
    Thanks
    296
    Thanked 1,305 Times in 741 Posts
    See 9dec3 in previous post - I attempted implementing some LZW.
    As to bits per symbol, it's common for LZW to extend the alphabet when necessary - it probably starts using 10 bits per symbol when code 0x200 is added.
    For now though, 9dec3 has two problems:
    - symbol $103 is not "\n\r\n", so either $117 is defined incorrectly, or $103
    - symbol $124 is defined as $103+$124 (and this commonly happens later).
    So either more than one symbol is added per symbol (but two-letter symbols seem to match?), or
    "recursion" in symbol definition has some special meaning.

  9. Thanks:

    Peter Swinkels (17th July 2018)

  10. #7
    Member
    Join Date
    Jul 2018
    Location
    Netherlands
    Posts
    37
    Thanks
    22
    Thanked 0 Times in 0 Posts
    Quote Originally Posted by Shelwien View Post
    See 9dec3 in previous post - I attempted implementing some LZW.
    As to bits per symbol, it's common for LZW to extend the alphabet when necessary - it probably starts using 10 bits per symbol when code 0x200 is added.
    For now though, 9dec3 has two problems:
    - symbol $103 is not "\n\r\n", so either $117 is defined incorrectly, or $103
    - symbol $124 is defined as $103+$124 (and this commonly happens later).
    So either more than one symbol is added per symbol (but two-letter symbols seem to match?), or
    "recursion" in symbol definition has some special meaning.
    I see you made some progress. Thanks.

  11. #8
    Member
    Join Date
    Feb 2016
    Location
    Luxembourg
    Posts
    546
    Thanks
    203
    Thanked 796 Times in 322 Posts
    @Shelwien
    LZW usually reserves 2 codewords, not just 1 (clear dictionary, eof).
    I fixed a few bugs, it correctly decodes the file (until you need to switch to 10bits-per-codeword, obviously. Should be easy to extend it now).

  12. Thanks:

    Peter Swinkels (18th July 2018)

  13. #9
    Member
    Join Date
    Jul 2018
    Location
    Netherlands
    Posts
    37
    Thanks
    22
    Thanked 0 Times in 0 Posts
    Quote Originally Posted by mpais View Post
    @Shelwien
    LZW usually reserves 2 codewords, not just 1 (clear dictionary, eof).
    I fixed a few bugs, it correctly decodes the file (until you need to switch to 10bits-per-codeword, obviously. Should be easy to extend it now).
    My C++ compiler in Microsoft Visual Studio 2017 Community is complaining about the "fopen" functions. It says I should use "fopen_s" instead. That's fine, but it turns out I have to change all the other file handling functions accordingly. My C++ knowledge isn't that great. Which changes do I need to make for me to use "fopen_s" instead of "fopen"? Thanks.

  14. #10
    Member
    Join Date
    Feb 2016
    Location
    Luxembourg
    Posts
    546
    Thanks
    203
    Thanked 796 Times in 322 Posts
    I'm not using VS2017, but this should do it. You should probably try it on more files.
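
    The attachment is not reproduced here. As a rough sketch of one way to handle the warning, the wrapper below uses fopen_s when built with MSVC and plain fopen elsewhere, so the rest of the file handling (reads, writes, fclose) can stay as it was. The name open_file is made up for this example. Another option on MSVC is to define _CRT_SECURE_NO_WARNINGS before the includes and keep calling fopen.

```cpp
#include <cassert>
#include <cstdio>

// Hypothetical helper: open a file with fopen_s on MSVC, fopen elsewhere.
// Returns nullptr on failure in both cases, like fopen does.
static std::FILE* open_file(const char* path, const char* mode) {
    assert(path != nullptr && mode != nullptr);
#ifdef _MSC_VER
    std::FILE* f = nullptr;
    return fopen_s(&f, path, mode) == 0 ? f : nullptr;
#else
    return std::fopen(path, mode);
#endif
}
```

    Each fopen(name, mode) call then becomes open_file(name, mode), and the existing null checks keep working.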

  15. Thanks:

    Peter Swinkels (18th July 2018)

  16. #11
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,977
    Thanks
    296
    Thanked 1,305 Times in 741 Posts
    Huh, I did it too though :)
    I think $101 code is EOF actually, at least it only occurs at the end in this file.
    Attached Files

  17. Thanks:

    Peter Swinkels (18th July 2018)

  18. #12
    Member
    Join Date
    Jul 2018
    Location
    Netherlands
    Posts
    37
    Thanks
    22
    Thanked 0 Times in 0 Posts
    Quote Originally Posted by Shelwien View Post
    Huh, I did it too though
    I think $101 code is EOF actually, at least it only occurs at the end in this file.
    This version appears to decompress the file properly and works in Visual Studio. I am going to test it on the other compressed files. Thanks.

  19. #13
    Member
    Join Date
    Jul 2018
    Location
    Netherlands
    Posts
    37
    Thanks
    22
    Thanked 0 Times in 0 Posts
    Hi, I am now trying to convert your C++ code to vb.net. I mostly succeeded but there is a bug causing too much data to be written to the output file. EDIT: FIXED. (The variable "a" should've been converted to a byte before writing it to a file.)

    Code:
    Option Compare Binary
    Option Explicit On
    Option Infer Off
    Option Strict On
    
    
    Imports System
    Imports System.Convert
    Imports System.Environment
    Imports System.IO
    Imports System.Math
    
    
    Public Module Module1
       Dim lzw(65536, 2) As Integer
    
    
       Public Sub Main()
          Dim a As New Integer
          Dim Bit As Integer = 0
          Dim BitCount As Integer = 9 'wordsize
          Dim Character As New Byte 'c
          Dim d As Integer = 0
          Dim i_lzw As Integer = &H102%
          Dim InputFile As New BinaryReader(File.Open(GetCommandLineArgs(1), FileMode.Open))  'f
          Dim j As New Integer
          Dim OutputFile As New BinaryWriter(File.Open(GetCommandLineArgs(2), FileMode.Create)) 'g
          Dim pd As Integer = -1
    
    
          InputFile.BaseStream.Seek(10, SeekOrigin.Begin)
    
    
          For j = &H0% To &HFF%
             lzw(j, 0) = j
             lzw(j, 1) = -1
          Next j
    
    
          lzw(j, 0) = -1
          lzw(j, 1) = -1
          j += 1
          lzw(j, 0) = -1
          lzw(j, 1) = -1
    
    
          Do Until InputFile.BaseStream.Position >= InputFile.BaseStream.Length
             Character = InputFile.ReadByte()
    
    
             For j = 0 To 7
                d = d Or ((Character >> j) And &H1%) << Bit
                Bit += 1
                If Bit >= BitCount Then
                   If d = &H101% Then Exit Do 'EOF?
    
    
                   If d = &H100% Then
                      BitCount = 9
                      i_lzw = &H102%
                      pd = -1
                   Else
                      If pd = -1 Then
                         a = d
                      Else
                         a = Dump(If(d < i_lzw, d, pd), OutputFile)
                         lzw(i_lzw, 0) = pd
                         lzw(i_lzw, 1) = a
                         a = If(d < i_lzw, -1, a)
                         i_lzw += 1
                         If i_lzw = (1 << BitCount) Then BitCount += Abs(CInt(BitCount < 12))
                      End If
                      If a >= 0 Then OutputFile.Write(ToByte(a))
                      pd = d
                   End If
                   Bit = 0
                   d = 0
                End If
             Next j
          Loop
       End Sub
    
    
       Private Function Dump(d As Integer, OutputFile As BinaryWriter) As Integer
          Dim c As New Integer
    
    
          If d < &H100% Then
             c = d
             OutputFile.Write(ToByte(c))
          Else
             c = Dump(lzw(d, 0), OutputFile)
             d = Dump(lzw(d, 1), OutputFile)
          End If
    
    
          Return c
       End Function
    
    
    End Module
    Also, what do a, d, j, and pd (all ints) refer to in your original code?
    Last edited by Peter Swinkels; 18th July 2018 at 19:45. Reason: Update.

  20. #14
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,977
    Thanks
    296
    Thanked 1,305 Times in 741 Posts
    d is where bits are accumulated = current lzw symbol;
    pd is previous d value (-1 on start), LZW assigns new symbols to {pd;first-symbol-of-d} (or {pd;first-symbol-of-pd} in case of d=i_lzw)
    j is a current bit index in bytes of input file
    i is a current bit index in LZW symbols
    a is just some temp variable; dump(d) returns the first byte of the string assigned to symbol d, then it's used for writing a symbol to the output file
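
    To tie the variable roles together, here is a sketch of a decoder along the lines described in this thread. It is not the attached source. It assumes LSB-first bit packing, 9-bit codes growing to at most 12 bits, 0x100 as dictionary reset, 0x101 as EOF, and a dictionary that simply stops growing when full; comments map the local names to d, pd, i_lzw, i and j.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// `skip` is the number of header bytes to ignore (10 for the IEA sample).
std::vector<uint8_t> lzw_decode(const std::vector<uint8_t>& in, std::size_t skip) {
    assert(skip <= in.size());
    const int kReset = 0x100, kEof = 0x101, kFirst = 0x102, kMaxBits = 12;
    std::vector<int> prefix(1 << kMaxBits, -1);     // code for all but the last byte
    std::vector<uint8_t> suffix(1 << kMaxBits, 0);  // last byte of the entry
    std::vector<uint8_t> out, tmp;
    int next = kFirst;  // i_lzw: next free dictionary index
    int width = 9;      // wordsize: current code width in bits
    int prev = -1;      // pd: previous code, -1 right after a reset
    uint32_t acc = 0;   // d: bits accumulated for the current code
    int filled = 0;     // i: number of bits already in acc

    for (std::size_t pos = skip; pos < in.size(); ++pos) {
        for (int bit = 0; bit < 8; ++bit) {         // j: bit index in the byte
            acc |= static_cast<uint32_t>((in[pos] >> bit) & 1u) << filled;
            if (++filled < width) continue;
            const int code = static_cast<int>(acc);
            acc = 0;
            filled = 0;

            if (code == kEof) return out;
            if (code == kReset) { next = kFirst; width = 9; prev = -1; continue; }
            if (code > next || (code == next && prev < 0)) return out;  // corrupt input

            // A code equal to `next` is not defined yet: it stands for the
            // previous string followed by that string's own first byte.
            const bool undefined = (code == next);
            tmp.clear();
            int c = undefined ? prev : code;
            while (c >= kFirst) { tmp.push_back(suffix[c]); c = prefix[c]; }
            tmp.push_back(static_cast<uint8_t>(c));
            const uint8_t first = tmp.back();
            out.insert(out.end(), tmp.rbegin(), tmp.rend());
            if (undefined) out.push_back(first);

            // New entry = {previous string; first byte of the current string}.
            if (prev >= 0 && next < (1 << kMaxBits)) {
                prefix[next] = prev;
                suffix[next] = first;
                ++next;
                if (next == (1 << width) && width < kMaxBits) ++width;
            }
            prev = code;
        }
    }
    return out;
}
```

    The dictionary is walked backwards through prefix links instead of recursively, which keeps the stack flat for long strings.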

  21. Thanks:

    Peter Swinkels (18th July 2018)

  22. #15
    Member
    Join Date
    Jul 2018
    Location
    Netherlands
    Posts
    37
    Thanks
    22
    Thanked 0 Times in 0 Posts
    Quote Originally Posted by Shelwien View Post
    d is where bits are accumulated = current lzw symbol;
    pd is previous d value (-1 on start), LZW assigns new symbols to {pd;first-symbol-of-d} (or {pd;first-symbol-of-pd} in case of d=i_lzw)
    j is a current bit index in bytes of input file
    i is a current bit index in LZW symbols
    a is just some temp variable; dump(d) returns the first byte of the string assigned to symbol d, then it's used for writing a symbol to the output file
    Thank you. The vb.net code works properly now btw.

  23. #16
    Member
    Join Date
    Jul 2018
    Location
    Netherlands
    Posts
    37
    Thanks
    22
    Thanked 0 Times in 0 Posts
    Quote Originally Posted by Shelwien View Post
    d is where bits are accumulated = current lzw symbol;
    pd is previous d value (-1 on start), LZW assigns new symbols to {pd;first-symbol-of-d} (or {pd;first-symbol-of-pd} in case of d=i_lzw)
    j is a current bit index in bytes of input file
    i is a current bit index in LZW symbols
    a is just some temp variable; dump(d) returns the first byte of the string assigned to symbol d, then it's used for writing a symbol to the output file
    A few more questions about your C++ code:

    Code:
              wordsize=9; i_lzw=0x102; pd=-1;
            } else {
              if( pd==-1 ) a=d; else {
                a = dump( (d<i_lzw)?d:pd, g );
                lzw[i_lzw][0]=pd; lzw[i_lzw][1]=a; 
                a = (d<i_lzw) ? -1 : a;
                if( ++i_lzw==(1<<wordsize) ) wordsize+=(wordsize<12);
    -Are indexes (i_lzw) lower than 0x102 ever used?
    -What does the "12" in the last line refer to? Why is "wordsize" bitshifted by 12 bits?
    Last edited by Peter Swinkels; 19th July 2018 at 16:58. Reason: typo

  24. #17
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,977
    Thanks
    296
    Thanked 1,305 Times in 741 Posts
    > Are indexes (i_lzw) lower than 0x102 ever used?

    Indexes 0x00..0xFF correspond to plain unpacked bytes (literals), 0x100 is dictionary reset, 0x101 is EOF,
    so yes, it's possible to remove lzw[] init code etc.
    Redundant stuff just remained from initial implementation.

    Btw, it should actually be possible to replace the iterative/recursive lzw[] lookup with a simple offset;length pair,
    with a 4k unpacked data buffer. I somehow expected a more advanced LZW implementation at start.

    > What does the "12" in the last line refer to?

    Dictionary size in the sample seems to be limited by 4096 = 1<<12;
    Actually first 0x100 code occurs when i_lzw already reached 4096, so it should be a 13-bit code,
    but it only has 12 bits in packed data. I had a different implementation at first (both 0x100 and 0x101
    could be used for reset code, with extra bit passed to the next symbol), but mpais' idea that dictionary
    is simply limited by 4k looked more reasonable.

    > Why is "wordsize" bitshifted by 12 bits?

    It's not a shift (<<), it's a comparison (<). In C/C++, the expression (x<12) computes to 1 if x<12, 0 otherwise.
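
    Spelled out, the width rule from the quoted line can be sketched as below. The function name grow_width is made up for this example; it is meant to give the same result as "if (entries == (1 << width)) width += (width < 12);".

```cpp
#include <cassert>

// Returns the code width to use once the dictionary holds `entries` codes.
int grow_width(int entries, int width) {
    assert(width >= 9 && width <= 12);
    const int at_limit = (entries == (1 << width));  // comparison yields 1 or 0
    const int may_grow = (width < 12);               // 1 below 12 bits, else 0
    return width + (at_limit && may_grow);
}
```

    So the width goes 9, 10, 11, 12 as the dictionary reaches 0x200, 0x400 and 0x800 entries, and stays at 12 once 4096 is reached.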

  25. Thanks:

    Peter Swinkels (19th July 2018)

  26. #18
    Member
    Join Date
    Jul 2018
    Location
    Netherlands
    Posts
    37
    Thanks
    22
    Thanked 0 Times in 0 Posts
    Quote Originally Posted by Shelwien View Post
    > Are indexes (i_lzw) lower than 0x102 ever used?

    Indexes 0x00..0xFF correspond to plain unpacked bytes (literals), 0x100 is dictionary reset, 0x101 is EOF,
    so yes, it's possible to remove lzw[] init code etc.
    Redundant stuff just remained from initial implementation.

    Btw, it should actually be possible to replace the iterative/recursive lzw[] lookup with a simple offset;length pair,
    with a 4k unpacked data buffer. I somehow expected a more advanced LZW implementation at start.

    > What does the "12" in the last line refer to?

    Dictionary size in the sample seems to be limited by 4096 = 1<<12;
    Actually first 0x100 code occurs when i_lzw already reached 4096, so it should be a 13-bit code,
    but it only has 12 bits in packed data. I had a different implementation at first (both 0x100 and 0x101
    could be used for reset code, with extra bit passed to the next symbol), but mpais' idea that dictionary
    is simply limited by 4k looked more reasonable.

    > Why is "wordsize" bitshifted by 12 bits?

    It's not a shift (<<), it's a comparison (<). In C/C++, the expression (x<12) computes to 1 if x<12, 0 otherwise.
    Personally I find the way the "dump" procedure is implemented a bit confusing. What is going on there? Oh, I see I confused the comparison and bitshift operators. Duh. A more advanced LZW implementation? The program that uses it is from 1988.

  27. #19
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,977
    Thanks
    296
    Thanked 1,305 Times in 741 Posts
    > Personally I find the way the "dump" procedure is implemented a bit confusing. What is going on there?

    Well, it looked like $104 = $102+$102 = \r\n\r\n at first, so I thought that it's an LZ78 variant that can combine _any_ two symbols, not an actual LZW.
    "What's going on" is recursion - if we know how to print a symbol, then print it (for codes <256), otherwise print two symbols the current one consists of.

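    The recursion described above can be sketched roughly as follows. This is an illustration with made-up names (Pair, expand), not the attached source: every dictionary entry is a pair of symbols, codes below 0x100 are printed directly, and anything else is printed as its left part followed by its right part.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Each dictionary entry is a pair of symbols; codes below 0x100 are plain bytes.
struct Pair { int left; int right; };

// Appends the expansion of `code` to `out` and returns its first byte.
uint8_t expand(const std::vector<Pair>& dict, int code, std::vector<uint8_t>& out) {
    assert(code >= 0 && static_cast<std::size_t>(code) < dict.size());
    if (code < 0x100) {                      // known symbol: print it
        out.push_back(static_cast<uint8_t>(code));
        return static_cast<uint8_t>(code);
    }
    const uint8_t first = expand(dict, dict[code].left, out);  // left part first
    expand(dict, dict[code].right, out);                       // then the right part
    return first;
}
```

    The returned first byte is what the caller needs when it defines the next dictionary entry.
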
    > A more advanced LZW implementation? The program that uses it is from 1988.

    And LZ78 is from 1978, so what?
    There's been very little progress in compression since that time, algorithm-wise.
    Even BWT was (presumably) invented in 1983, original arithmetic coding patent is from 1977, etc.
    Only ANS and some CM ideas are new.
    And the recent trend is to use simpler algorithms actually (because of how hardware evolved), so
    I'd not be surprised if there was some old thing more advanced than what's recently available.
    For example, a PPM archiver called HA was popular for a time in 199x, but now PPM is considered too slow
    and LZ77 is mostly used (even if compression is worse).

    P.S. Here's another version, with dictionary buffer instead of recursion.
    Attached Files

  28. Thanks:

    Peter Swinkels (19th July 2018)

  29. #20
    Member
    Join Date
    Jul 2018
    Location
    Netherlands
    Posts
    37
    Thanks
    22
    Thanked 0 Times in 0 Posts
    Quote Originally Posted by Shelwien View Post
    > Personally I find the way the "dump" procedure is implemented a bit confusing. What is going on there?

    Well, it looked like $104 = $102+$102 = \r\n\r\n at first, so I thought that it's an LZ78 variant that can combine _any_ two symbols, not an actual LZW.
    "What's going on" is recursion - if we know how to print a symbol, then print it (for codes <256), otherwise print two symbols the current one consists of.

    > A more advanced LZW implementation? The program that uses it is from 1988.

    And LZ78 is from 1978, so what?
    There's been very little progress in compression since that time, algorithm-wise.
    Even BWT was (presumably) invented in 1983, original arithmetic coding patent is from 1977, etc.
    Only ANS and some CM ideas are new.
    And the recent trend is to use simpler algorithms actually (because of how hardware evolved), so
    I'd not be surprised if there was some old thing more advanced than what's recently available.
    For example, a PPM archiver called HA was popular for a time in 199x, but now PPM is considered too slow
    and LZ77 is mostly used (even if compression is worse).

    P.S. Here's another version, with dictionary buffer instead of recursion.
    Nice code. It decompresses the IEA file perfectly. Did you know that instead of:
    Code:
    int i;
    for(i = 0; i < 10; i++) {printf("%d\n", i);}
    you can declare the iterator variable in the "for" statement:
    Code:
    for(int i = 0; i < 10; i++) {printf("%d\n", i);}
    ?

  30. #21
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,977
    Thanks
    296
    Thanked 1,305 Times in 741 Posts
    Yes :) But it can be very inconvenient when you actually edit the source.

  31. #22
    Member
    Join Date
    Feb 2016
    Location
    Luxembourg
    Posts
    546
    Thanks
    203
    Thanked 796 Times in 322 Posts
    @Shelwien
    You're forgetting to account for the possibility of the encoder skipping on the dictionary reset, i.e., the dictionary becoming static.

  32. #23
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,977
    Thanks
    296
    Thanked 1,305 Times in 741 Posts
    It's easy enough to handle - need to add something like
    d_ptr=__min(d_ptr,4096);
    i_lzw=__min(i_lzw,4096);
    But it seems unlikely, or it won't have 0x100 code exactly at the point where i_lzw reaches 4k (or at start).

  33. #24
    Member
    Join Date
    Jul 2018
    Location
    Netherlands
    Posts
    37
    Thanks
    22
    Thanked 0 Times in 0 Posts
    Quote Originally Posted by Peter Swinkels View Post
    Nice code. It decompresses the IEA file perfectly. Did you know that instead of:
    Code:
    int i;
    for(i = 0; i < 10; i++) {printf("%d\n", i);}
    you can declare the iterator variable in the "for" statement:
    Code:
    for(int i = 0; i < 10; i++) {printf("%d\n", i);}
    ?
    Your code appears to be accessing the arrays out of bounds when "d" or "pd" are less than zero. Doesn't this cause unpredictable results? Also, could you provide a short summary of the compression algorithm being used? Or describe the decompression procedure in pseudo code?

  34. #25
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,977
    Thanks
    296
    Thanked 1,305 Times in 741 Posts
    Yeah, replace "pd=-1" with "pd=0", I forgot.

  35. Thanks:

    Peter Swinkels (20th July 2018)

  36. #26
    Member
    Join Date
    Jul 2018
    Location
    Netherlands
    Posts
    37
    Thanks
    22
    Thanked 0 Times in 0 Posts
    That did it. Also it occurred to me that "d_ptr" is a pointer, which means it should be passed by reference (ByRef in vb.net) to the "dump" procedure.

  37. #27
    Member
    Join Date
    Feb 2016
    Location
    Luxembourg
    Posts
    546
    Thanks
    203
    Thanked 796 Times in 322 Posts
    Quote Originally Posted by Shelwien View Post
    It's easy enough to handle - need to add something like
    d_ptr=__min(d_ptr,4096);
    i_lzw=__min(i_lzw,4096);
    Not quite, you need to check if i_lzw is equal to 4096 before adding new entries to the dictionary. If you do it like that (and it should be min(..., 4095), not 4096, or you'll still be out of bounds) you will then keep overwriting the last entry until a dictionary reset.
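
    A small sketch of the check described above, with illustrative names (Dictionary, add_entry) that do not come from the attached sources. It assumes a 4096-entry table whose first free code is 0x102. Once the table is full, new entries are refused instead of overwriting the last one, so the dictionary stays static until a reset.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

const int kMaxEntries = 4096;  // 12-bit code space, as seen in the sample file
const int kFirstFree = 0x102;  // 0x100 = reset, 0x101 = EOF

struct Dictionary {
    std::vector<int> prefix = std::vector<int>(kMaxEntries, -1);
    std::vector<uint8_t> suffix = std::vector<uint8_t>(kMaxEntries, 0);
    int next = kFirstFree;
};

// Adds {prev, first} unless the table is full. Returns true if an entry was
// written; false means the dictionary is frozen until the next reset code.
bool add_entry(Dictionary& d, int prev, uint8_t first) {
    assert(prev >= 0 && prev < d.next);
    if (d.next == kMaxEntries) return false;
    d.prefix[d.next] = prev;
    d.suffix[d.next] = first;
    ++d.next;
    return true;
}
```

    A reset would set next back to kFirstFree; stale entries above it are harmless as long as the decoder rejects codes greater than next.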

    Quote Originally Posted by Shelwien View Post
    But it seems unlikely, or it won't have 0x100 code exactly at the point where i_lzw reaches 4k (or at start).
    We can't really know based on just one file. If anything, the opposite makes a lot more sense: if you always reset the dictionary when you reach 4096 entries, why would you need to emit a code just for that? That said, I've seen my fair share of LZW codecs that behave like that, I guess the authors probably didn't really understand the algorithm.

  38. #28
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,977
    Thanks
    296
    Thanked 1,305 Times in 741 Posts
    > you will then keep overwriting the last entry until a dictionary reset.

    That's what I meant, just have to extend size of lzw[] by +1 to avoid overwriting the last usable entry.
    (Well, it was 64k in my source anyway.)

    Dictionary is the tricky one actually - new symbols are added at every _symbol_, not every unpacked byte.
    So a long series of d=i_lzw codes can bring unpacked size to 8386560 or so, before reaching i_lzw=4k.
    But I guess we can afford it on a PC, so how about this:
    Attached Files
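
    One way to arrive at a figure of that size: if every code is the one the decoder is about to define (d = i_lzw), each code expands to a string one byte longer than the previous one, so the total output is a triangular sum 1 + 2 + ... + n. With n = 4095 this gives exactly 8386560. Counting more strictly, under the assumption of one leading literal followed by codes 0x102 through 0xFFF, the longest string is 3839 bytes and the total comes out somewhat lower.

```cpp
#include <cassert>
#include <cstdint>

// Total bytes produced when successive codes expand to strings of length
// 1, 2, ..., n (each code being the one the decoder is about to define).
uint64_t triangular(uint64_t n) {
    assert(n < (1ull << 32));  // keeps n * (n + 1) within 64 bits
    return n * (n + 1) / 2;
}
```

    triangular(4095) is 8386560, and triangular(3839) is 7370880 for the stricter count. Either way the output can reach several megabytes before the dictionary fills.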

  39. #29
    Member
    Join Date
    Jul 2018
    Location
    Netherlands
    Posts
    37
    Thanks
    22
    Thanked 0 Times in 0 Posts
    Quote Originally Posted by mpais View Post
    We can't really know based on just one file. If anything, the opposite makes a lot more sense: if you always reset the dictionary when you reach 4096 entries, why would you need to emit a code just for that? That said, I've seen my fair share of LZW codecs that behave like that, I guess the authors probably didn't really understand the algorithm.
    Here are more files to experiment on: https://drive.google.com/open?id=1Jk...WVjN5FUCTMkWwi

    Code:
    [Compressed File Archives]
    
    
    [File Information]
    Endianness: little
    Extension: *.pea
    
    
    [File Layout]
    Begin Structure: Compressed file - repeat for each file in the archive.
        0x00    0x03 BYTES    A prefix for each compressed file (0x1A + "EA").
        0x03    0x0D BYTES    The compressed file's name. Padded with 0x00 bytes.
        0x10    WORD        The year, month, and day at which the file was created/last modified. ***
        0x12    WORD        The hour, minute, and second at which the file was created/last modified. ******
        0x14    BYTE        Indicates whether the file data is compressed. (0x00 = FALSE, 0x01 = TRUE)
        0x15    DWORD        The file's uncompressed size.
        0x19    DWORD        The file's compressed size.
        0x1D    BYTE        Compressed file header size.
        0x1E    0x12 BYTES    Null.
        0x30    BYTE        Indicates whether the data is uncompressed. {0x00 = compressed, 0x01 = uncompressed}
        0x31    BYTE        The same as for *.iea files.
    End Structure
    
    
    Notes:
    *** Bits: YYYYYYYMMMMDDDDD
    ****** Bits: HHHHHMMMMMMSSSSS - Seconds are stored in two second intervals.
    Last edited by Peter Swinkels; 21st July 2018 at 14:29. Reason: typo - fixed a mistake
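
    For experimenting with the layout above, here is a hedged sketch of a header reader. The struct and function names are made up for this example. It assumes that 0x32 bytes cover all listed fields and that the date WORD uses the standard MS-DOS packing (7 bits year since 1980, 4 bits month, 5 bits day), which the notes suggest but do not confirm.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Field names are illustrative; offsets follow the layout listed above.
struct PeaHeader {
    std::string name;        // 0x03, up to 13 bytes, 0x00-padded
    uint16_t date;           // 0x10
    uint16_t time;           // 0x12
    bool compressed;         // 0x14
    uint32_t unpacked_size;  // 0x15
    uint32_t packed_size;    // 0x19
    uint8_t header_size;     // 0x1D
};

const std::size_t kPeaHeaderBytes = 0x32;  // last listed field is the BYTE at 0x31

static uint16_t read_u16(const uint8_t* p) {  // little-endian WORD
    return static_cast<uint16_t>(p[0] | (p[1] << 8));
}
static uint32_t read_u32(const uint8_t* p) {  // little-endian DWORD
    return static_cast<uint32_t>(p[0]) | (static_cast<uint32_t>(p[1]) << 8) |
           (static_cast<uint32_t>(p[2]) << 16) | (static_cast<uint32_t>(p[3]) << 24);
}

// Returns false if the buffer is too short or the 0x1A "EA" prefix is missing.
bool parse_pea_header(const uint8_t* buf, std::size_t len, PeaHeader& h) {
    assert(buf != nullptr);
    if (len < kPeaHeaderBytes) return false;
    if (buf[0] != 0x1A || buf[1] != 'E' || buf[2] != 'A') return false;
    h.name.clear();
    for (std::size_t i = 0; i < 0x0D && buf[3 + i] != 0; ++i)
        h.name.push_back(static_cast<char>(buf[3 + i]));
    h.date = read_u16(buf + 0x10);
    h.time = read_u16(buf + 0x12);
    h.compressed = (buf[0x14] != 0);
    h.unpacked_size = read_u32(buf + 0x15);
    h.packed_size = read_u32(buf + 0x19);
    h.header_size = buf[0x1D];
    return true;
}

// Assumed MS-DOS date packing: year since 1980, month, day.
int dos_year(uint16_t d)  { return 1980 + (d >> 9); }
int dos_month(uint16_t d) { return (d >> 5) & 0x0F; }
int dos_day(uint16_t d)   { return d & 0x1F; }
```

    Since an archive repeats this structure for each file, a reader would presumably advance by the header size plus the compressed size to reach the next entry; that step is a guess and is left out here.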

  40. #30
    Member
    Join Date
    Feb 2016
    Location
    Luxembourg
    Posts
    546
    Thanks
    203
    Thanked 796 Times in 322 Posts
    Quote Originally Posted by Peter Swinkels View Post
    Here are more files to experiment on: https://drive.google.com/open?id=1Jk...WVjN5FUCTMkWwi
    Done, see attachment. Also, the endianness is little-endian, not big.

  41. Thanks:

    Peter Swinkels (21st July 2018)
