
Thread: Presentation and asking some tips

  1. #1
    Member
    Join Date
    May 2017
    Location
    Spain
    Posts
    17
    Thanks
    19
    Thanked 3 Times in 3 Posts

    Presentation and asking some tips

Hi all, I'm new to posting here, but I've been reading the forum for a long time. I've always been interested in compression (and information theory in general) but never coded a compressor or anything similar. I've used libraries when I needed them, but never invented something like you guys have.

I'm writing here because I need some hints about some experiments I'm doing, and because standard modern compression tools aren't the right fit here.

I need to save space by compressing very short strings (between 10 and 255 characters), but... I need it (if possible) to work on an Intel 8088 processor.
Yeah, I know what you're thinking. I'm interested in old-computer programming, and the idea is to have it working on an original IBM PC. The project is not mine, but I'm trying to help, because I love old computers and quirky problems.

I've checked the shoco compressor, and maybe I'll try to implement it in x86 asm (remember: 64 KB segments, 16-bit registers).

The problem is: I have around 17,000 strings between 10 and 255 characters in length, sorted alphabetically, standard ASCII text (filenames), and I want to implement a bigram index to support type-ahead searching on an 8 MHz processor.

I've also looked at Smaz, but it's word-based and it seems to need too much code and data to use here.

I also thought about packing the text lines into blocks of, say, 1 KB, then using LZ4: seek to the block, decompress it, and access the line. But that adds too much disk I/O.


I know you guys are the experts here, and maybe you can point me in the right direction. Or if you think this is impossible or not feasible, just tell me and I'll keep the text uncompressed.

    Thanks for your time reading this little question of mine.

  2. Thanks:

    snowcat (22nd May 2017)

  3. #2
    Member
    Join Date
    Dec 2011
    Location
    Cambridge, UK
    Posts
    506
    Thanks
    187
    Thanked 177 Times in 120 Posts
Alphabetically sorted means you could probably do something really trivial such as prefix-length coding (frog, frogger, frolic, frolicking) to (frog, <4>ger, <3>lic, <6>king), maybe a predefined dictionary of phrases that appear commonly ("ing", "ment", "tion" etc. in English words) and some simple minimal entropy encoding. The Huffman part of "deflate" is a good place to study for compact entropy encoding that runs on old hardware. All of these ought to be quick and easy to get going on ancient hardware. Better entropy encoding is possible of course (ANS, arithmetic, etc.) but usually requires divisions somewhere, or is slower due to being a bit-based codec. It all depends on what the strings look like, though, whether you require the ability to do random access, and how fast it has to be.
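The prefix-length (front coding) trick is only a few lines. A sketch in Python as a model; the on-target version would be a small 8088 asm loop over the same pairs:

```python
def front_encode(sorted_strings):
    """Encode each string as (shared-prefix length, remaining suffix)."""
    out, prev = [], ""
    for s in sorted_strings:
        n = 0
        # Count how many leading chars this string shares with the previous one.
        while n < len(prev) and n < len(s) and prev[n] == s[n]:
            n += 1
        out.append((n, s[n:]))
        prev = s
    return out

def front_decode(entries):
    """Rebuild the original strings from (prefix length, suffix) pairs."""
    out, prev = [], ""
    for n, suffix in entries:
        prev = prev[:n] + suffix
        out.append(prev)
    return out
```

Since decoding only copies bytes forward, it is cheap even on an 8088; the catch is that random access into a front-coded run requires decoding from the start of the run.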

    I guess the first thing to do is to just try basic generic tools such as gzip, lzma, bsc, etc and see what compression ratios are realistic. If you can get in the right ballpark with a custom but simple algorithm then it's likely good enough and not worth chasing the ratio crown.


Edit: sorry, I see you mentioned a bigram index. That presumably means you want random access. Some form of compressed suffix array, maybe; an FM-index may help you.

  4. #3
    Member
    Join Date
    May 2017
    Location
    Spain
    Posts
    17
    Thanks
    19
    Thanked 3 Times in 3 Posts
Yes, I need random access, because an 8 MHz CPU and a slow HD are what I have. The strings are standard file names: you start typing the name of a game and the program starts showing candidates. It's not a mandatory feature, but I think it would be a nice thing to have; I just don't know if it's really feasible on that hardware.

I'll look into what you suggested, and maybe try shoco: compressing on Windows and just a small decompressor on the 8088. Thanks jamesb!

  5. #4
    Member
    Join Date
    Jul 2008
    Location
    Netherlands
    Posts
    16
    Thanks
    2
    Thanked 1 Time in 1 Post
Reading this, and always interested in this kind of question:

I think you should focus on indexing the data. The data is a bit too much for the 8088: based on the statistical average length, your raw data is over 2 MB in size. It could compress to a lot less, but you don't have the CPU power to process it all. I think the faster and more efficient way is to index the data and make (LZ4) chunks for, for example, every file starting with a, b, c, etc. Then you decompress only what's needed. LZ4 is fast enough for that, and there is an 8088 implementation in assembler.

If the filenames use a fairly standard alphabet (i.e. a-z, space, dot, ampersand, apostrophe and digits, 40 characters max total, case-insensitive) with no other characters, you could compress 3 bytes into 2 bytes, making index access a bit faster. It only requires 2 muls and 3 adds (and a simple lookup table).
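The 3-chars-into-2-bytes packing works because 40^3 = 64,000, which fits in a 16-bit word. A minimal sketch (the exact 40-symbol alphabet below is an assumption; the real set would be chosen from the data):

```python
# Assumed 40-symbol alphabet: space, a-z, digits, dot, ampersand, apostrophe.
ALPHABET = " abcdefghijklmnopqrstuvwxyz0123456789.&'"
INDEX = {ch: i for i, ch in enumerate(ALPHABET)}

def pack3(c0, c1, c2):
    # 40**3 = 64000 <= 65535, so three symbols fit in one 16-bit word.
    # On the 8088 this is two MULs and two ADDs plus a table lookup.
    return (INDEX[c0] * 40 + INDEX[c1]) * 40 + INDEX[c2]

def unpack3(v):
    # Reverse with two divmods (or table-free div by 40 on the 8088).
    v, i2 = divmod(v, 40)
    i0, i1 = divmod(v, 40)
    return ALPHABET[i0] + ALPHABET[i1] + ALPHABET[i2]
```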

    E

  6. #5
    Member
    Join Date
    May 2017
    Location
    Spain
    Posts
    17
    Thanks
    19
    Thanked 3 Times in 3 Posts
Thanks for the tips, edcassa. That's one of the ideas I was considering before starting to hack on this: make blocks, and look for a trade-off between random access and space saved.

The file is exactly 1173 KB. You could ask: why not simply leave the file uncompressed? Well, this is supposed to work from an HD (20-40 MB), not from 360 KB floppies.

I was also thinking (and this is not related to compression) about the linked lists of bigrams pointing to the strings. Being bigrams, the list of lines where each bigram appears could be large,
so I also have to retrieve these lists and intersect them to show results.

Also, nice idea about compressing 3 chars into 2 bytes. Let's see if I can get this working (if not, no problem, because there is already a working word-based search).

    Thanks guys. If you have other crazy ideas, just tell me!

  7. #6
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,572
    Thanks
    779
    Thanked 687 Times in 372 Posts
1. How much effort can you put into programming this?
2. What is the required time-to-decompress?
3. If I understand correctly, you have 26*26 blocks, each requiring full decompression before use?

Overall, you should use a shared "dictionary" scheme. For example, for the deflate algorithm, the dictionary may be a set of common byte sequences and/or a Huffman table. When you need to extract one of your 26*26 data sets, you first read the "dictionary" used to compress that particular dataset, then decompress the data using it.

    "common byte sequences" part of dictionary can be built by the zstd "dictionary builder"

  8. #7
    Member
    Join Date
    May 2017
    Location
    Spain
    Posts
    17
    Thanks
    19
    Thanked 3 Times in 3 Posts
Thanks for the answers. I'm going to diverge a bit from compression, as compression is just part of the problem (it helps to alleviate it).

Warning: long rant here; skip if you think it's a silly idea.

I have 17,000 files: names of DOS games. Example:

    >Cybersphere (1996)(Psycon Software) [Action]
    >Cybersphere Plus (1997)(Psycon Software) [Action]
    >Cyberwars v1.1 [SW] (1992)(Emerging Ventures and Arts) [Action]
    >Cyborg [DC] (1982)(Sentient Software) [Adventure, Interactive Fiction]
    >Cybot (1991)(Ed T. Toton III) [Action]
    >Cybotron (Es) (1989)(Infodisc) [Action]
    >Cycle Warz [SW] (1995)(Vectorscope Software) [Action]
    >Cycles- International Grand Prix Racing (1989)(Accolade, Inc.) [Racing - Driving, Sports]
    >Cycles- International Grand Prix Racing [h1][b1] (1989)(Accolade, Inc.) [Racing - Driving, Sports]

This text file is 1173 KB. As I said, this is peanuts today, but not on a 1981 IBM PC.

The idea is to implement a type-ahead search over this data. I've contemplated some options (tries, fuzzy search, prefix and suffix tries, n-grams, etc.). I'm now trying a simple bigram->title inverted index (intersecting bigrams as you type them), since the rest seem too memory- or processor-hungry for this system.

As disk retrieval is slow as hell, I thought it could benefit a lot from compressing the titles: when I have to retrieve them from disk, I'll save some precious time.

I've looked at the shoco, Smaz and LZ4 compressors, as more complex ones could be too slow, and memory and registers are scarce, so...

There is already a working prototype indexing words only. This is not my project, but I'm trying to help with it. I've always liked constrained platforms because you have to think a lot about implementation; I guess you guys could also love tinkering with these things.

So my idea is:

Bigram index -> list of pointers to titles (this could be very long, so maybe I could try a simple compression here, for example differential encoding: subtracting consecutive pointers and bit-packing the deltas).

There are only 1080 bigrams (in the space of 0xFFFF), so I could store them sorted as bigram/pointer-to-title-list pairs (each a word). Then with a REP SCASW I could find the bigram's pointer (getting the list, decompressing it and storing it in memory, ready to intersect with the next bigram's list of pointers).
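The whole plan (posting list per bigram, delta-coded pointers, intersection as you type) can be modeled in a few lines of Python before committing to asm:

```python
def build_index(titles):
    # Posting lists: bigram -> sorted list of title numbers containing it.
    index = {}
    for i, t in enumerate(titles):
        for j in range(len(t) - 1):
            index.setdefault(t[j:j + 2].lower(), set()).add(i)
    return {bg: sorted(ids) for bg, ids in index.items()}

def delta_encode(ids):
    # Store differences between consecutive pointers; the small deltas
    # can then be bit-packed to shrink long lists.
    return ids[:1] + [b - a for a, b in zip(ids, ids[1:])]

def search(index, typed):
    # Intersect the posting list of every bigram typed so far.
    result = None
    for j in range(len(typed) - 1):
        ids = set(index.get(typed[j:j + 2].lower(), []))
        result = ids if result is None else result & ids
    return sorted(result) if result else []
```

Each extra typed character adds one more intersection, so candidate lists only shrink as the user types.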

Retrieving titles to show in the interface could be slow, so the best idea I have is the one edcassa hinted at: packing titles in chunks of, say, 512 bytes max per compressed chunk (one sector). There is a very fast LZ4 already implemented in 8088 asm (http://lz4.github.io/lz4/).

As I said, all of this is up in the air. I'll try LZ4 this way over the weekend and test it in an IBM PC emulator to see if it's workable.

    Thanks guys and excuse the off-topic and long rant.

  9. #8
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,572
    Thanks
    779
    Thanked 687 Times in 372 Posts
For a human, a 0.2 s delay feels "immediate". unarjz 0.15 can decompress data at 100 KB/s, so you can extract 20 KB in this timeframe.

So I propose to sort the strings, split them by their first 2 chars, then combine slices into chunks of up to 20 KB, and compress every chunk with arjz. Additionally, you can build a shared dictionary for all these chunks with the zstd tool.

  10. #9
    Member
    Join Date
    Mar 2013
    Location
    Worldwide
    Posts
    565
    Thanks
    67
    Thanked 199 Times in 147 Posts
    For a simple implementation, you can do the following:
- Split your sorted string file into fixed-size blocks (e.g. 64 KB); each block must begin at the start of a string.
- Compress each block using a general-purpose compression library (zlib, LZ4, ...).
- Store the compressed variable-length blocks in a file.
- Maintain a memory-resident array with the first string of each block, the block's file offset and its compressed length.

    You can then use a binary search in the internal array and access/decompress (and search) the block using the corresponding file offset.
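The scheme above, sketched in Python (zlib standing in for whatever library is used on the target; block boundaries always fall on line boundaries):

```python
import bisect
import zlib

def build_blocks(sorted_lines, block_size=65536):
    # Pack whole lines into ~block_size blocks, compress each block, and
    # keep a directory of (first string, file offset, compressed length).
    directory, blob, block, size = [], b"", [], 0
    for line in sorted_lines:
        block.append(line)
        size += len(line) + 1
        if size >= block_size:
            comp = zlib.compress("\n".join(block).encode())
            directory.append((block[0], len(blob), len(comp)))
            blob += comp
            block, size = [], 0
    if block:  # flush the final partial block
        comp = zlib.compress("\n".join(block).encode())
        directory.append((block[0], len(blob), len(comp)))
        blob += comp
    return directory, blob

def lookup(directory, blob, prefix):
    # Binary-search the directory for the block that could hold `prefix`,
    # then decompress and scan only that single block.
    firsts = [d[0] for d in directory]
    i = max(bisect.bisect_right(firsts, prefix) - 1, 0)
    _, off, clen = directory[i]
    lines = zlib.decompress(blob[off:off + clen]).decode().split("\n")
    return [l for l in lines if l.startswith(prefix)]
```

Only the small directory has to stay resident; each query costs one seek plus one block decompression.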

  11. #10
    Member
    Join Date
    May 2017
    Location
    Spain
    Posts
    17
    Thanks
    19
    Thanked 3 Times in 3 Posts
    Guys, thanks for all ideas! I have some coding to do here to test things!
    Regards!

    edit:

Bulat, by "zstd tool" do you mean generating a dict from the data with this: https://github.com/facebook/zstd ?
But... could I mix LZ4 compression (or ARJZ) with the dictionary from zstd? Or am I mixing things up in my mind?

    thanks!

  12. #11
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,572
    Thanks
    779
    Thanked 687 Times in 372 Posts
AFAIK, this dictionary is just a common set of used byte sequences, so you can use any LZ with the dictionary preceding the real data. Of course, if you don't know how LZ works internally, it may be hard to implement.

  13. #12
    Member
    Join Date
    May 2017
    Location
    Spain
    Posts
    17
    Thanks
    19
    Thanked 3 Times in 3 Posts
Ahhh, OK! I understand now, thanks!! You guys rock!
