
Thread: Pre-compressor project. Seeking collaboration.

  1. #1
    Member
    Join Date
    Aug 2016
    Location
    Lisbon
    Posts
    19
    Thanks
    4
    Thanked 8 Times in 8 Posts

    Talking Pre-compressor project. Seeking collaboration.

    Hi fellows

I believe a pre-compressor could be built that represents the same information in fewer bytes without turning it into random-looking data: essentially a dictionary compressor plus some added concepts (almost a dozen). Running the result through paq should then compress better than the original data would, since the input already weighs less. Occam's razor, right?

An injury kept me off work for two months, which let me start working on it, but now I have to work again and I think I won't finish it for years (the concepts are also very hard to implement at my skill level). I asked friends for help, but they lack either the skill or the time.

Skills needed beyond OOP programming are regexes, basic algorithms, and math (multicombinations). It's in PHP right now because that's the language I've worked with for the last few years, but it's just a draft that could easily be translated for a fresh start in C. It's about 12 classes over 2k lines so far, and the finished version shouldn't take more than double that.


I could only test parts of the project, but I have already thought through the whole implementation, which could later become the basis for an AI-like program that compresses.


For example, the words alone in enwik weigh about 80 MB and compressed to 18 MB in my first trials, and I've thought of an improvement that would shave off a few more MB.
I believe the symbols weigh about 10 MB and could be compressed to less than 3 MB.
The remaining 10 MB belong to the added concepts and should be handled by the mixing of those concepts, at a negligible cost, roughly the size of the program itself (<1 MB).

I believe it could stay within the 1 GB memory limit (for the Hutter Prize) and decompress in less than an hour on a modern computer, making it suitable for end users and so a potential business later on.

Compression could take 3 days for enwik8 on a modern machine with uncapped resources (i7 CPU @ 2.5-3.2 GHz, 8 GB RAM). But the point is to compress once and decompress many times, right?

So I'm looking for one, or at most two, collaborators to turn my idea into code.
You should be a committed, trustworthy, loyal person with the skills mentioned above.
The work could take a few months of half-day spare time. If you're interested, send me a PM with your GitHub profile, if you have one, or some code samples.

Best case, we start by winning the Hutter Prize; worst case, we build a compressor better than WinRAR.



  2. #2
    Member
    Join Date
    Sep 2007
    Location
    Denmark
    Posts
    895
    Thanks
    54
    Thanked 109 Times in 86 Posts
Every so often this forum gets a new person whose first post is about some breakthrough compression theory that should work but always seems to be missing the programmer to build it.
So far, all of these have been theories full of flaws and misunderstandings of what data is and how simple math works.

Now, I'm not saying that's the situation here, but it's very hard to tell the difference, so you're starting an uphill battle on the PR side. Your post gives no indication that you really have anything worthy of people's attention. Please let me explain:
you don't explain anything about what the idea actually is, and you mention some compression numbers but with no argument or evidence for where they come from.
Also, I'm not even sure you're using Occam's razor in the correct context here.


So even if you have a great new idea, the way you present it makes you look just like all the other "magicians" out there who think they somehow broke math.
Without further info, it's probably going to be hard to get people to work on your project.

  3. Thanks:

    Cristo (7th August 2016)

  4. #3
    Member
    Join Date
    Aug 2016
    Location
    Lisbon
    Posts
    19
    Thanks
    4
    Thanked 8 Times in 8 Posts
Part of the process of discovery and invention is throwing out magic ideas; once in a while, one of them works. Or maybe I'm just a fool again.

I'll briefly explain my idea for the processor. Dictionaries are usually limited in size or by some other factor, such as computation. I thought of a way to process every type of data and info in the enwik file through dictionaries and other techniques, so that everything is represented in fewer bytes than before, but without converting it into random-looking binary data. The pre-compressor just changes how the data is represented; it is not strictly compressing, just finding the smallest value that represents it. (That was the Occam's razor similarity: shaving the symbols down to the simplest form possible, which would be obvious but isn't done.)

The various classes or categories are: wiki pages, XML, tags, markdown, phrases, words, numbers (including things like dates and IPs), and punctuation marks. Then there are content mixers/un-mixers.

The resulting file weighs less than the original but is not compressed yet. A binary compressor should then compress it to less than the original would, because it doesn't care what the symbols represent.

    Compare for example:
    Lorem ipsum Lorem ipsum Lorem ipsum

    and

    Lorem
    ipsum
    lilili

Since the program size is negligible, the second form should compress to less. That's for words; the same applies to all the classes of data.
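
A minimal C++ sketch of just the word part (illustrative only; the marker bytes and decimal-index format here are assumptions for the demo, not the project's actual code):

Code:
// Minimal sketch of the word-dictionary transform described above.
// Assumptions: words are maximal alphabetic runs, the dictionary is
// emitted in first-seen order, and each occurrence becomes a decimal
// index framed by the bytes 0x01/0x02, so the output stays
// byte-oriented and fully reversible.
#include <cctype>
#include <iostream>
#include <string>
#include <unordered_map>
#include <vector>

int main() {
    std::string text = "Lorem ipsum Lorem ipsum Lorem ipsum";
    std::unordered_map<std::string, size_t> index;  // word -> dictionary slot
    std::vector<std::string> dict;                  // words in first-seen order
    std::string body;                               // transformed text

    for (size_t i = 0; i < text.size();) {
        if (std::isalpha((unsigned char)text[i])) {
            size_t j = i;
            while (j < text.size() && std::isalpha((unsigned char)text[j])) ++j;
            std::string word = text.substr(i, j - i);
            auto it = index.find(word);
            if (it == index.end()) {                // new word: add to dictionary
                it = index.emplace(word, dict.size()).first;
                dict.push_back(word);
            }
            body += '\x01' + std::to_string(it->second) + '\x02';
            i = j;
        } else {
            body += text[i++];                      // pass other bytes through
        }
    }
    for (const auto& w : dict) std::cout << w << '\n';  // dictionary header
    std::cout << body << '\n';                          // index stream
}

On the Lorem ipsum example this prints each word once in the dictionary and a body of six tiny index tokens: fewer bytes, not yet "compressed", and still reversible.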

  5. Thanks:

    SolidComp (18th August 2016)

  6. #4
    Member
    Join Date
    Jan 2014
    Location
    Bothell, Washington, USA
    Posts
    695
    Thanks
    153
    Thanked 182 Times in 108 Posts
    Have you looked at XWRT?

  7. Thanks (2):

    Cristo (9th August 2016),SolidComp (18th August 2016)

  8. #5
    Member
    Join Date
    Aug 2016
    Location
    Lisbon
    Posts
    19
    Thanks
    4
    Thanked 8 Times in 8 Posts
I didn't know about that; I'll take a look at the GitHub repo next weekend if I can. It seems I had a similar idea for one of my concepts.


    XWRT 3.4 (5.11.2007) - XML compressor by P.Skibinski, inikep@gmail.com
    * Compression level=13
    - encoding enwik8 to enwik8.xwrt (lpaq6 774 MB)
    warning: dictionary too big, you can use -b option to increase buffer size
    + dynamic dictionary 352133/524288 words
    + loaded dictionary 16639/559168 words
    + dynamic dictionary 16639/559168 words
    + encoding finished (100000000->18805611 bytes, 1.504 bpc) in 79.42s (1230 kb/s)
    Last edited by Cristo; 9th August 2016 at 20:39.

  9. #6
    Member SolidComp's Avatar
    Join Date
    Jun 2015
    Location
    USA
    Posts
    242
    Thanks
    96
    Thanked 47 Times in 31 Posts
    Quote Originally Posted by Cristo View Post
Part of the process of discovery and invention is throwing out magic ideas; once in a while, one of them works. Or maybe I'm just a fool again.

I'll briefly explain my idea for the processor. Dictionaries are usually limited in size or by some other factor, such as computation. I thought of a way to process every type of data and info in the enwik file through dictionaries and other techniques, so that everything is represented in fewer bytes than before, but without converting it into random-looking binary data. The pre-compressor just changes how the data is represented; it is not strictly compressing, just finding the smallest value that represents it. (That was the Occam's razor similarity: shaving the symbols down to the simplest form possible, which would be obvious but isn't done.)

The various classes or categories are: wiki pages, XML, tags, markdown, phrases, words, numbers (including things like dates and IPs), and punctuation marks. Then there are content mixers/un-mixers.

The resulting file weighs less than the original but is not compressed yet. A binary compressor should then compress it to less than the original would, because it doesn't care what the symbols represent.

Compare for example:
Lorem ipsum Lorem ipsum Lorem ipsum

and

Lorem
ipsum
lilili

Since the program size is negligible, the second form should compress to less. That's for words; the same applies to all the classes of data.
    Cristo, how familiar are you with Huffman coding? I like where you're going with this, but be advised that Huffman coding already achieves something much like the compact dictionary that you describe. Deflate uses LZ77 for the string matching, and then compresses the symbols with Huffman coding. So for example, if we have an often repeated string like:

    <div class="
    we're not actually going to store it as that literal 12-byte string. It will be assigned a much smaller symbol.

    Your idea would reduce the uncompressed size, if the compact form were directly processed by the consuming application, which is helpful (less RAM, faster parsing). But as Kennon said, XWRT covers some of this ground, as does the XPack XML compressor. Check it out.
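
A toy illustration of the effect (the symbols and counts below are invented for the example): build the Huffman tree over symbol frequencies and print each code, and the frequent string ends up with a short code while the rare word gets a long one.

Code:
// Toy Huffman builder: frequent symbols get short codes.
#include <iostream>
#include <queue>
#include <string>
#include <vector>

struct Node {
    long freq;
    std::string sym;                      // empty for internal nodes
    Node* left = nullptr;
    Node* right = nullptr;
};

struct ByFreq {                           // min-heap on frequency
    bool operator()(const Node* a, const Node* b) const { return a->freq > b->freq; }
};

void printCodes(const Node* n, const std::string& code) {
    if (!n->left) {                       // leaf: print its code
        std::cout << n->sym << " -> " << code << " (" << code.size() << " bits)\n";
        return;
    }
    printCodes(n->left, code + "0");
    printCodes(n->right, code + "1");
}

int main() {
    // invented frequencies, for illustration only
    std::vector<std::pair<std::string, long>> freqs = {
        {"<div class=\"", 5000}, {"the", 3000},
        {"Lorem", 40}, {"ipsum", 40}, {"rare-word", 1}};
    std::priority_queue<Node*, std::vector<Node*>, ByFreq> pq;
    for (const auto& p : freqs) pq.push(new Node{p.second, p.first});
    while (pq.size() > 1) {               // merge the two lightest subtrees
        Node* a = pq.top(); pq.pop();
        Node* b = pq.top(); pq.pop();
        pq.push(new Node{a->freq + b->freq, "", a, b});
    }
    printCodes(pq.top(), "");             // nodes leak; fine for a demo
}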

  10. #7
    Member
    Join Date
    May 2008
    Location
    brazil
    Posts
    163
    Thanks
    0
    Thanked 3 Times in 3 Posts
I think a good idea is for the pre-compressor to know which compressor will be used before pre-compression.

  11. #8
    Member
    Join Date
    Aug 2016
    Location
    Lisbon
    Posts
    19
    Thanks
    4
    Thanked 8 Times in 8 Posts
@SolidComp, I've read, but not practiced, most of the Matt Mahoney website/book, which covers almost all types of coding. I know Huffman, but I'm not that familiar with it. I'm a noob with the LZxx family, but describing how it works rings a bell; I know the theory of sliding windows.

One part of my idea addresses tags like div, turning them into dictionary entries. I read that XWRT does something like that. I tried to read the code on GitHub, but it's all one 2k-line file and a bit dense.

Following the earlier answers, my next step was to code a proof of concept to get people interested. I was going to use something like Huffman at the end of the dictionary replacement, or at least the concept of the fewest bytes for the most common entries.
I'm very bad with math/theory and the names of codings, etc.; I'm more of a dirty hacker: logic and some algorithms. (I have vocational training in IT systems and have worked as a programmer.)

@lunaris, or just use the best compressor for the pre-compressor's output. For example, I chose paq_kx after reading in benchmarks that it's the best for binary data. But I'm rethinking this as I change the implementation.
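
Roughly what I mean by the fewest bytes for the most common entries, as a sketch (the counts and the 255-entry cutoff are assumptions for the demo, not the real scheme):

Code:
// Rank the dictionary by occurrence count and give the top entries the
// shortest byte codes: hypothetically, 1 byte for the first 255 entries
// and 2 bytes for the rest. Counts are invented for the demo.
#include <algorithm>
#include <iostream>
#include <string>
#include <vector>

int main() {
    std::vector<std::pair<std::string, long>> counts = {
        {"the", 700000}, {"Lorem", 3}, {"of", 400000}, {"ipsum", 3}};
    std::sort(counts.begin(), counts.end(),       // most frequent first
              [](const auto& a, const auto& b) { return a.second > b.second; });
    for (size_t rank = 0; rank < counts.size(); ++rank) {
        int len = rank < 255 ? 1 : 2;             // assumed code-length scheme
        std::cout << counts[rank].first << ": rank " << rank
                  << ", code length " << len << " byte(s)\n";
    }
}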

  12. #9
    Member
    Join Date
    May 2008
    Location
    brazil
    Posts
    163
    Thanks
    0
    Thanked 3 Times in 3 Posts
    Quote Originally Posted by Cristo View Post
@SolidComp, I've read, but not practiced, most of the Matt Mahoney website/book, which covers almost all types of coding. I know Huffman, but I'm not that familiar with it. I'm a noob with the LZxx family, but describing how it works rings a bell; I know the theory of sliding windows.

One part of my idea addresses tags like div, turning them into dictionary entries. I read that XWRT does something like that. I tried to read the code on GitHub, but it's all one 2k-line file and a bit dense.

Following the earlier answers, my next step was to code a proof of concept to get people interested. I was going to use something like Huffman at the end of the dictionary replacement, or at least the concept of the fewest bytes for the most common entries.
I'm very bad with math/theory and the names of codings, etc.; I'm more of a dirty hacker: logic and some algorithms. (I have vocational training in IT systems and have worked as a programmer.)

@lunaris, or just use the best compressor for the pre-compressor's output. For example, I chose paq_kx after reading in benchmarks that it's the best for binary data. But I'm rethinking this as I change the implementation.
It's a good idea to allow choosing which compressor to use, because each compressor needs different pre-compressor options.

For example, an LZ77 pre-compressor sometimes does not work well ahead of paq compressors (the match/offset stream hides the contexts their models rely on).
    Last edited by lunaris; 21st August 2016 at 02:59.
