
Thread: EMMA - Context Mixing Compressor

  1. #301
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,423
    Thanks
    223
    Thanked 1,051 Times in 564 Posts
    > Could a moderator be so kind as to update the first post with the new link? Thanks in advance

    I added the link, but is that enough? You can edit the whole post and send me the new version.

  2. Thanks:

    mpais (5th February 2017)

  3. #302
    Member
    Join Date
    Dec 2008
    Location
    Poland, Warsaw
    Posts
    962
    Thanks
    573
    Thanked 397 Times in 295 Posts
    Great improvements. I've started to test this version.
    I have a question about the command-line mode.

    >You can now use the command line to (de)compress files:
    "C i input_file output_file", i is the index of the preset to use when compressing (0 based), and "D input_file output_directory" for decompression.

    For my testbed I use 28 different presets. I found that I could modify the EMMA.ini file and add extra presets to cover my testbed, and this allows me to automate my EMMA testing to some extent.
    But on the other hand, would it be possible to add reading the settings parameters directly from the command line?
    For example: DATA0=XXXXX, DATA1=XXXXX, EXTRA=XX

    Darek

  4. #303
    Member
    Join Date
    Feb 2016
    Location
    Luxembourg
    Posts
    523
    Thanks
    198
    Thanked 750 Times in 304 Posts
    >But from other side if it's possible to add reading setting parameters just from command line?

    Sure, but I guess you'd still have to use the GUI to edit the presets, so I thought it would be easier to just use a preset and be done with it. I didn't really want to waste too much time on this; EMMA uses 68 bits just for the options, so the number of switches needed to allow configuring every option from the command line would be huge. I'll add your way of doing it to the next version.

    Do you have any suggestions on what to improve next? Maybe some interesting file formats to support? I'm planning on revamping the 24bpp image model with some of the techniques I used on the grayscale model and going for maximum compression (within reason), but that will take a long time given my current work schedule.

    Best regards

  5. #304
    Member
    Join Date
    Dec 2008
    Location
    Poland, Warsaw
    Posts
    962
    Thanks
    573
    Thanked 397 Times in 295 Posts
    >Do you have any suggestions on what to improve next?
    I'll think about it and write back.

    As for the number of presets - could you increase the number of supported presets from 20 to, for example, 50 or 64?

  6. #305
    Member
    Join Date
    Feb 2016
    Location
    Luxembourg
    Posts
    523
    Thanks
    198
    Thanked 750 Times in 304 Posts
    >As for the number of presets - could you increase the number of supported presets from 20 to, for example, 50 or 64?

    Done, increased it to 128, you'll just have to redownload it.

  7. Thanks:

    Darek (5th February 2017)

  8. #306
    Member
    Join Date
    Dec 2008
    Location
    Poland, Warsaw
    Posts
    962
    Thanks
    573
    Thanked 397 Times in 295 Posts
    Quote Originally Posted by mpais View Post
    >As for the number of presets - could you increase the number of supported presets from 20 to, for example, 50 or 64?

    Done, increased it to 128, you'll just have to redownload it.
    Works great! I've made an EMMA.ini with all my presets and it runs properly - it gives the same scores as with manual settings.
    One idea to reconsider - when run from the command line, EMMA compresses the file and stops, but the program stays open. This prevents the next command in a batch file from running; I need to exit the program manually before the next batch command line runs.

    Maybe when EMMA is started from the command line, it should exit/return after it finishes compressing?

  9. #307
    Member
    Join Date
    Feb 2016
    Location
    Luxembourg
    Posts
    523
    Thanks
    198
    Thanked 750 Times in 304 Posts
    >Maybe when EMMA starts from command line it should exit/return from program after finishing to compress?

    Done, EMMA will exit if no error occurs, unless you stop it. Should have thought of that earlier, sorry

  10. Thanks:

    Darek (5th February 2017)

  11. #308
    Programmer schnaader's Avatar
    Join Date
    May 2008
    Location
    Hessen, Germany
    Posts
    571
    Thanks
    219
    Thanked 204 Times in 96 Posts
    About 10000 flif calls later...

    Code:
    								 flifcrush    EMMA 0.1.21 x86
    
      big_building		-R4 -T46 -M0 -D12 -X12 -Z42 -P0 -N	16.287.946	15.410.195
      big_tree		-R5 -T60 -M95 -D12 -X17 -Z51 -P0 -N	12.249.530	11.782.444
      bridge		-R3 -T55 -M205 -D1 -X9 -Z63 -P0		 5.581.163	 5.322.558
      cathedral		-R8 -T108 -M60 -D27 -X20 -Z48 -P0 -N	 2.519.051	 2.395.178
      deer			-R12 -T144 -X128 -Z128 -D1 -M5025	 5.879.039	 5.611.606
      fireworks		-R5 -T81 -M130 -D12 -X14 -Z40 -P0 -N	 1.211.060	 1.147.991
      flower_foveon		-R7 -T108 -M60 -D18 -X13 -Z38 -P0 -N	   790.002	   732.995
      hdr			-R5 -T89 -M110 -D6 -X24 -Z46 -P0 -N	 1.576.821	 1.462.115
      leaves_iso_200 	-R7 -T92 -M25 -D29 -X18 -Z38 -P0 -N	 2.639.300	 2.458.213
      leaves_iso_1600	-R3 -T37 -M5 -D13 -X20 -Z39 -P0 -N	 3.217.564	 3.073.740
      nightshot_iso_100	-R6 -T102 -M70 -D13 -X21 -Z47 -P0 -N	 1.763.188	 1.664.166
      nightshot_iso_1600	-R6 -T74 -M45 -D18 -X12 -Z52 -P0 -N	 3.482.331	 3.357.464
      spider_web		-R3 -T45 -M25 -D29 -X9 -Z27 -P0 -N	 2.306.548	 2.080.180
    
      total								59.503.543	56.498.845
    
      artificial		-R8 -T108 -M15 -D34 -X11 -Z16 -P0 -N	   376.292	   268.072
      zone_plate		-R7 -G3 -T80 -X1 -D24 -M15		 2.728.618	   159.494
    Conclusions:
    - FLIF can be improved ("pure" -e result was 61.611.877), but there's still a 5% gap to EMMA
    - flifcrush can be improved - it uses a "small step" strategy, so it tries parameters like -T17, -T18, -T19, ... - a better strategy would be to assume parabolic behaviour and do something like a binary search (see the sketch below). Also, it crashed with bad_alloc on nightshot_iso_1600 and got stuck on spider_web with a 0 byte result. I might try to write my own C++ FLIF brute force tool.
    - Digital Ocean rocks
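    A minimal sketch of that idea - a ternary search over a single option, assuming the compressed size is unimodal (parabola-like) in it; the size() callback here is a stand-in for an actual flif invocation:
    Code:
    #include <cstdint>
    #include <functional>

    // Minimize compressed size over one integer option (e.g. -T), assuming
    // the size is roughly parabolic (unimodal) in that option. Needs only
    // O(log(hi-lo)) probes instead of a one-step linear scan; a real tool
    // would memoize probes, since each one is a full flif run.
    int bestParam(int lo, int hi, const std::function<uint64_t(int)>& size) {
        while (hi - lo > 2) {
            int m1 = lo + (hi - lo) / 3;
            int m2 = hi - (hi - lo) / 3;
            if (size(m1) < size(m2)) hi = m2;  // minimum lies left of m2
            else                     lo = m1;  // minimum lies right of m1
        }
        int best = lo;                         // finish with a tiny linear scan
        for (int v = lo + 1; v <= hi; v++)
            if (size(v) < size(best)) best = v;
        return best;
    }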
    http://schnaader.info
    Damn kids. They're all alike.

  12. Thanks:

    Stephan Busch (6th February 2017)

  13. #309
    Member
    Join Date
    Feb 2016
    Location
    Luxembourg
    Posts
    523
    Thanks
    198
    Thanked 750 Times in 304 Posts
    EMMA v0.1.22, available at https://goo.gl/EDd1sS

    Code:
    Changes:
    - Fixed a bug in one of my data structures, used mostly for the image models
    - Improved the 24bpp image model by plugging part of the 8bbp grayscale model into it
    Found a rather dumb bug; it didn't prevent correct decompression, but it caused some context
    stats to get corrupted, which was hurting compression.

    So, updated results for the 8bpp grayscale model:
    Attachment 4826 Attachment 4827

    And some results with 24bpp images:
    Code:
    24 images from the Kodak Image Set, in BMP format, total for all files
    7.411.197 bytes, EMMA 0.1.21 x86
    7.292.996 bytes, EMMA 0.1.22 x86
    
    kranks.ppm, from SqueezeChart Non-Photographic Image Compression
    1.074.218 bytes, EMMA 0.1.21 x86
    1.066.849 bytes, EMMA 0.1.22 x86
    
    painting-acryl.ppm, from SqueezeChart Non-Photographic Image Compression
    16.998.751 bytes, EMMA 0.1.21 x86
    16.751.495 bytes, EMMA 0.1.22 x86
    
    7 images from SqueezeChart Photographic Image Compression, in PPM format, total for all files
    89.821.694 bytes, EMMA 0.1.21 x86
    88.627.928 bytes, EMMA 0.1.22 x86
    
    rafale.bmp, from Maximum Compression benchmark
    505.497 bytes, EMMA 0.1.21 x86
    496.141 bytes, EMMA 0.1.22 x86
    The x64 version may give slightly better results due to the increased memory usage.

  14. Thanks (4):

    Darek (24th February 2017),encode (17th May 2017),Hacker (12th February 2017),Mike (12th February 2017)

  15. #310
    Member
    Join Date
    Dec 2008
    Location
    Poland, Warsaw
    Posts
    962
    Thanks
    573
    Thanked 397 Times in 295 Posts
    Version 0.1.22 gets another gain on my testbed and the SILESIA corpus.
    Especially for BMP and TGA files:
    Code:
    1.BMP
    230'651 bytes, EMMA 0.1.21 x64
    225'879 bytes, EMMA 0.1.22 x64 - best score ever, this version beats the old record set by GRALIC v1.7
    
    B.TGA
    375'623 bytes, EMMA 0.1.21 x64
    366'424 bytes, EMMA 0.1.22 x64 - best score ever, the previous best score was also from EMMA
    
    SILESIA
    34'225'709 bytes, EMMA 0.1.21 x64 -  pure, not precompressed files
    34'216'811 bytes, EMMA 0.1.22 x64 -  pure, not precompressed files
    I've also tested ENWIK8 and ENWIK9. The best settings for the new ppmd, tested on ENWIK8, were 1024MB of memory and order 14. With these parameters I've got:

    pure scores:
    ENWIK8: 17'236'798 bytes, time 15150,8s
    ENWIK9: 140'622'734 bytes, time 114208,8s

    DRT preprocessed:
    ENWIK8: 16'679'420 bytes, time 9770,7s
    ENWIK9: 135'313'023 bytes, time 86548,7s

    and a special optimal case - the preprocessed DRT ENWIK9_4123 file (the preprocessed file split and merged in a different order - see the previous EMMA score on LTCB):
    ENWIK8: 16'679'420 bytes, time 9770,7s
    ENWIK9: 135'169'967 bytes, time 86187,0s - 1MB better than the previous best score!

    Decompressor batch file size (attached in post) = 1'024'652 bytes compressed by 7zip.

    Then the final score for LTCB with my zip is: 136'194'619 bytes.

    Other information for LTCB input:
    System - Core i7 4900MQ 2.8GHz overclocked to 3.8GHz, 32GB, Win7Pro 64
    Memory used: 3824MB
    EMMA 1.22 settings: all settings = MAX, except: image and audio models = off, use fast mode on long matches = off, xml=on, x86model=off, x86 exe code = off, delta coding = off, dictionary = off, ppmd memory = 1024, ppmd order = 14

    Regards,
    Darek
    Attached Files
    Last edited by Darek; 24th February 2017 at 19:50.

  16. Thanks:

    mpais (24th February 2017)

  17. #311
    Member
    Join Date
    Dec 2008
    Location
    Poland, Warsaw
    Posts
    962
    Thanks
    573
    Thanked 397 Times in 295 Posts
    @Matt - could you post this result to LTCB? I think all the information is included above. Maybe it's not a position change, but I think it's quite a big improvement for EMMA.

    @Mpais - some of my ideas for further EMMA development (don't bother if they are stupid):

    1. improve bit audio model - I know it's good now, however optimfrog can in some cases (my testbed file) get slightly better results - 4%, for example.
    2. improve text model - yes, precompression is not a good option for one-pass/stream compression like EMMA, but maybe there are some tuning methods to improve the LTCB scores?
    3. recognizing text parts inside bigger files - if there is no such option already implemented,
    4. there is something strange with two files from my testbed in EMMA compression:
    --- H.EXE - best EMMA score = 472'642 while CMIX got 447'314 (6% better). EMMA generally has very good exe compression due to the x86 filter and x86/x64 model, but for this file it got a worse score than other CM compressors
    --- Q.WK3 - best EMMA score = 190'158 while CMIX got 166'294 (14% better), but it's possible to get below 15x'xxx bytes, which would mean a 20%+ advantage over EMMA. It looks like the learning algorithm doesn't work as efficiently on this file as it does in other CM compressors...
    5. PDF parser - if possible
    6. TAR files parser - if possible
    7. as you wrote in an earlier post - two-pass compression in the future, which would allow EMMA to parse some additional standards (like some TIFF files) or use text transforms.
    8. please don't laugh if it's silly - this is probably also an idea for two-pass compression - maybe, instead of trying to predict the next bytes of the file, there is a way to restructure/reorganize the file (file parsing) so as to make it more predictable to the model. For example, turn a very diverse file structure into something smoother and less variable, with softer or slower changes. A good example is splitting ENWIK9 and reordering it into 4+1+2+3 parts (from less compressible to more), which adds some gain versus the original order.

    Regards,
    Darek

  18. #312
    Member
    Join Date
    Feb 2016
    Location
    Luxembourg
    Posts
    523
    Thanks
    198
    Thanked 750 Times in 304 Posts
    >>1. improve bit audio model

    I plan on doing it, it's just a low priority for me.

    >>2. improve text model

    It's what I'm doing currently, Shelwien talked me into shifting my focus from image compression to text compression.

    >>3. recognizing text parts inside bigger files - if there is no such option already implemented,

    That would be better suited to the preprocessing stage.

    >>4. there is something strange with two files from my testbed in EMMA compression:

    Well, H.EXE is a 16-bit executable file (Aldus PhotoStyler for Win3.1), so my model probably goes completely crazy since it tries
    to interpret the stream as a sequence of either x86 or x64 instructions. The models in paq8 and cmix just try to linearly get some
    simple contexts at several of the previous offsets, so they probably end up getting much better stats.

    As for Q.WK3, is it a Lotus 1-2-3 Spreadsheet File? I'd guess it's either my sparse model (really simple, not good at all) or the
    heuristic I use for my record model that is failing to detect the correct record length.
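    For illustration, a common heuristic for guessing a record length (not necessarily what EMMA does) is to vote on the distances at which byte values recur - fixed-size records make one stride dominate:
    Code:
    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // For each position, measure the distance to the previous occurrence of
    // the same byte value and count how often each candidate stride occurs.
    // Tabular data with fixed-size records makes one stride stand out.
    size_t guessRecordLength(const std::vector<uint8_t>& data, size_t maxLen = 512) {
        std::vector<size_t> lastPos(256, 0);        // position+1; 0 = unseen
        std::vector<uint64_t> votes(maxLen + 1, 0);
        for (size_t i = 0; i < data.size(); i++) {
            size_t prev = lastPos[data[i]];
            if (prev > 0) {
                size_t dist = i - (prev - 1);
                if (dist >= 2 && dist <= maxLen) votes[dist]++;
            }
            lastPos[data[i]] = i + 1;
        }
        size_t best = 0;                            // 0 = no stride detected
        for (size_t len = 2; len <= maxLen; len++)
            if (votes[len] > votes[best]) best = len;
        return best;
    }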

    >>5. PDF parser - if possible

    Again, better suited to the preprocessing stage. Embedded JPEGs are already detected, and most other interesting embedded
    streams that can be recompressed would need preprocessing.

    >>6. TAR files parser - if possible

    TAR stores files uncompressed, so as long as the other parsers can detect the content, I don't see the point. EMMA doesn't use
    file extensions to identify file types.

    >>7. as you wrote in an earlier post - two-pass compression in the future, which would allow EMMA to parse some additional standards (like some TIFF files) or use text transforms.

    I'll try to do it once I'm satisfied with my models. I've just recently started considering rewriting the models for maximum
    compression (within reason), like the new grayscale image model. Up until now I've just relied on the boost that the models
    get from ludicrous mode and on a few simple tweaks, but I guess if I'm doing a C.M. compressor I might as well make it as
    good as I can.

    >>8. please don't laugh if it's silly - this is probably also an idea for two-pass compression - maybe, instead of trying to predict the next bytes of the file, there is a way to restructure/reorganize the file (file parsing) so as to make it more predictable to the model. For example, turn a very diverse file structure into something smoother and less variable, with softer or slower changes. A good example is splitting ENWIK9 and reordering it into 4+1+2+3 parts (from less compressible to more), which adds some gain versus the original order.

    If I do the preprocessing stage (similar to Precomp: recursively detect and decompress known formats) to do block segmentation,
    then coupled with the online detection routines I already have in EMMA, it should give some nice gains.

    For data that uses special models, I already only use those models when a parser detects that data type, so as far as the models know,
    they see it as a contiguous stream. The question would then be about data in a decompressed block, such as an embedded deflate stream,
    which doesn't contain any recognizable formats. You could use order-0 stats to try ordering "unknown data" blocks, but that would most
    likely not give very good results.
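    For reference, "order-0 stats" here would just be per-block byte histograms - e.g. ranking blocks by their order-0 entropy, as in this rough sketch:
    Code:
    #include <cmath>
    #include <cstddef>
    #include <cstdint>

    // Order-0 entropy of a block, in bits per byte: a crude compressibility
    // estimate that could be used to sort "unknown data" blocks before coding.
    double order0Entropy(const uint8_t* block, size_t n) {
        if (n == 0) return 0.0;
        uint64_t freq[256] = {0};
        for (size_t i = 0; i < n; i++) freq[block[i]]++;
        double h = 0.0;
        for (int s = 0; s < 256; s++) {
            if (!freq[s]) continue;
            double p = double(freq[s]) / double(n);
            h -= p * std::log2(p);
        }
        return h;
    }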

    I currently have a model for ARM executables "on-hold" (probably not much interest there), and had thought of making models for PNG
    and MP3 recompression, but those would be really complex and probably not get very good results (like the GIF model). And I should
    probably see if there are some more file formats that could use the existing models, so it would just be a matter of writing a few parsers.

    Best regards

  19. Thanks:

    Darek (14th March 2017)

  20. #313
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,423
    Thanks
    223
    Thanked 1,051 Times in 564 Posts
    >> 1. improve bit audio model
    > I plan on doing it, it's just a low priority for me.

    I'd appreciate it if you could think about my LPC mixing/"probability translation" idea.
    My default implementation needs N_Models*2 tree lookups per data bit, so it's immediately
    very slow even without SSE and advanced mixing.
    But I think that there should be some smart solution for this.
    For example, mixing becomes fast enough if we used plain tables with cumulative frequencies
    instead of binary trees. But then it's hard to update the tables.

    Maybe I should post a coder using this method - like, with multiple instances of
    Florin Ghido's paq audio model or something.
    I actually have two such coders online, but beating optimfrog compression needs
    individual LPCs to be much smarter than what I have.

    >> 2. improve text model
    > It's what I'm doing currently, Shelwien talked me into shifting my focus from image compression to text compression.

    Yes, text is the most interesting data type of the common ones.
    That is, with text it's actually possible to reach some reasonably high complexity level.
    Images and audio are harder simply because they'd (ideally) need a text submodel anyway,
    as there're audiobooks and book scans.
    Taking this into account, a really advanced image compressor would be impractically slow,
    so they never reach that stage.

    However, maybe it would be a good idea to make a new _console_ coder for experiments
    with text compression. Console, because GUI actually makes testing harder,
    and new, because it'd allow making this model incompatible with the generic bitwise coding method.

    For example, we could make a model for binary word indexes, instead of workarounds like WRT.
    And then, how about having word indexes for left-to-right and right-to-left sorting orders,
    and coding interleaved bits of each index, until it becomes possible to identify an individual word.
    (I guess it's also possible to use this method to build WRT codes.)
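    As a toy illustration of that interleaving (a hypothetical layout, not a finished design): give each word one index in the normally-sorted vocabulary and another in the vocabulary sorted by reversed words, then emit one bit of each index alternately, so prefix and suffix information narrow down the word together:
    Code:
    #include <algorithm>
    #include <cstdint>
    #include <string>
    #include <vector>

    // Returns the interleaved MSB-first bits of the word's two indexes.
    // A real coder would model each bit in context and stop as soon as the
    // (i1, i2) bit prefixes identify a unique word.
    std::vector<int> interleavedBits(const std::string& word,
                                     std::vector<std::string> vocab) {
        std::sort(vocab.begin(), vocab.end());      // left-to-right order
        std::vector<std::string> rev(vocab);
        for (auto& w : rev) std::reverse(w.begin(), w.end());
        std::sort(rev.begin(), rev.end());          // right-to-left order

        std::string wr(word.rbegin(), word.rend());
        uint32_t i1 = std::lower_bound(vocab.begin(), vocab.end(), word) - vocab.begin();
        uint32_t i2 = std::lower_bound(rev.begin(), rev.end(), wr) - rev.begin();

        std::vector<int> bits;
        for (int b = 31; b >= 0; b--) {
            bits.push_back((i1 >> b) & 1);
            bits.push_back((i2 >> b) & 1);
        }
        return bits;
    }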

    Anyway, the point is that a paq-like bitwise framework is bad for text compression -
    for example, encoding of bit7 is simply redundant in many cases.

    > thought of making models for PNG and MP3 recompression, but those would be
    > really complex and probably not get very good results (like the GIF model).

    If you're not intending to turn Emma into a commercial project at some point,
    there're already dlls for mp3 recompression, and for raw deflate.
    (I can build the dlls, if they aren't already built).
    http://nishi.dreamhosters.com/u/mp3rw_v1.rar (mp3 detector)
    http://nishi.dreamhosters.com/u/mpzapi_v1b.rar
    http://nishi.dreamhosters.com/u/rawfilt_v1l2.rar (new integrated reflate)
    http://nishi.dreamhosters.com/u/reflate_0c3a.rar (old reflate with raw2hif dlls)

  21. Thanks (2):

    Bulat Ziganshin (15th March 2017),Darek (14th March 2017)

  22. #314
    Member
    Join Date
    Feb 2016
    Location
    Luxembourg
    Posts
    523
    Thanks
    198
    Thanked 750 Times in 304 Posts
    Well, I guess I should go into detail on what I'm doing with EMMA, you always seem to have great insights Shelwien.
    I'm currently getting a 1.15% relative gain on enwik6 with just a few improvements, though I haven't tested with enwik8 yet.

    I've made a stemming engine for EMMA, currently with a single stemmer for english. In my research into the topic, I found
    a widely used suffix stemmer (Porter2), but it is somewhat simplistic, possibly for fear of overstemming. So I created my own
    stemmer and then merged both. This highly modified Porter2 affix stemmer provides me with very useful information for my
    model since, aside from the word stem, it allows me to classify words (nouns, verbs, adjectives, adverbs..). Example output
    from the stemmer (original word, stem [deduced meaning]):

    Code:
    comprehend, comprehend
    comprehended, comprehend [PastTense]
    comprehender, comprehend
    comprehendible, comprehend [Capable of]
    comprehending, comprehend [PresentParticiple]
    comprehendingly, comprehend [Manner of][PresentParticiple]
    comprehends, comprehend [Plural]
    comprehense, comprehens
    comprehensibility, comprehens [Attribute/Quality of][Manner of][Capable of]
    comprehensible, comprehens [Capable of]
    comprehensibleness, comprehens [State/Condition of][Capable of]
    comprehensibly, comprehens [Manner of][Capable of]
    comprehension, comprehens [Action/Process/Result]
    comprehensive, comprehens [Tendency/Disposition to]
    comprehensively, comprehens [Manner of][Tendency/Disposition to]
    comprehensiveness, comprehens [State/Condition of][Tendency/Disposition to]
    comprehensives, comprehens [Plural][Tendency/Disposition to]
    The plan is to create several stemmers, and when a word is found, the engine will ask every stemmer to stem it, and record
    whether each stemmer recognized the word. This, combined with some common stats about a few of the most used letters/symbols
    for each language, should allow the engine to have a reasonably good hunch about which language is being processed, and
    it can then select which stemmer should provide the contextual stats.
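    A minimal sketch of that selection logic, assuming a hypothetical Stemmer interface with a recognition flag - an exponential moving average of each stemmer's hit rate picks the active language:
    Code:
    #include <cstddef>
    #include <string>
    #include <utility>
    #include <vector>

    // Hypothetical interface: stem() returns the stem and reports whether
    // the word matched the language's affix rules.
    struct Stemmer {
        virtual std::string stem(const std::string& word, bool& recognized) = 0;
        virtual ~Stemmer() = default;
    };

    // Every stemmer votes on every word; a moving average of the recognition
    // rate decides which stemmer supplies the contextual stats.
    struct StemmingEngine {
        std::vector<Stemmer*> stemmers;
        std::vector<double>   hitRate;   // one moving average per language

        explicit StemmingEngine(std::vector<Stemmer*> s)
            : stemmers(std::move(s)), hitRate(stemmers.size(), 0.0) {}

        size_t bestLanguage() const {
            size_t best = 0;
            for (size_t i = 1; i < hitRate.size(); i++)
                if (hitRate[i] > hitRate[best]) best = i;
            return best;
        }

        std::string process(const std::string& word) {
            std::vector<std::string> stems(stemmers.size());
            for (size_t i = 0; i < stemmers.size(); i++) {
                bool ok = false;
                stems[i] = stemmers[i]->stem(word, ok);
                hitRate[i] = 0.99 * hitRate[i] + (ok ? 0.01 : 0.0);
            }
            return stems[bestLanguage()];  // stem from the likeliest language
        }
    };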

    I've also made the model aware of possible syllabification patterns, and I'm working on modelling parentheses, which are widely
    used in enwik*.

    I also plan on modifying the XML model to be aware of HTML entities in the tag content, like you suggested, that might also help
    on enwik*.

    >However, maybe it would be a good idea to make a new _console_ coder for experiments
    >with text compression. Console, because GUI actually makes testing harder,
    >and new, because it'd allow making this model incompatible with the generic bitwise coding method.

    Sure, I've also considered making a console version of EMMA, since Darek seems intent on always
    testing it for the LTCB, so that might give him an extra 300KB gain
    And a new coder just for text is certainly enticing, but at most I'd just help out on it, I barely have
    time for EMMA as it is. I'd still like to try some of my models in paq8 and cmix, and write some
    proper documentation, I'm almost at 40.000 LOC and I'd hate to come back to this in a few years
    and have no idea why I did something like this or like that.

    >If you're not intending to turn Emma into a commercial project at some point,
    >there're already dlls for mp3 recompression, and for raw deflate.

    EMMA is just a little side project for fun, I have no aspirations for it other than researching and
    learning new techniques. I believe the knowledge gained from it might have value, but I honestly
    don't see a market for another commercial compression software, not when you have such great
    open-source alternatives for general purpose usage.
    Last edited by mpais; 14th March 2017 at 23:27.

  23. Thanks (2):

    Darek (14th March 2017),Shelwien (15th March 2017)

  24. #315
    Member
    Join Date
    Dec 2008
    Location
    Poland, Warsaw
    Posts
    962
    Thanks
    573
    Thanked 397 Times in 295 Posts
    Extra 300kb of gain on enwik9 sounds nice...

  25. #316
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,423
    Thanks
    223
    Thanked 1,051 Times in 564 Posts
    @Darek: he meant enwik8 I think

  26. Thanks:

    Darek (15th March 2017)

  27. #317
    Member
    Join Date
    Oct 2010
    Location
    Germany
    Posts
    287
    Thanks
    9
    Thanked 33 Times in 21 Posts
    Quote Originally Posted by Shelwien View Post
    >> 1. improve bit audio model
    > I plan on doing it, it's just a low priority for me.

    I'd appreciate it if you could think about my LPC mixing/"probability translation" idea.
    My default implementation needs N_Models*2 tree lookups per data bit, so it's immediately
    very slow even without SSE and advanced mixing.
    But I think that there should be some smart solution for this.
    For example, mixing becomes fast enough if we used plain tables with cumulative frequencies
    instead of binary trees. But then it's hard to update the tables.
    Do you have a link for this? Or do you care to elaborate more?
    OptimFrog is really good on some files almost independently of the prediction used, because it uses special probability distributions for the input PCM values.
    For example "male_speech.wav" from http://rarewares.org/test_samples/

  28. #318
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,423
    Thanks
    223
    Thanked 1,051 Times in 564 Posts
    > Do you have a link for this?

    Well, here's my coder for calgary geo file:
    http://nishi.dreamhosters.com/u/FLTgeo3b2.rar
    See process.inc etc

    > Or do you care to elaborate more?

    Why, it's really simple.
    1. Mixing is good
    2. Analog data are compressed with LPC models.
    3. In LPC, the entropy model is applied to residuals,
    so it's impossible to directly mix multiple LPC models.
    4. So we have to translate the probability distributions to the same space.

    For example, if LPC1 gives prediction d1, and residual 0 has probability p1,
    then it could be mixed with p2, which is the probability of residual d1-d2 of LPC2.

    The question is how to implement this efficiently.
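    A minimal sketch of step 4, with a toy discretized-Laplacian residual distribution standing in for a real LPC entropy model (everything here is illustrative, and the mixing weights are assumed to sum to 1):
    Code:
    #include <cmath>
    #include <cstddef>
    #include <cstdlib>
    #include <vector>

    // One LPC submodel: a predicted next sample plus a residual distribution
    // (a discretized Laplacian here, just to make the sketch self-contained).
    struct LpcModel {
        int prediction;   // predicted next sample
        double a;         // residual decay, 0 < a < 1
        double residualProb(int r) const {
            return (1.0 - a) / (1.0 + a) * std::pow(a, std::abs(r));
        }
    };

    // "Probability translation": under model m, the probability that the next
    // sample equals 'sample' is the probability of residual sample-prediction.
    // Once all models speak in the same sample-value space, they can be mixed
    // like ordinary context models.
    double mixedSampleProb(const std::vector<LpcModel>& models,
                           const std::vector<double>& weights, int sample) {
        double p = 0.0;
        for (size_t m = 0; m < models.size(); m++)
            p += weights[m] * models[m].residualProb(sample - models[m].prediction);
        return p;
    }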

    > OptimFrog is really good on some files almost independently of the prediction used,
    > because it uses special probability distributions for the input PCM values.

    My main problem with audio models was always the high resolution.
    If we reduced it down to 4kHz or so, my CM would probably beat optimfrog.
    But normally it's necessary to take into account 100s of previous samples, which can't be done
    with a dumb context model.

  29. #319
    Member
    Join Date
    Feb 2016
    Location
    Luxembourg
    Posts
    523
    Thanks
    198
    Thanked 750 Times in 304 Posts
    @Darek

    On the LTCB, the decompressor size is taken into account. Creating a simple console decompressor should reduce the size of your zip considerably,
    hence giving a better score for the same enwik9 compressed file size.

  30. #320
    Member
    Join Date
    Dec 2008
    Location
    Poland, Warsaw
    Posts
    962
    Thanks
    573
    Thanked 397 Times in 295 Posts
    Quote Originally Posted by mpais View Post
    @Darek

    On the LTCB, the decompressor size is taken into account. Creating a simple console decompressor should reduce the size of your zip considerably,
    hence giving a better score for the same enwik9 compressed file size.
    That would be great! Right now the zip file compressed by me is about 1MB; Matt's zip is about 1.3MB...

  31. #321
    Member
    Join Date
    Oct 2016
    Location
    Slovakia
    Posts
    22
    Thanks
    37
    Thanked 3 Times in 3 Posts
    Is there a limit on file sizes? I wanted to try EMMA out on a ~31 GB file but it said it was too large?
    Thanks.

  32. #322
    Member
    Join Date
    Feb 2016
    Location
    Luxembourg
    Posts
    523
    Thanks
    198
    Thanked 750 Times in 304 Posts
    Yes, I've limited EMMA to 4GB maximum file size (unsigned 32 bit integer), and honestly even that seems too much.
    EMMA is too slow, even on the "Fast" preset. Its purpose is only to serve as a testing ground for new ideas; it's not
    suitable for any real usage.

    If you really want to compress such huge files, you can always divide them into smaller chunks and compress each chunk
    separately, but for a 31GB file, you'll be waiting a long, long time for the results.
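    For what it's worth, the chunking workaround is just plain file splitting; a trivial sketch (rejoining the decompressed parts is simple concatenation):
    Code:
    #include <algorithm>
    #include <cstdio>
    #include <string>
    #include <vector>

    // Split a huge file into numbered parts below the 4GB limit, so each
    // part can be compressed as a separate file.
    bool splitFile(const std::string& path,
                   unsigned long long chunkSize = 1ULL << 31) {  // 2GB parts
        FILE* in = std::fopen(path.c_str(), "rb");
        if (!in) return false;
        std::vector<char> buf(1 << 20);                          // 1MB buffer
        size_t n = std::fread(buf.data(), 1, buf.size(), in);
        for (int part = 0; n > 0; part++) {
            std::string name = path + "." + std::to_string(part);
            FILE* out = std::fopen(name.c_str(), "wb");
            if (!out) { std::fclose(in); return false; }
            unsigned long long written = 0;
            while (n > 0 && written < chunkSize) {
                std::fwrite(buf.data(), 1, n, out);
                written += n;
                size_t want = (size_t)std::min<unsigned long long>(
                    buf.size(), chunkSize - written);
                n = want ? std::fread(buf.data(), 1, want, in) : 0;
            }
            std::fclose(out);
            if (n == 0 && written == chunkSize)   // chunk filled exactly:
                n = std::fread(buf.data(), 1, buf.size(), in);  // more left?
        }
        std::fclose(in);
        return true;
    }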

  33. Thanks:

    Hacker (17th March 2017)

  34. #323
    Member
    Join Date
    Oct 2016
    Location
    Slovakia
    Posts
    22
    Thanks
    37
    Thanked 3 Times in 3 Posts
    Thank you for the explanation. I have survived testing ZPAQ and NanoZip and others against this data; I would survive compressing with EMMA, too.

    I would appreciate it if you could lift the limitation in some new version, though, since, honestly, nothing compresses my ARW and RW2 files like EMMA does - about 60% of the original size at 170 KB/s for the ARW's and about 55% at 210 KB/s for the RW2's. These are huge savings, especially considering I have 2 TB of them so far.

    I understand EMMA is experimental and not meant for any production use but I don't really need much, as long as the same .exe can unpack the same archive 10 years later, even if only in a VM.

  35. #324
    Member
    Join Date
    Feb 2016
    Location
    Luxembourg
    Posts
    523
    Thanks
    198
    Thanked 750 Times in 304 Posts
    I've briefly discussed this with Stephan Busch, that a dedicated "ARW\RW2\other raw images" compressor (something like a "PackRAW") would probably get ratios close to EMMA and be much, much faster. On RW2 files, EMMA's model is sub-par, because the dumb obfuscation used by Panasonic hurts a streaming compressor, so the ratio would probably be even better.

    If there was interest in such a project, I'd gladly help, but it is unfortunately too much to take on all by myself for the moment.

    I'll see about changing the file size field to 64 bits, but I'll probably keep at least a warning when processing such large files; I don't want people to think EMMA is stable and that any files they back up with it will be decompressible with later versions.

  36. Thanks:

    Hacker (17th March 2017)

  37. #325
    Member
    Join Date
    Oct 2016
    Location
    Slovakia
    Posts
    22
    Thanks
    37
    Thanked 3 Times in 3 Posts
    Thank you again. Yes, a RAW compressor would be very much welcome and there certainly is interest (I'd imagine professional photographers wouldn't mind having their image file sizes cut losslessly by half, either); unfortunately, my skills are not high enough to be of any help except for testing. I have 1.32 TB of ARW files, 705 GB of RW2 files and 41 GB of CR2 files at my disposal, so if anyone decides to take up this project, I am ready.

  38. #326
    Member
    Join Date
    Mar 2016
    Location
    Croatia
    Posts
    184
    Thanks
    77
    Thanked 12 Times in 11 Posts
    Same here, more than 40% of my drive is filled with RW2 files, I don't want to lose the originals :|

  39. #327
    Member
    Join Date
    Feb 2016
    Location
    Luxembourg
    Posts
    523
    Thanks
    198
    Thanked 750 Times in 304 Posts
    EMMA v0.1.23, available at goo.gl/EDd1sS

    Code:
    Changes:
    - Improved the text model
    I don't have much time presently for improving EMMA, so I decided to release a new version with
    the improvements I already made to the text model (described in an earlier post). The 4GB limit
    is still present, it should just be a case of changing some variables to 64 bit integers, but I'll have
    to test all of the parsers again, in every corner case, with a >4GB Tar file containing all my
    parsing test files, and check decompression. That will likely take over a week, so it'll have to
    wait, sorry.

    The result for enwik8.drt is 16.516.209 bytes, so a little under 1% relative improvement.
    I haven't tested with enwik9.

  40. Thanks (5):

    Darek (30th April 2017),encode (17th May 2017),Hacker (30th April 2017),Mike (30th April 2017),Shelwien (30th April 2017)

  41. #328
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,423
    Thanks
    223
    Thanked 1,051 Times in 564 Posts
    I presume that "earlier post" is https://encode.su/threads/?p=52080&pp=1 .

    Emma archives still contain an old v3 mod_ppmd dll, which has a bug.
    See https://encode.su/threads/2515-mod_p...ll=1#post52046
    The dll can be replaced with the newer v4, or I can build a patched v3, if v4 somehow hurts compression.

  42. Thanks (2):

    Darek (30th April 2017),mpais (1st May 2017)

  43. #329
    Member
    Join Date
    Dec 2008
    Location
    Poland, Warsaw
    Posts
    962
    Thanks
    573
    Thanked 397 Times in 295 Posts
    Quote Originally Posted by mpais View Post
    The result for enwik8.drt is 16.516.209 bytes, so a little under 1% relative improvement.
    Which settings did you use?
    I've got 16'523'517 with my settings - long=off, rest=max, models=off, xml=on, x86model=off, x86 exe code = off, delta coding = off, dictionary = off, ppmd 1024, 14

    Darek

  44. #330
    Member
    Join Date
    Feb 2016
    Location
    Luxembourg
    Posts
    523
    Thanks
    198
    Thanked 750 Times in 304 Posts
    @Shelwien
    Yes, that was the post I was talking about. I've done a few quick tests with mod_ppmd_v4, compression seems slightly worse than with v3.
    I'll run a few more tests and I'll update the archives with this new version.

    @Darek
    I used the English dictionary, but looking at your result it seems the dictionaries are pointless, almost no gain at all.

  45. Thanks:

    Darek (1st May 2017)
