
Thread: Advice for compression of flat text files?

  1. #1
    Member
    Join Date
    Oct 2015
    Location
    Atlanta, GA
    Posts
    2
    Thanks
    0
    Thanked 0 Times in 0 Posts

    Advice for compression of flat text files?

    Hi all! Does anyone have advice on good options for compressing text files that contain data? It looks like a lot of tools are tuned for free-form text, and perhaps one can do better when the structure is known to be more regular. More specifically:

    - I'd like to be able to compress both csv and fixed-width files without too much tweaking needed (worst case shouldn't be too bad)
    - Say 500MB/file on average? With some bigger, some smaller
    - csv might have a mix of short strings and numbers, comma separated, e.g. something like
    AB,12:15:01.27,27.0,27.1,4000,,I
    AB,12:15:01.28,27.1,27.1,2000,,B
    ACCC,9:00:01.03,144.12,154.22,22000,10,B
    - fixed width might have similar data, except line widths would be the same: e.g. something like
    200300000000021171002261NA 010339527.00661200551 3
    200300000000021113003003NA 010339532.00772102401 4
    - maybe also csv files that contain longer strings? e.g.
    A,20,"we wanna compress both csv and fixed-width file"
    B,30,"lsdjf ldsajf!!!"

    As for the software itself:
    - Should work as a linux command-line utility
    - Ideally would support stream compression (so that it could be used as gzip replacement easily)
    - Very strongly prefer open source
    - Single-file archives only
    - Fast decoding is a priority: it shouldn't be much slower than gzip. If the tool supports multi-threading, it's OK if its decoding speed is comparable to gzip when using up to ~6 threads. Encoding speed is not too important unless it's ridiculously slow (let's say no more than 30x slower than gzip?)

    So far I've used gzip and lbzip2 (which is much slower but OK when multithreaded).
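
    To make the "regular structure" idea concrete, here is a minimal sketch of one possible preprocessing step: rewriting the csv column by column before handing it to a general-purpose compressor, so that similar values sit next to each other. This is only an illustration in Python using the standard library; `transpose_csv` and `SAMPLE` are names made up for this example, zlib stands in for gzip, and whether the transposed form actually compresses better has to be measured on the real files.

```python
import csv
import io
import zlib

# Sample rows copied from the post above.
SAMPLE = (
    "AB,12:15:01.27,27.0,27.1,4000,,I\n"
    "AB,12:15:01.28,27.1,27.1,2000,,B\n"
    "ACCC,9:00:01.03,144.12,154.22,22000,10,B\n"
)


def transpose_csv(text: str) -> str:
    """Rewrite row-major csv as column-major: one output line per input column.

    Hypothetical helper for illustration; it loads the whole input into memory.
    """
    rows = list(csv.reader(io.StringIO(text)))
    width = max(len(row) for row in rows)
    # Pad short rows with empty fields so every row has the same width.
    padded = [row + [""] * (width - len(row)) for row in rows]
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for column in zip(*padded):
        writer.writerow(column)
    return out.getvalue()


if __name__ == "__main__":
    columnar = transpose_csv(SAMPLE)
    print(columnar)
    # Compare compressed sizes; three rows are far too few to draw conclusions.
    for label, text in (("row-major", SAMPLE), ("column-major", columnar)):
        print(label, len(zlib.compress(text.encode("utf-8"), 9)))
```

    Applying `transpose_csv` a second time restores the original text as long as every row has the same number of fields and the quoting is the csv module's default. For 500MB files a real tool would presumably transpose block by block rather than reading everything at once.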

    Thanks!

  2. #2
    Member
    Join Date
    Jun 2015
    Location
    Switzerland
    Posts
    802
    Thanks
    229
    Thanked 290 Times in 172 Posts
    What is wrong with lbzip2 for your use case?

    Try brotli with a 16 MB window (with --window 24). You probably want to try it with either --quality 11 or --quality 9.

    All brotli implementations I know are single-threaded, but brotli's encoder structure supports multi-threaded encoding; it just hasn't been implemented yet.

  3. #3
    Member
    Join Date
    Mar 2013
    Location
    Worldwide
    Posts
    565
    Thanks
    67
    Thanked 198 Times in 147 Posts
    You can test almost all popular or notable compressors with TurboBench for Windows and Linux.
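
    For a quick first look before setting up a full benchmark, the same kind of comparison can be sketched with the codecs that ship with Python. This is not TurboBench and does not cover brotli or LZHAM; it is a rough stand-in using only zlib, bz2 and lzma from the standard library, and `CODECS` and `bench` are names invented for this example. It reports the compressed size and the time of a single decode pass.

```python
import bz2
import lzma
import time
import zlib

# Name -> (compress function, decompress function); standard library only.
CODECS = {
    "zlib-6": (lambda d: zlib.compress(d, 6), zlib.decompress),
    "bz2-9": (lambda d: bz2.compress(d, 9), bz2.decompress),
    "lzma-6": (lambda d: lzma.compress(d, preset=6), lzma.decompress),
}


def bench(data: bytes) -> dict:
    """Return {codec name: (compressed size in bytes, decode seconds)}."""
    results = {}
    for name, (encode, decode) in CODECS.items():
        packed = encode(data)
        start = time.perf_counter()
        restored = decode(packed)
        elapsed = time.perf_counter() - start
        # Make sure the round trip is lossless before trusting the numbers.
        if restored != data:
            raise ValueError(f"{name} failed to round-trip the input")
        results[name] = (len(packed), elapsed)
    return results


if __name__ == "__main__":
    line = b"AB,12:15:01.27,27.0,27.1,4000,,I\n"
    data = line * 100_000  # artificially repetitive; use a real file instead
    for name, (size, seconds) in bench(data).items():
        print(f"{name}: {size} bytes, decode {seconds:.4f} s")
```

    A single timed pass is noisy, so the timings are only a rough guide; repeat the decode several times on a real file before comparing codecs.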
    Last edited by dnd; 23rd October 2015 at 13:33.

  4. #4
    Member
    Join Date
    Oct 2015
    Location
    Atlanta, GA
    Posts
    2
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Thanks!
    And nothing particularly wrong with lbzip2, I was just curious if there might be other options (or compression parameters, etc.) to consider.
    BTW, one thing I noticed about lbzip2 is that the compression ratio sometimes changes significantly if the columns in the .csv file are reordered, with no other modifications.
    I'll definitely check out brotli, and look at what turbobench does.
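
    The column-order effect can be checked with a small script. The sketch below uses Python's standard bz2 module as a stand-in for lbzip2 (same bzip2 format, but not the same program), and `reorder_columns` and `bz2_size` are helper names made up for this example. A plausible explanation, offered as a guess rather than a measured fact, is that bzip2 is block-sorting, so the bytes adjacent to each field act as its context, and changing which columns are neighbours changes those contexts.

```python
import bz2
import csv
import io


def reorder_columns(text: str, order: list) -> str:
    """Return the csv text with its columns rearranged to the given index order."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row in csv.reader(io.StringIO(text)):
        writer.writerow([row[i] for i in order])
    return out.getvalue()


def bz2_size(text: str) -> int:
    """Size in bytes of the text after bzip2 compression at level 9."""
    return len(bz2.compress(text.encode("utf-8"), 9))


if __name__ == "__main__":
    sample = (
        "AB,12:15:01.27,27.0,27.1,4000,,I\n"
        "AB,12:15:01.28,27.1,27.1,2000,,B\n"
        "ACCC,9:00:01.03,144.12,154.22,22000,10,B\n"
    )
    # Try the original order and one alternative; real files are needed
    # to see a meaningful difference.
    for order in ([0, 1, 2, 3, 4, 5, 6], [0, 6, 5, 4, 2, 3, 1]):
        print(order, bz2_size(reorder_columns(sample, order)))
```

    Running this over a few candidate orderings of a real file is a cheap way to find out whether a fixed column permutation is worth applying before compression.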

  5. #5
    Member
    Join Date
    Jan 2014
    Location
    Bothell, Washington, USA
    Posts
    695
    Thanks
    153
    Thanked 182 Times in 108 Posts
    LZHAM is an open source command-line compressor that runs on Linux and provides a good compression ratio with fast decoding.
