Results 1 to 23 of 23

Thread: VTEnc

  1. #1
    Member
    Join Date
    Feb 2018
    Location
    Edinburgh
    Posts
    16
    Thanks
    10
    Thanked 6 Times in 5 Posts

    VTEnc

    Hi everybody, my name is Vicente. Although I'm not strictly a new user in this forum, this is the first time I actually write something here. I've written a blog post about a new integer compressor I've come up with and I would like to get some feedback from this fantastic community. This is the link:

    https://vteromero.github.io/2019/07/28/vtenc.html

    I've done some research and I haven't found any other similar algorithm, but I may be wrong. So, I'd be interested in validating that it's genuinely novel.

    I'm already developing a library that implements the algorithm and I'm planning to publish it as an open source project. I'll keep this thread updated with the progress I'm making.

    Thanks!

  2. Thanks (2):

    data man (11th November 2019),snowcat (29th July 2019)

  3. #2
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,897
    Thanks
    291
    Thanked 1,267 Times in 715 Posts
    1) For 32-bit numbers you can compare your results with "lzma -lc0 -lp2 -pb2".

    2) There're some preprocessing algorithms here: https://github.com/powturbo/TurboTranspose
    which can be used in conjunction with any compression algorithm.

    3) Also these:
    http://freearc.dreamhosters.com/mm11.zip
    http://freearc.dreamhosters.com/delta151.zip
    http://nishi.dreamhosters.com/u/lucas_v0.rar https://github.com/loxxous/Jampack/b...er/filters.cpp

    4) optimfrog can be used to compress integers and floats,
    since its lossless and there's a headerless "raw audio" mode: http://losslessaudio.org/Downloads.php

    5) paq8px (here or https://github.com/hxim/paq8px) can be used to estimate max compression for your data.

  4. Thanks:

    vteromero (29th July 2019)

  5. #3
    Member
    Join Date
    Feb 2018
    Location
    Edinburgh
    Posts
    16
    Thanks
    10
    Thanked 6 Times in 5 Posts
    Thanks Shelwien!

    I was thinking in comparing my results mostly against some integer algorithms like FastPFor, BitPacking, Elias-Fano, Binary Interpolative, etc. But I think it's a good idea to use some general-purpose compressor as well.

  6. #4
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,897
    Thanks
    291
    Thanked 1,267 Times in 715 Posts

  7. Thanks:

    vteromero (1st August 2019)

  8. #5
    Member
    Join Date
    Feb 2018
    Location
    Edinburgh
    Posts
    16
    Thanks
    10
    Thanked 6 Times in 5 Posts
    I just published an open source C library to encode/decode sorted integer sequences using VTEnc algorithm. Here is the link: https://github.com/vteromero/vtenc

    At the moment, I'm working on benchmarking the library, using different data sets and comparing it with other integer algorithms. For now, I've managed to test it using 2 different data sets (ts.txt and gov2.sorted), which are also tested by TurboPFor project. In both cases, the library delivers better compression ratio than the results shown by TurboPFor. Encoding and decoding speed are still to be tested.

    The library is in its very first version, so there many things to work on and improve. Any help and/or suggestions are welcomed!

  9. Thanks:

    dnd (29th October 2019)

  10. #6
    Member
    Join Date
    Aug 2008
    Location
    Planet Earth
    Posts
    972
    Thanks
    96
    Thanked 389 Times in 272 Posts
    VSEncoding: Efficient Coding and Fast Decoding of Integer Lists via Dynamic Programming
    https://www.researchgate.net/publica...ic_Programming

  11. Thanks:

    vteromero (29th October 2019)

  12. #7
    Member
    Join Date
    Feb 2018
    Location
    Edinburgh
    Posts
    16
    Thanks
    10
    Thanked 6 Times in 5 Posts
    Thanks Sportman! I didn't know about VSEncoding. I'll read the paper to see if I include it as part of the benchmarks I'm working on.

  13. #8
    Member
    Join Date
    Feb 2018
    Location
    Edinburgh
    Posts
    16
    Thanks
    10
    Thanked 6 Times in 5 Posts
    Here are the first results of comparing VTEnc with other integer compression algorithms:

    Code:
    | Algorithm          |Encoded Size|Ratio %|Encoding Speed|Decoding Speed|
    |:-------------------|-----------:|------:|-------------:|-------------:|
    | VTEnc              |      21,686| 0.0038|  60.9155 G/s |   734.54 M/s |
    | Delta+FastPFor256  |   1,179,312|   0.20|  2.00714 G/s |  4.75146 G/s |
    | Delta+FastPFor128  |   2,306,544|   0.40|   1.9029 G/s |  4.82925 G/s |
    | Delta+BinaryPacking|   4,552,280|   0.79|   8.5867 G/s |  5.77439 G/s |
    | Delta+VariableByte | 144,285,504|   25.0|  4.86063 G/s |  5.09461 G/s |
    | Delta+VarIntGB     | 180,356,880|  31.25|  6.75428 G/s |   9.2638 G/s |
    | Copy               | 577,141,992|  100.0|  10.4087 G/s |       -      |
    These results correspond to benchmarking the data set ts.txt, which is a text file with a large list of timestamps. This data set is distinguished by having many repeated values.

    In this case, VTEnc offers an impressive compression ratio (0.0038% of the original size) at an even more impressive speed (60 G/s). The decoding speed is not very good though.

    I'll keep you posted with new results for other data sets.
    Last edited by vteromero; 10th November 2019 at 18:35. Reason: Trying to fix format issues on table

  14. #9
    Member
    Join Date
    Aug 2014
    Location
    Argentina
    Posts
    536
    Thanks
    236
    Thanked 90 Times in 70 Posts
    I'd be happy to help you test, if you tell me what kinds of tests are you interested in. I would prefer if you wrote an executable using the library the way you intend for it to be used, although I believe I could compile the sample programs on the GitHub description if thatś the way to go.

  15. Thanks:

    vteromero (12th November 2019)

  16. #10
    Member
    Join Date
    Aug 2015
    Location
    The Earth
    Posts
    11
    Thanks
    3
    Thanked 21 Times in 7 Posts
    It would be interesting to compare with https://github.com/RoaringBitmap/CRoaring

  17. #11
    Member
    Join Date
    Feb 2018
    Location
    Edinburgh
    Posts
    16
    Thanks
    10
    Thanked 6 Times in 5 Posts
    I'm now focused on testing the library with different data sets. I've identified a couple of suitable ones that have been tested by others before: ts.txt (a text file with a big list of timestamps) and gov2.sorted (which is part of the "Document identifier data set" created by D. Lemire). Besides existing data sets, it'd be ideal to test it with various data distributions such as random distributions. I need to think more about this option.

    In any case, I decided to create a separate repository to do the benchmarking. So, If anyone is interesting in helping with testing/benchmarking, contributions are welcome! Here is the repository: https://github.com/vteromero/integer...ion-benchmarks

    With regards to the algorithms to test, I'd like to use as much as possible. For now, I'm using the ones included in SIMDCompressionAndIntersection because they are optimised for sorted list of integers. But I have in mind others like: Elias-Fano, Binary Interpolative, Roaring, etc.

    Thanks!

  18. #12
    Member
    Join Date
    Mar 2013
    Location
    Worldwide
    Posts
    565
    Thanks
    67
    Thanked 199 Times in 147 Posts

    TurboPFor Vs. VTEnc

    I've integrated VTEnc into icapp in TurboPFor.
    ​bvzenc32 and TurboPFor256 are 2 functions in the TurboFor integer compression package.

    Benchmark 1: Timestamp file 'ts.txt'

    ./icapp ts.txt -Ft -Elzturbo,39

    Code:
      E MB/s     size     ratio     D MB/s function 
       22.67        114   0.0000%  8598.40 lzv8zenc         TurboByte zigzag+lzturbo,39        
    76615.16      21686   0.0038%   865.40 vtenc            VTEnc lib                
    13310.16      45909   0.0080% 13467.32 bvzenc32         bitio zigzag (TurboPFor)
    Benchmark 2: Generate a sorted ZIPFIAN distribution
    ./icapp ZIPF -d0 -Eturboanx
    zipf alpha=1.50 range[0..255].n=1000000

    Code:
      E MB/s     size     ratio     D MB/s function 
      156.96     518069  12.95%    2359.88 lzv8zenc         TurboByte zigzag+turboanx
       98.52     582439  14.56%      88.26 vtenc            VTEnc lib                
     1536.10     628983  15.72%    7874.02 p4ndenc256v32    TurboPFor256 delta
    The benchmark 1 can be seen only as indication, TurboByte+lzturbo compress this 600MB (converted from 1,6 GB text) timestamps to only 114 bytes.

    The benchmark 2 is more realistic and here TurboPFor is decompressing 90 times faster than VTEnc.
    Don't get me wrong, i find the work very interesting, but sorry the current VTEnc decoding is simply too slow as integer compression.
    Note that VTEnc can only process sorted integer sequences.

  19. Thanks:

    vteromero (13th November 2019)

  20. #13
    Member
    Join Date
    Feb 2018
    Location
    Edinburgh
    Posts
    16
    Thanks
    10
    Thanked 6 Times in 5 Posts
    Thanks @dnd for the analysis.

    I'd like to make clear a few points about the algorithm and the corresponding library though:


    • Because of algorithmic limitations, the decoding speed is always slower than the encoding speed. For some data sets, that difference might be huge.
    • The library hasn't been fully tested yet. That's a priority for me at the moment.
    • The library hasn't been optimised in terms of performance.


    I'm aware of the limitations of the library and I'm thoroughly testing it to get a baseline to compare against and to start working on trying to improve it. Any constructive feedback and/or contribution are welcomed.

  21. #14
    Member
    Join Date
    Mar 2013
    Location
    Worldwide
    Posts
    565
    Thanks
    67
    Thanked 199 Times in 147 Posts
    Some suggestions for decoding:
    - As the bit trie length is known, remove recursion and unroll the loops.
    - inline decode_full_subtree, set_ones_at_bit_pos,...
    - compile with sse/avx2 or -march=native to enable auto vectorization.
    - inline small functions
    - use bit mainpulation instead of lookup tables.


    - Add direct access and navigation get next greather than function.
    - compare to elias fano and interpolative coding

  22. Thanks:

    vteromero (14th November 2019)

  23. #15
    Member
    Join Date
    Feb 2018
    Location
    Edinburgh
    Posts
    16
    Thanks
    10
    Thanked 6 Times in 5 Posts
    That's very useful, thanks @dnd!

    Quote Originally Posted by dnd View Post
    - As the bit trie length is known, remove recursion and unroll the loops.
    The tree depth is known upfront, but not the number of different paths, so not sure how I can do that...

    Quote Originally Posted by dnd View Post
    - inline decode_full_subtree, set_ones_at_bit_pos,...
    - inline small functions
    Yes, inlining functions is one of the things I had in mind.

    Quote Originally Posted by dnd View Post
    - compile with sse/avx2 or -march=native to enable auto vectorization.
    Nice trick, I'll try that.

    Quote Originally Posted by dnd View Post
    - use bit mainpulation instead of lookup tables.
    Interesting... Is it always better to use bitwise operations instead of lookup tables?

    Quote Originally Posted by dnd View Post
    - Add direct access and navigation get next greather than function.
    I don't know what you mean here.

    Quote Originally Posted by dnd View Post
    - compare to elias fano and interpolative coding
    Definitely, will do that.

  24. #16
    Member
    Join Date
    Mar 2013
    Location
    Worldwide
    Posts
    565
    Thanks
    67
    Thanked 199 Times in 147 Posts
    The tree depth is known upfront, but not the number of different paths, so not sure how I can do that...
    You can probably replace the recursion by an iteration using a stack.
    Search: https://www.google.com/search?q=recu...teration+stack

    Is it always better to use bitwise operations instead of lookup tables?
    Not always better, but it's faster and more elegant to a use a simple 1<<i instead of a lookup table.

    I don't know what you mean here.
    - Direct Access: as you are using a bit trie a think it is possible to add a fast direct access function to get the value at index i
    without decompression.
    - Get next value equal or greather: given a value and a start index, search and return the next (index and value) that's '>='.
    see file vint.c in turbopfor.

  25. #17
    Member CompressMaster's Avatar
    Join Date
    Jun 2018
    Location
    Lovinobana, Slovakia
    Posts
    186
    Thanks
    49
    Thanked 13 Times in 13 Posts
    @vteromero,
    is your tool able to compress repeated, but random numbers (see below)? How much?

    Code:
    984165748916348965149746136498416167491603648941631441034984161654974163649416341941610674984631489614896416106749641630648964160364194164741034974163489416031949641314941649741649616461649461049419649164896167489461631064894614987410631064974160361048941638794615748956110374894163541036741603574891689416301657489641631034974896461037498416498416309416389674166504894645025848769848769856923856971425032412098532047869525869845269745812579624868552832658623589753867854696325879635236898557429865748165416479741674894167416789416578946741636416541974163164789678496116414741661646467479494961641674946164161165646494964649741679264358579456894674658549164801670460234946106165106064149610164160149610364416530641603496416103461030346510304641034896413616416316746141136416465174864164146741616641361064641616410346846489237685289859846536452687625896345876578625872456387638793687632783678536876372786378368726874378654274687245374201876587436246368746345348634834563685

  26. #18
    Member
    Join Date
    Feb 2018
    Location
    Edinburgh
    Posts
    16
    Thanks
    10
    Thanked 6 Times in 5 Posts
    Quote Originally Posted by dnd View Post
    - Direct Access: as you are using a bit trie a think it is possible to add a fast direct access function to get the value at index i
    without decompression.
    - Get next value equal or greather: given a value and a start index, search and return the next (index and value) that's '>='.
    see file vint.c in turbopfor.
    I see, that's interesting, I'll give it some thought.

  27. #19
    Member
    Join Date
    Feb 2018
    Location
    Edinburgh
    Posts
    16
    Thanks
    10
    Thanked 6 Times in 5 Posts
    Quote Originally Posted by CompressMaster View Post
    @vteromero,
    is your tool able to compress repeated, but random numbers (see below)? How much?

    Code:
    984165748916348965149746136498416167491603648941631441034984161654974163649416341941610674984631489614896416106749641630648964160364194164741034974163489416031949641314941649741649616461649461049419649164896167489461631064894614987410631064974160361048941638794615748956110374894163541036741603574891689416301657489641631034974896461037498416498416309416389674166504894645025848769848769856923856971425032412098532047869525869845269745812579624868552832658623589753867854696325879635236898557429865748165416479741674894167416789416578946741636416541974163164789678496116414741661646467479494961641674946164161165646494964649741679264358579456894674658549164801670460234946106165106064149610164160149610364416530641603496416103461030346510304641034896413616416316746141136416465174864164146741616641361064641616410346846489237685289859846536452687625896345876578625872456387638793687632783678536876372786378368726874378654274687245374201876587436246368746345348634834563685
    The library can compress sorted lists of unsigned integers. The API exposes two flavours of sequences: lists and sets. Lists can held repeated values, sets cannot. An example of list would be [22, 44, 77, 77, 77, 109] and a set looks like this [12, 15, 17, 66, 1341]. Both lists and sets work with 8, 16, 32 and 64 bits.

    I don't know if this explanation answers your question.

  28. #20
    Member
    Join Date
    Feb 2018
    Location
    Edinburgh
    Posts
    16
    Thanks
    10
    Thanked 6 Times in 5 Posts
    The results of benchmarking gov2.sorted:

    Code:
    | Algorithm          |Encoded Size     |Ratio %  |Encoding Speed|Decoding Speed|
    |:-------------------|----------------:|--------:|-------------:|-------------:|
    | VTEnc              |    2,889,599,350|    12.08|     77.69 M/s|     69.08 M/s|
    | Delta+FastPFor128  |    3,849,161,656|    16.09|    855.60 M/s|    850.43 M/s|
    | Delta+FastPFor256  |    3,899,341,376|    16.30|    915.43 M/s|    916.01 M/s|
    | Delta+BinaryPacking|    4,329,919,808|    18.10|      3.03 G/s|      2.94 G/s|
    | Delta+VariableByte |    6,572,084,696|    27.48|      2.05 G/s|      2.09 G/s|
    | Delta+VarIntGB     |    7,923,819,720|    33.13|      2.54 G/s|      3.73 G/s|
    | Copy               |   23,918,861,764|    100.0|      6.27 G/s|       -      |
    As I expected, VTEnc's encoding and decoding speed are poor. The encoding ratio is the best among the tested algorithms though.

    For now, I'm going to stop doing more benchmarking and going to focus on improving the encoding and decoding speed.

  29. #21
    Member
    Join Date
    Feb 2018
    Location
    Edinburgh
    Posts
    16
    Thanks
    10
    Thanked 6 Times in 5 Posts
    Version v0.0.3 is out! Link

    Compared with the initial version, the library is now about 30% faster when encoding and over 85% faster when decoding. However, there is still a long stretch to catch up with other similar algorithms in terms of encoding and decoding speed.

    Benchmarks for gov2.sorted:

    Code:
    | Algorithm          |Encoded Size     |Ratio %  |Encoding Speed|Decoding Speed|
    |:-------------------|----------------:|--------:|-------------:|-------------:|
    | VTEnc              |    2,889,599,350|    12.08|     75.86 M/s|     97.50 M/s|
    | Delta+FastPFor128  |    3,849,161,656|    16.09|    641.80 M/s|    645.42 M/s|
    | Delta+FastPFor256  |    3,899,341,376|    16.30|    682.57 M/s|    679.37 M/s|
    | Delta+BinaryPacking|    4,329,919,808|    18.10|      2.32 G/s|      2.25 G/s|
    | Delta+VariableByte |    6,572,084,696|    27.48|      1.51 G/s|      1.59 G/s|
    | Delta+VarIntGB     |    7,923,819,720|    33.13|      1.79 G/s|      2.86 G/s|
    | Copy               |   23,918,861,764|    100.0|      4.96 G/s|       -      |
    I'm working now on v0.1.0 which will add the ability to specify a compression depth. This parameter will tell the encoder how many most-significant bits to encode, leaving the rest of least-significant bits untouched. In this way, the user will be able to fine-tune the degree of compression on each case, and consequently the encoding and decoding speed.

  30. #22
    Member
    Join Date
    Feb 2018
    Location
    Edinburgh
    Posts
    16
    Thanks
    10
    Thanked 6 Times in 5 Posts
    VTEnc v0.2.0 has been released! (CHANGELOG)

    The most remarkable feature of this version is the inclusion of encoding parameters, which need to be provided when using encoding and decoding functions. Encoding parameters give users the ability to specify the way sequences are encoded, resulting in different encoding/decoding speeds and encoding ratios. For now, there are 3 encoding parameters: allow_repeated_values, skip_full_subtrees and min_cluster_length. I'm writing a blog post to explain them in depth, I'll post the link in this thread once it's published.

    Here are the benchmark results on gov2.sorted data set:

    Click image for larger version. 

Name:	gov2_enc_speed_vs_ratio.png 
Views:	35 
Size:	48.9 KB 
ID:	7620
    Click image for larger version. 

Name:	gov2_dec_speed.png 
Views:	30 
Size:	26.4 KB 
ID:	7621

    (*) VTEnc's results on "Encoding speed vs ratio" chart are for the following values of the encoding parameter min_cluster_length: 1, 2, 4, 8, 16, 32, 64, 128 and 256.
    (**) VTEnc's decoding speed is for min_cluster_length = 256.



  31. Thanks:

    SolidComp (22nd May 2020)

  32. #23
    Member
    Join Date
    Feb 2018
    Location
    Edinburgh
    Posts
    16
    Thanks
    10
    Thanked 6 Times in 5 Posts
    If anyone is interested, here's a blog post about encoding parameters in VTEnc library: https://vteromero.github.io/2020/06/...c-library.html

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •