Results 1 to 28 of 28

Thread: Fascinating compression cost model, where cost = $$$

  1. #1
    Member SolidComp's Avatar
    Join Date
    Jun 2015
    Location
    USA
    Posts
    353
    Thanks
    131
    Thanked 54 Times in 38 Posts

    Fascinating compression cost model, where cost = $$$

    Hi all – This report by LucidChart is remarkable. They tested brotli, gzip, Zstd, and other compression, as well as numerous binary serialization formats, and evaluated them based on dollars saved in storage and compute costs. They use AWS Lambdas and S3 storage. (Lambdas are pay as you go functions, instead of traditional server instances.)

    The most compact binary format, uncompressed, was Amazon's new Ion format. (They didn't test FlatBuffers or Protocol Buffers, or Cap'n Proto, but they tested CBOR, Smile, BSON, etc.) They tested every format in combination with every level of every compressor.

    The winner was JSON + Brotli 11. This surprised me. It turns out that the compression codec is more important than the data serialization format. Even though JSON is a bloaty text format, Brotli 11 sort of makes up for it. I'm not sure why Ion + Brotli 11 didn't win, at least by a little bit. Their use case was more focused on long-term storage. When people discover that compression codecs are more important than underlying data formats, they forget that the uncompressed payload is going to take up RAM, so uncompressed size matters too, but if all you're worried about is long-term compressed storage then that doesn't matter as much.

  2. Thanks:

    Shelwien (11th June 2020)

  3. #2
    Member
    Join Date
    Mar 2016
    Location
    USA
    Posts
    58
    Thanks
    8
    Thanked 24 Times in 16 Posts
    It looks like compressed textual Ion did slightly beat JSON for size not counting cost. My guess would be that the textual formats have a restricted range of common characters that provide for more matches and thus less space used once compressed compared to first binary encoding the data, similar to how processors like precomp can give more compression. I am curious what kind of overhead Ion has (it's a JSON superset) to fall off the cost efficiency list, but since the start and end formats were JSON, any conversion is likely to lose.

  4. #3
    Member
    Join Date
    Dec 2011
    Location
    Cambridge, UK
    Posts
    506
    Thanks
    187
    Thanked 177 Times in 120 Posts
    Far more domain specific, but I did a similar analysis at some point for genome sequencing alignment formats: BAM vs CRAM.

    Again it was using S3 storage (vanilla S3, but also things like glacier) and CPU cost (it was a spot-price at the time). I was considering encode + decode, but took a different approach. Instead of asking what is the cheapest format for 1 month, I asked at what time period do you break even. Given CRAM is so much smaller than BAM but less well supported, I was considering the use case of tools that only support BAM, but where we convert BAM to CRAM, store it for a while, and then convert back to (uncompressed) BAM when the tool next needs to reread the file. This method is pretty sound for any file / compression format and any tool.

    I don't have a the slides readily available, but basically it come out as a staggeringly low 1 day! Guess which format is still dominant. *thud* I think it was 3 days if not using spot pricing and going for on-demand, or closer to a month when you get to the really cheap storage systems, but if you're using those you also do want to have your data not being accesed as the cost of re-reading is high.

  5. Thanks:

    Mike (11th June 2020)

  6. #4
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,982
    Thanks
    298
    Thanked 1,309 Times in 745 Posts
    Thanks for posting this, we're actually struggling with similar metrics for the contest.

    Actual formula seems to be like this:
    $GB = 1024*1024*1024;

    $B4 = 0.000002917 * 10; # Lamda Cost per Second (for a full vCPU)
    $B5 = 0.022; # S3 Cost (per GB/Month)
    $B6 = 0.0125; # S3 IA Cost (per GB/Month)
    $B11 = 0.9; # Percent of S3 data in IA (rather than Standard)
    $B12 = 90; # Expected Lifespan of data (in months)
    $B13 = 2; # Number of times (on average) decompressing a given snapshot
    $B14 = 6; # CPU Multiplier

    $B19 = $B4*$B14; # Expected Cost per CPU millisecond of measured time
    $B20 = $B11*$B6 + (1-$B11)*$B5; # Average Cost per GB per month

    $cost = $csize/$GB*($B20)*$B12+($ctime+$dtime*$B13)*$B19;


    But *6*10 multipliers for csize went unexplained afaik, and this metric gives too much priority to compression ratio imho.
    (Where brotli wins, because only brotli has integrated dictionary enabled by default).
    When this metric is applied to enwik10 benchmark results we get this:

    Code:
      2095 1675874699   781.839   198.309 // bwtturbo -59 -t0 (v20.2)
      2212 1648810080   957.511   530.447 // bsc -m0 -b1024 -e1 -T  (v3.1.0)
      2216 1679319006   765.832   540.897 // turborc -26 (v20.01, Feb)
      2218 1722407658   778.796   401.317 // m99 -b1000000000 -t1 (beta)
      2251 1638441156  1030.489   640.502 // bsc -m0 -b1024 -e2 -T (v3.1.0)
      2335 1679330904   757.786   885.980 // turborc -26 (v20.01, Jan)
      2473 1718776810  1267.296   896.259 // rings -m8 -t1 (v2.5)
      2484 1907435970  1188.583   360.458 // nz -co -t1 (v0.09 alpha)
      2490 1632628624  1146.133  1284.451 // bcm -9 (v1.40)
      2491 1852597854   403.030   948.316 // bsc -b1024 -m6 -e2 -T (v3.1.0)
      2505 1899403918  1327.809   375.295 // nz -cO -t1 (v0.09 alpha)
      2660 2129702726  1048.281   218.249 // flashzip -mx3 -k7 -b1024 -t1 (v1.1.3)
      2729 2339988137   372.592    73.660 // orz (v1.6.1)
      2904 2393693507   372.309   402.165 // ppmd e -m256 -m16 
      2925 1670916589  1890.327  2031.961 // zpaq -m311 -t1 (v7.15)
      2929 1819439647  1593.126  1713.603 // ccm 7 (v1.30c)
      2934 1595984275  2165.922  2160.740 // zcm -m8 -t1 (v0.93)
      2959 1450364034  2701.335  2433.988 // mcm -x -m11 (v0.83)
      2989 2533351070   692.367    34.851 // rar -m2 -ma5 -mt1 (v5.80)
      3011 2429129903  1492.971    34.008 // rar -m3 -ma5 -mt1 (v5.80)
      3028 2450374607  1495.895    12.653 // zstd -16 (v1.4.4)
      3033 2660370879   153.103    19.993 // lzturbo -32 -p0 (v1.2)
      3036 2559864772   455.693   201.243 // balz c (v1.20)
      3050 2536732397   327.778   380.364 // sr3 c
      3056 2530792260   891.915   136.339 // 7z -tbzip2 -mx5 -mmt1 (v19.02)
      3062 2380937714  1784.430   188.703 // balz cx (v1.20)
      3070 2569453043   967.368    12.997 // zstd -14 (v1.4.4)
      3074 2380073078  2210.100    13.032 // zstd -17 (v1.4.4)
      3079 2669751039   374.093    13.039 // zstd -11 (v1.4.4)
      3082 2614435940   745.449    12.699 // zstd -13 (v1.4.4)
      3084 2573497114  1007.449    18.956 // brotli -q 8 (v1.0.7)
      3084 2645099063   560.829    13.136 // zstd -12 (v1.4.4)
      3088 2359812241  2255.704    96.261 // zpaq -m211 -t1 (v7.15)
      3089 2639801619   606.180    19.867 // brotli -q 7 --large_window=30 (v1.0.7)
      3090 2564942693  1094.956    20.000 // brotli -q 8 --large_window=30 (v1.0.7)
      3091 2642417079   603.066    18.940 // brotli -q 7 (v1.0.7)
      3092 2694582971   288.670    12.842 // zstd -10 (v1.4.4)
      3094 2337506087  2596.459    13.159 // zstd -18 --single-thread (v1.4.4)
      3096 1975858589  1772.406  1595.793 // zpaq -m3 -t1 (v7.15)
      3105 2539608782  1356.777    13.194 // zstd -15 (v1.4.4)
      3106 2347338736  2601.788    13.256 // zstd -18 (v1.4.4)
      3110 2205939963  3458.376    52.265 // 7z -t7z -mx5 -mmt1 (v19.02)
      3119 2465667570  1738.196   102.153 // arc -m4x -mt1 (v0.67)
      3123 2464789337  1757.395   106.073 // arc -m4 -mt1 (v0.67)
      3142 2512659846  1734.988    18.909 // brotli -q 9 (v1.0.7)
      3146 2750214835   238.516    12.896 // zstd -9 (v1.4.4)
      3149 2621986425   842.478   131.427 // 7z -tbzip2 -mx3 -mmt1 (v19.02)
      3155 2114649146  4297.935    53.732 // 7z -t7z -mx7 -mmt1 (v19.02)
      3156 2299225559  3196.329    13.559 // zstd -19 --single-thread (v1.4.4)
      3168 2780718316   168.903    12.921 // zstd -8 (v1.4.4)
      3172 2197171322  3947.477    14.471 // zstd -20 --ultra --single-thread (v1.4.4)
      3192 2377987975  2855.844    34.128 // rar -m4 -ma5 -mt1 (v5.80)
      3196 2303077470  3402.543    13.394 // zstd -19 (v1.4.4)
      3198 2780429266   324.231    20.307 // brotli -q 6 --large_window=30 (v1.0.7)
      3199 2059053547  4909.124    55.188 // 7z -t7z -mx9 -mmt1 (v19.02)
      3200 2780438122   335.040    19.575 // brotli -q 6 (v1.0.7)
      3206 2819369488   131.372    13.304 // zstd -7 (v1.4.4)
      3208 2491076452  2246.489    20.668 // brotli -q 9 --large_window=30 (v1.0.7)
      3219 2136340302  4604.696    15.121 // zstd -21 --ultra --single-thread (v1.4.4)
      3246 2766202609   511.078   111.412 // arc -m3x -mt1 (v0.67)
      3250 2764992952   530.674   115.419 // arc -m3 -mt1 (v0.67)
      3256 2166669782  4468.655    91.262 // arc -m5 -mt1 (v0.67)
      3256 2168961223  4460.496    87.027 // arc -m5x -mt1 (v0.67)
      3265 2812779013   412.488    64.311 // 7z -t7z -mx3 -mmt1 (v19.02)
      3266 2604699073  1722.851    81.555 // zpaq -m2 -t1 (v7.15)
      3271 2080479075  5257.974    15.484 // zstd -22 --ultra --single-thread (v1.4.4)
      3277 2770795320   761.115    58.366 // nz -cD -t1 (v0.09 alpha)
      3285 2776664445   767.958    58.799 // nz -cDP -t1 (v0.09 alpha)
      3288 2108472866  5035.285    86.917 // arc -m6x -mt1 (v0.67)
      3288 2872160117   244.212    21.151 // brotli -q 5 --large_window=30 (v1.0.7)
      3289 2872161150   251.839    20.411 // brotli -q 5 (v1.0.7)
      3291 2105664967  5058.601    91.096 // arc -m6 -mt1 (v0.67)
      3305 2802836601   713.652    58.594 // nz -cDp -t1 (v0.09 alpha)
      3309 2530138208  2337.551   136.101 // 7z -tbzip2 -mx7 -mmt1 (v19.02)
      3315 2269340648  4296.761    14.471 // zstd -20 (v1.4.4)
      3316 2921997106    96.765    14.029 // zstd -6 (v1.4.4)
      3328 2048387913  5643.999    90.970 // arc -m7 -mt1 (v0.67)
      3328 2051647414  5626.398    86.924 // arc -m7x -mt1 (v0.67)
      3362 2357818671  3953.092    34.300 // rar -m5 -ma5 -mt1 (v5.80)
      3368 2901657843   381.952    85.875 // zpaq -m111 -t1 (v7.15)
      3376 2947761814   196.714    53.101 // arc -m2x -mt1 (v0.67)
      3379 2945920748   219.401    57.051 // arc -m2 -mt1 (v0.67)
      3385 1998346156  6299.177    86.288 // arc -m8x -mt1 (v0.67)
      3386 2600495919  2546.352    27.126 // bcrush -9 --optimal (v0.2.0)
      3388 1995184284  6329.592    90.224 // arc -m8 -mt1 (v0.67)
      3392 2993531459    72.476    14.123 // zstd -5 (v1.4.4)
      3402 2229014084  5052.537    15.063 // zstd -21 (v1.4.4)
      3410 1976138624  6585.154    85.647 // arc -m9x -mt1  (v0.67)
      3414 3004577735    52.249    53.043 // slug (v1.27b)
      3416 1973568508  6626.946    89.762 // arc -m9 -mt1 (v0.67)
      3423 2811491748   858.957   296.051 // nlzm -window:30 cf
      3476 3072049859    46.019    14.061 // zstd -4 (v1.4.4)
      3488 2948591837   731.412   103.179 // 7z -tbzip2 -mx1 -mmt1 (v19.02)
      3497 3028529073   326.396    75.082 // 7z -t7z -mx1 -mmt1 (v19.02)
      3501 1776602973  5371.128  1595.793 // zpaq -m4 -t1 (v7.15)
      3516 3078914611   240.124     9.381 // zhuff -c2 -t1 (v0.99beta)
      3520 3095248795   137.881    20.738 // brotli -q 4 (v1.0.7)
      3520 3095575993   133.367    21.501 // brotli -q 4 --large_window=30 (v1.0.7)
      3548 3137189321    43.010    13.343 // zstd -3 (v1.4.4)
      3558 3112379088   164.636    60.788 // thor e4 (v0.96 alpha)
      3562 3095434783   266.915    74.614 // zpaq -m1 -t1 (v7.15)
      3567 3145771456    50.885    34.649 // etincelle (RC2)
      3567 3149492468    54.666    20.956 // lzturbo -31 -p0 (v1.2)
      3574 2126337015  6682.010    21.238 // lzturbo -39 -p0 (v1.2)
      3590 3158641894   150.717     9.848 // zhuff -c1 -t1 (v0.99beta)
      3605 3142120444   244.059    57.728 // thor e5 (v0.96 alpha)
      3654 3220150965    94.157    22.117 // brotli -3 (v1.0.7)
      3661 2364340233  5658.734    15.333 // zstd -22 (v1.4.4)
      3681 3127103081   811.434    40.315 // 7z -tzip -mx5 -mmt1 (v19.02)
      3707 3268318786    84.320    22.739 // brotli -q 2 --large_window=30 (v1.0.7)
      3707 3268318786    84.742    22.775 // brotli -q 2 (v1.0.7)
      3716 3263468943   143.662    34.122 // crush cf
      3721 3292420658    37.645     9.609 // zhuff -c0 -t1 (v0.99beta)
      3731 3240805000   422.917    10.572 // lizard -45 (v1.0.0)
      3734 3242491764   341.968    53.809 // gzip -7 (v1.3.12)
      3736 2028356329  8110.938    87.672 // lzturbo -49 -p0 (v1.2)
      3736 3249895474   307.467    53.869 // gzip -6 (v1.3.12)
      3737 3237812198   392.835    53.771 // gzip -9 (v1.3.12)
      3737 3237924757   390.420    53.756 // gzip -8 (v1.3.12)
      3749 3292612761   149.939    32.771 // bcrush -1 (v0.2.0)
      3758 3277334825   245.406    59.173 // nz -cd -t1 (v0.09 alpha)
      3759 3325794738    33.481    12.952 // zstd -2 (v1.4.4)
      3773 3294626623   230.555    54.414 // gzip -5 (v1.3.12)
      3786 3281731612   374.211    59.403 // nz -cdP -t1 (v0.09 alpha)
      3812 2220027943  7439.064    22.690 // brotli -q 10 (v1.0.7)
      3812 3313143745   319.411    59.965 // nz -cdp -t1 (v0.09 alpha)
      3821 3010547116  2420.188    10.074 // lizard -49 (v1.0.0)
      3830 3373717657   129.235    11.727 // lzturbo -22 -p0 (v1.2)
      3848 3373871144   175.480    41.977 // 7z -tzip -mx1 -mmt1 (v19.02)
      3849 3373871144   175.698    42.025 // 7z -tzip -mx3 -mmt1 (v19.02)
      3888 3403958298   180.169    55.199 // gzip -4 (v1.3.12)
      3919 3113460664  2257.262    40.351 // 7z -tzip -mx7 -mmt1 (v19.02)
      3955 3496587503    43.607    18.271 // lzturbo -30 -p0 (v1.2)
      3979 3343138171  1183.661     8.654 // lizard -39 (v1.0.0)
      3982 1963826556 10050.010    28.048 // brotli -q 10 --large_window=30 (v1.0.7)
      3991 3517193402    56.045    47.756 // arc -m1 -mt1 (v0.67)
      3991 3517193402    56.106    47.633 // arc -m1x -mt1 (v0.67)
      4013 3545817857    41.997    24.646 // brotli -q 1 --large_window=30 (v1.0.7)
      4013 3545817857    42.569    24.609 // brotli -q 1 (v1.0.7)
      4037 1691093945  3971.057  4102.109 // cmm4 77 (v0.1e)
      4052 3551047816   168.021    55.993 // gzip -3 (v1.3.12)
      4079 3583341678   133.771    47.076 // rar -m1 -ma5 -mt1 (v5.80)
      4101 2635875708  6429.396    14.196 // lzturbo -29 -p0 (v1.2)
      4110 3638532812    23.309    11.745 // zstd -1 (v1.4.4)
      4123 3621357269    62.690    84.463 // nz -cF -t1 (v0.09 alpha)
      4187 3673955779   143.553    57.417 // gzip -2 (v1.3.12)
      4194 3653797048   410.854     8.592 // lizard -25 (v1.0.0)
      4199 3689371924   214.152     7.134 // lz4 -9 (v1.9.2)
      4199 3694752709   180.757     7.362 // lz4 -7 (v1.9.2)
      4209 3713877399   113.441     8.040 // lizard -35 (v1.0.0)
      4227 3727069603   132.560     7.187 // lz4 -5 (v1.9.2)
      4241 3729651972    78.545    67.090 // thor e3 (v0.96 alpha)
      4248 3408619518  2301.506     8.174 // lizard -29 (v1.0.0)
      4257 3765930127    43.910    11.690 // lzturbo -21 -p0 (v1.2)
      4264 3660882443   767.399     7.633 // lz4x -9 (v1.60)
      4265 3770151519    34.141    26.256 // brotli -q 0 --large_window=30 (v1.0.7)
      4265 3770151519    34.216    26.236 // brotli -q 0 (v1.0.7)
      4286 3660882443   882.019    13.095 // blz4 -9 (0.2.0)
      4306 2530138098  8036.375   135.975 // 7z -tbzip2 -mx9 -mmt1 (v19.02)
      4339 3669477746  1143.735     7.320 // lizard -19 (v1.0.0)
      4354 3823273187   136.146    59.070 // gzip -1 (v1.3.12)
      4357 3848919298    89.150     7.451 // lz4 -3 (v1.9.2)
      4376 3860617999   121.830     7.159 // lizard -15 (v1.0.0)
      4412 3895465180   107.015     7.319 // lz4x -5 (v1.60)
      4414 3891832957   129.520    13.702 // blz4 -5 (v0.2.0)
      4417 3909521247    32.603    11.287 // lizard -40 (v1.0.0)
      4432 3913220262   102.474     8.227 // lzturbo -12 -p0 (v1.2)
      4434 2746032286  7587.384    29.749 // crush cx
      4496 3111145623  5571.823    40.397 // 7z -tzip -mx9 -mmt1 (v19.02)
      4766 4217868600    46.518     8.474 // lzturbo -11 -p0 (v1.2)
      4786 4220364579   128.482    16.681 // qpress64 -L3T1 (v1.1)
      4830 4250963422   105.659    55.171 // brieflz (v1.05)
      4830 4260604506    46.419    55.936 // nz -cf -t1 (v0.09 alpha)
      4867 4310953883    27.091     8.435 // lzturbo -20 -p0 (v1.2)
      4939 4371496854    46.907     7.282 // lz4x -1 (v1.60)
      4941 4375561039    26.721    10.684 // lizard -30 (v1.0.0)
      4947 2172589967 14232.804    20.310 // brotli -q 11 (v1.0.7)
      4970 4395542076    47.994    20.250 // qpress64 -L2T1 (v1.1)
      5130 3659315526  5722.456     8.793 // lzturbo -19 -p0 (v1.2)
      5186 1921561064 17200.759    27.147 // brotli -q 11 --large_window=30 (v1.0.7)
      5224 1625985127  6338.974  6519.437 // zpaq -m411 -t1 (v7.15)
      5228 2154049598  5317.170  5339.703 // bbb cf (v1)
      5267 4665488020    25.931     8.760 // lizard -20 (v1.0.0)
      5267 4666386317    26.686     4.827 // lzturbo -10 -p0 (v1.2)
      5467 1690830567  6520.194  6912.919 // lpq1 (v2)
      5529 1708824130  8236.253  6174.735 // bbb cfm329 (v1)
      5529 4883575606    34.957    50.879 // thor e1 (v0.96 alpha)
      5571 4931432031    28.371    20.632 // qpress64 -L1T1 (v1.1)
      5578 1644097084 21097.196    93.130 // rz (v1.03.7)
      5643 1644097084 21468.090    93.510 // rz -d 1023M (v1.03.7)
      5681 5034758325    18.449     7.311 // lz4 -1 (v1.9.2)
      5687 5039464624    20.596     7.684 // lizard -10 (v1.0.0)
      5808 1641713813 10036.931  6287.222 // bbb cm1000 (v1)
      5974 1639360679  7899.369  7839.332 // nz -cc -t1 (v0.09 alpha)
      6628 1951631481 24754.394   273.575 // nlzm -window:30 c
     10780 1667596745 16867.597 16993.461 // zpaq -m5 -t1 (v7.15)
     11712 1499545945 18897.682 19180.835 // zpaq -m511 -t1 (v7.15)
     15903 1638368188 37005.367 21654.024 // bbb cm1000 (v1.6)

  7. #5
    Member
    Join Date
    Dec 2011
    Location
    Cambridge, UK
    Posts
    506
    Thanks
    187
    Thanked 177 Times in 120 Posts
    That's almost (but not quite) just a sort by ascending file size.

    I'd agree it's over emphasising the ratio. bcm -9 is ~5x quicker than zpaq -m411 and just 0.4% larger, but still ranks poorer. Where is the /1000 coming from in the cpu cost part of the formula? Also maybe 90 months is a bit excessive for duration. (I find my large files tend to have smaller duration than the small ones, because if I'm running out of space then they're the ones I target when tidying up.)

  8. #6
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,982
    Thanks
    298
    Thanked 1,309 Times in 745 Posts
    > Where is the /1000 coming from in the cpu cost part of the formula?

    Thanks, it was my mistake - enwik10 benchmark has times in seconds, while LucidChart formula is for times in ms.
    I updated the table in the post above, now its not sorted by csize.

  9. Thanks:

    JamesB (17th June 2020)

  10. #7
    Member
    Join Date
    Nov 2014
    Location
    California
    Posts
    158
    Thanks
    51
    Thanked 44 Times in 33 Posts
    I would be curious to see how the rankings evolve when lambda changes in "cost(lambda) = csize + lambda * (ctime+dtime)" (same formula as used above).

  11. #8
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,982
    Thanks
    298
    Thanked 1,309 Times in 745 Posts
    > I would be curious to see how the rankings evolve when lambda changes

    You can tweak metric.pl in attach, I suppose.

    > in "cost(lambda) = csize + lambda * (ctime+dtime)" (same formula as used above).

    Well, stronger compressors would get higher or lower ranks, depending on lambda?

    They use ctime+2*dtime though.
    Attached Files Attached Files

  12. Thanks:

    hexagone (12th June 2020)

  13. #9
    Member
    Join Date
    Nov 2014
    Location
    California
    Posts
    158
    Thanks
    51
    Thanked 44 Times in 33 Posts
    cost(lambda) = cratio + lambda * (ctime + 2*dtime)

    Top 10 for several values of lambda (score, csize, ctime, dtime):

    Code:
    lambda: 0.0000001    
        0.136 1450364034  2701.335  2433.988 // mcm -x -m11 (v0.83)
        0.145 1499545945 18897.682 19180.835 // zpaq -m511 -t1 (v7.15)
        0.149 1595984275  2165.922  2160.740 // zcm -m8 -t1 (v0.93)
        0.152 1632628624  1146.133  1284.451 // bcm -9 (v1.40)
        0.153 1625985127  6338.974  6519.437 // zpaq -m411 -t1 (v7.15)
        0.153 1638441156  1030.489   640.502 // bsc -m0 -b1024 -e2 -T (v3.1.0)
        0.154 1648810080   957.511   530.447 // bsc -m0 -b1024 -e1 -T  (v3.1.0)
        0.155 1639360679  7899.369  7839.332 // nz -cc -t1 (v0.09 alpha)
        0.155 1641713813 10036.931  6287.222 // bbb cm1000 (v1)
        0.155 1644097084 21097.196    93.130 // rz (v1.03.7)
    
    lambda: 0.000001
        0.143 1450364034  2701.335  2433.988 // mcm -x -m11 (v0.83)
        0.155 1595984275  2165.922  2160.740 // zcm -m8 -t1 (v0.93)
        0.155 1638441156  1030.489   640.502 // bsc -m0 -b1024 -e2 -T (v3.1.0)
        0.156 1632628624  1146.133  1284.451 // bcm -9 (v1.40)
        0.156 1648810080   957.511   530.447 // bsc -m0 -b1024 -e1 -T  (v3.1.0)
        0.157 1675874699   781.839   198.309 // bwtturbo -59 -t0 (v20.2)
        0.158 1679319006   765.832   540.897 // turborc -26 (v20.01, Feb)
        0.159 1679330904   757.786   885.980 // turborc -26 (v20.01, Jan)
        0.162 1670916589  1890.327  2031.961 // zpaq -m311 -t1 (v7.15)
        0.162 1722407658   778.796   401.317 // m99 -b1000000000 -t1 (beta)
    
    lambda: 0.00001
        0.168 1675874699   781.839   198.309 // bwtturbo -59 -t0 (v20.2)
        0.174 1648810080   957.511   530.447 // bsc -m0 -b1024 -e1 -T  (v3.1.0)
        0.175 1679319006   765.832   540.897 // turborc -26 (v20.01, Feb)
        0.176 1638441156  1030.489   640.502 // bsc -m0 -b1024 -e2 -T (v3.1.0)
        0.176 1722407658   778.796   401.317 // m99 -b1000000000 -t1 (beta)
        0.182 1679330904   757.786   885.980 // turborc -26 (v20.01, Jan)
        0.189 1632628624  1146.133  1284.451 // bcm -9 (v1.40)
        0.191 1718776810  1267.296   896.259 // rings -m8 -t1 (v2.5)
        0.196 1852597854   403.030   948.316 // bsc -b1024 -m6 -e2 -T (v3.1.0)
        0.197 1907435970  1188.583   360.458 // nz -co -t1 (v0.09 alpha)
    
    lambda: 0.0001
        0.267 2660370879   153.103    19.993 // lzturbo -32 -p0 (v1.2)
        0.270 2339988137   372.592    73.660 // orz (v1.6.1)
        0.274 1675874699   781.839   198.309 // bwtturbo -59 -t0 (v20.2)
        0.278 2780718316   168.903    12.921 // zstd -8 (v1.4.4)
        0.278 2819369488   131.372    13.304 // zstd -7 (v1.4.4)
        0.282 2694582971   288.670    12.842 // zstd -10 (v1.4.4)
        0.283 2750214835   238.516    12.896 // zstd -9 (v1.4.4)
        0.285 2921997106    96.765    14.029 // zstd -6 (v1.4.4)
        0.289 2669751039   374.093    13.039 // zstd -11 (v1.4.4)
        0.289 2993531459    72.476    14.123 // zstd -5 (v1.4.4)
    
    lambda: 0.001
        0.360 3072049859    46.019    14.061 // zstd -4 (v1.4.4)
        0.362 3137189321    43.010    13.343 // zstd -3 (v1.4.4)
        0.363 3292420658    37.645     9.609 // zhuff -c0 -t1 (v0.99beta)
        0.369 3325794738    33.481    12.952 // zstd -2 (v1.4.4)
        0.380 2993531459    72.476    14.123 // zstd -5 (v1.4.4)
        0.386 3638532812    23.309    11.745 // zstd -1 (v1.4.4)
        0.390 3149492468    54.666    20.956 // lzturbo -31 -p0 (v1.2)
        0.397 2921997106    96.765    14.029 // zstd -6 (v1.4.4)
        0.406 3496587503    43.607    18.271 // lzturbo -30 -p0 (v1.2)
        0.413 3145771456    50.885    34.649 // etincelle (RC2)
    
    lambda: 0.01
        0.798 4666386317    26.686     4.827 // lzturbo -10 -p0 (v1.2)
        0.800 5034758325    18.449     7.311 // lz4 -1 (v1.9.2)
        0.807 3638532812    23.309    11.745 // zstd -1 (v1.4.4)
        0.829 5039464624    20.596     7.684 // lizard -10 (v1.0.0)
        0.841 4310953883    27.091     8.435 // lzturbo -20 -p0 (v1.2)
        0.869 4665488020    25.931     8.760 // lizard -20 (v1.0.0)
        0.875 3292420658    37.645     9.609 // zhuff -c0 -t1 (v0.99beta)
        0.888 4375561039    26.721    10.684 // lizard -30 (v1.0.0)
        0.904 3325794738    33.481    12.952 // zstd -2 (v1.4.4)
        0.916 3909521247    32.603    11.287 // lizard -40 (v1.0.0)

  14. Thanks:

    JamesB (17th June 2020)

  15. #10
    Member SolidComp's Avatar
    Join Date
    Jun 2015
    Location
    USA
    Posts
    353
    Thanks
    131
    Thanked 54 Times in 38 Posts
    Wait, what lambda are you guys talking about?

  16. #11
    Member SolidComp's Avatar
    Join Date
    Jun 2015
    Location
    USA
    Posts
    353
    Thanks
    131
    Thanked 54 Times in 38 Posts
    I wonder how it would come out if they used a dictionary with Zstd.

  17. #12
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,982
    Thanks
    298
    Thanked 1,309 Times in 745 Posts
    > I wonder how it would come out if they used a dictionary with Zstd.

    zstd results would be (29453/28942-1)*100=1.76% smaller?

  18. #13
    Member SolidComp's Avatar
    Join Date
    Jun 2015
    Location
    USA
    Posts
    353
    Thanks
    131
    Thanked 54 Times in 38 Posts
    Quote Originally Posted by Shelwien View Post
    > I wonder how it would come out if they used a dictionary with Zstd.

    zstd results would be (29453/28942-1)*100=1.76% smaller?

    What? Where are you getting those numbers? It was much bigger than 1.76% savings in compressing jQuery.

  19. #14
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,982
    Thanks
    298
    Thanked 1,309 Times in 745 Posts
    That was the improvement using brotli dictionary.

    "It was much bigger" only when you used jQuery itself for the dictionary - that doesn't count.

  20. #15
    Member SolidComp's Avatar
    Join Date
    Jun 2015
    Location
    USA
    Posts
    353
    Thanks
    131
    Thanked 54 Times in 38 Posts
    Quote Originally Posted by Shelwien View Post
    That was the improvement using brotli dictionary.

    "It was much bigger" only when you used jQuery itself for the dictionary - that doesn't count.
    Hah! I forgot about that. With this kind of JSON use case, I think the dictionary would be pretty tight. I mean there would be standard field/key names that would be repeated in all files or streams, and probably lots of common values. But I guess those matches would be happening without a dictionary. It would be interesting to find out. The use case I was thinking of most was JSON in REST payments APIs. So it would be small data and very repetitive, like "account number": "nnnnnnnnnnn", and so forth.

  21. #16
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,982
    Thanks
    298
    Thanked 1,309 Times in 745 Posts
    With 1.76% dictionary fix for non-brotli csize it actually becomes like this:
    Code:
    0.016054766361996;MessagePack+XZ (2);12877746.4866737;2850.3;2965.3;$0.01631
    0.0160921190684587;JSON+XZ (6);12684482.7515024;5850.2;2194.5;$0.01634
    0.0161333338765781;MessagePack+XZ (6);12513336.5098292;5831.7;2872.7;$0.01638
    0.0161439361758531;MessagePack+XZ (1);13026299.6320918;2581.3;2876.1;$0.01640
    0.0161447936241505;MessagePack+Zstandard (15);13262693.8552949;2209.1;2303.3;$0.01641
    0.016174707407867;MessagePack+XZ (3);12832367.6947;4001.5;2878.5;$0.01643
    0.0161920949616163;MessagePack+XZ (5);12620913.1367263;5471.3;2874.3;$0.01644
    0.0162107157912786;JSON+Brotli (10);12358964;8852.6;2080.5;$0.01621
    0.0162573519920186;JSON+XZ (5);12918483.2385156;5326.6;2174.7;$0.01651
    0.0162614136776384;MessagePack+Zstandard (12);13526632.7497369;1376.9;2202.5;$0.01653
    0.0162892039243166;JSON+Zstandard (15);13508140.2532849;2325.6;1867.1;$0.01656
    0.0162921999107508;MessagePack+Zstandard (9);13570086.5297253;1077.3;2300.3;$0.01656
    Also using XZ instead of plain LZMA is dumb. LZMA has options (lc8 lp0 pb0 fb273 mainly) that improve text compression,
    while XZ is a much more limited wrapper.

  22. #17
    Member
    Join Date
    Nov 2014
    Location
    California
    Posts
    158
    Thanks
    51
    Thanked 44 Times in 33 Posts
    Wait, what lambda are you guys talking about?
    It is a common way to compute a cost function by adding an extra free parameter (see the formula). Based on the value, one can balance the cost due to the compression ratio with the cost due the comp/decomp speed. If you look at the numbers, with a small lambda value, the compressors with the lowest costs are the ones that compress the most (mostly ignoring speed). With an intermediate values, you see more balanced compressors at the top and finally with higher values, the fastest compressors win. So, it allows one to pick the "best" compressors for a specific scenario.

  23. Thanks:

    Bulat Ziganshin (15th June 2020)

  24. #18
    Member SolidComp's Avatar
    Join Date
    Jun 2015
    Location
    USA
    Posts
    353
    Thanks
    131
    Thanked 54 Times in 38 Posts
    Shelwien, I don't understand your dictionary fix. What exactly is the 1.76%? That's brotli with dictionary vs. brotli without dictionary? How did you alter the results?

    If brotli wins, brotli wins. It doesn't matter if it uses a dictionary while others don't – that's their problem. Unless the brotli dictionary can be used by other compressors in production environments, which I wasn't aware of. You posted that brotli dictionary file on the other thread as though I could do something with it, so that suggests that maybe Zstd can use it.

    I don't understand what you did to the results to make XZ win and dominate. As you said, XZ isn't that great. Did you just add 1.76% to the brotli sizes, or shave the others by that amount? How is that valid? Whatever you did needs to reflect the actual reality of using a given compression codec in a production environment.

  25. #19
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,982
    Thanks
    298
    Thanked 1,309 Times in 745 Posts
    > Shelwien, I don't understand your dictionary fix. What exactly is the 1.76%?
    > That's brotli with dictionary vs. brotli without dictionary?

    That's zstd vs zstd -D using brotli dictionary.

    > How did you alter the results?

    Like this:
    $csize*=28942/29453 unless $name=~/brot/i;


    > If brotli wins, brotli wins.
    > It doesn't matter if it uses a dictionary while others don't – that's their problem.

    However in fact, its usually a problem of fake advertisement and
    ignorance of benchmark organizers.

    > Unless the brotli dictionary can be used by other compressors
    > in production environments, which I wasn't aware of.

    Yes, some compressors (mainly zstd) do have explicit options for external dictionary usage.

    But even when they don't, its not really an unique feature.
    It can be easily simulated for most codecs via csize(dict+file)-csize(dict)).
    In addition, PPM/CM codecs can use precomputed statistics (eg. see ppmtrain).

    So its just not fair to only allow brotli to rely on this trick.
    And in fact, the trick goes even further - brotli has transformation rules
    for the dictionary (capital conversion etc) so the effective dictionary
    is not really dictionary.bin, but something that has to be specially generated
    to make the comparison with other codecs really fair.

    > You posted that brotli dictionary file on the other thread as though
    > I could do something with it, so that suggests that maybe Zstd can use it.

    Yes, there're even multiple ways.

    > I don't understand what you did to the results to make XZ win and dominate.
    > As you said, XZ isn't that great. Did you just add 1.76% to the brotli sizes,
    > or shave the others by that amount? How is that valid?

    Sure, its an approximation, but to get precise results I'd have to rebuild
    their whole benchmark with all other codecs used with brotli dictionary.
    I guess we can just rebuild only brotli results without dictionary,
    but it would be hard to justify.

    Despite defaults and XZ overhead, LZMA is still a more advanced format than brotli,
    since it uses adaptive order1 arithmetic coding vs brotli's static huffman.
    Also brotli encoding is very slow and this benchmark also takes the speed into account.

    > Whatever you did needs to reflect the actual reality
    > of using a given compression codec in a production environment.

    Actual reality is that most codecs don't use best parameter profiles by default,
    and also don't have an integrated dictionary.

    And if we really care about actual performance, it should be a good idea to
    compare apples to apples, even if it requires reading parameter description for each codec.

  26. Thanks:

    SolidComp (16th June 2020)

  27. #20
    Member SolidComp's Avatar
    Join Date
    Jun 2015
    Location
    USA
    Posts
    353
    Thanks
    131
    Thanked 54 Times in 38 Posts
    If 7z is more advanced, why does it double this 107 bytes of data, while brotli shaves it down to 91 bytes? (This is my Positional Data Syntax, where order indicates field/key names).

    123456789012
    AB-11-9876
    6011000995500000
    1218123
    0
    TOM JONES
    123 MAIN STREET
    New York
    NY
    US
    55555
    Y
    E
    n
    Y
    Y

  28. #21
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,982
    Thanks
    298
    Thanked 1,309 Times in 745 Posts
    > If 7z is more advanced, why does it double this 107 bytes of data, while brotli shaves it down to 91 bytes?

    Mainly because you're comparing apple to oranges again?
    I mean, brotli outputs raw brotli format, while 7z creates archive formats with header and index.
    Raw optimized deflate output (zopfli or kzip+deflopt + rawdet) is 93 bytes.
    Raw lzma (without .7z or .lzma header) is 94 bytes, plzma is 89 bytes, ppmd 81 bytes.

    ...brotli with zeroed-out dictionary = 97 bytes.

  29. Thanks:

    algorithm (17th June 2020)

  30. #22
    Member
    Join Date
    Jun 2015
    Location
    Switzerland
    Posts
    900
    Thanks
    246
    Thanked 326 Times in 199 Posts
    In my back of the envelope calculation if there are more than 12 users downloading the data, brotli 11 is worth it -- if the computation can be done without being part of the latency chain.

  31. #23
    Member
    Join Date
    Jun 2015
    Location
    Switzerland
    Posts
    900
    Thanks
    246
    Thanked 326 Times in 199 Posts
    brotli got quite a lot worse after zopflication for very short data -- if you take an early brotli version from 2014, you'd see perhaps 20 bytes less for 100 byte compression range.

  32. #24
    Member
    Join Date
    Jun 2015
    Location
    Switzerland
    Posts
    900
    Thanks
    246
    Thanked 326 Times in 199 Posts
    >And in fact, the trick goes even further - brotli has transformation rules

    The transformation code is very very short and it more than pays back its added complexity -- I'd recommend it for others to use, too. We get about 8 % improvement from the dictionary on a web corpus and about 1 % is from the transformations.

    Also it has to be noted that perhaps 2 % is because how the dictionary is coded -- every combo of (length,distance) has a different word associated with it, the static dictionary is not available just as bytes.

    I think it would be great if Yann added the exact same mechanism to zstd as this dictionary is readily available in many environments (Windows OS, Apple iOS, Firefox, Chrome, Safari, Android)

  33. #25
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,982
    Thanks
    298
    Thanked 1,309 Times in 745 Posts
    > The transformation code is very very short and it more than pays back its added complexity -- I'd recommend it for others to use, too.

    Sure, but it'd be nice to at least have the whole set of brotli's reference data for codec benchmarks -
    as concatenated versions of dictionary with applied transformations, if necessary.
    Of course, brotli can use its dictionary more efficiently, but its less of a problem than comparing vs no dictionary at all.

  34. #26
    Member SolidComp's Avatar
    Join Date
    Jun 2015
    Location
    USA
    Posts
    353
    Thanks
    131
    Thanked 54 Times in 38 Posts
    Quote Originally Posted by Shelwien View Post
    > If 7z is more advanced, why does it double this 107 bytes of data, while brotli shaves it down to 91 bytes?

    Mainly because you're comparing apple to oranges again?
    I mean, brotli outputs raw brotli format, while 7z creates archive formats with header and index.
    Raw optimized deflate output (zopfli or kzip+deflopt + rawdet) is 93 bytes.
    Raw lzma (without .7z or .lzma header) is 94 bytes, plzma is 89 bytes, ppmd 81 bytes.

    ...brotli with zeroed-out dictionary = 97 bytes.
    How do you generate raw LZMA? Is that an option in the 7-Zip utility?

    I don't know why you zero out brotli's dictionary. That's like removing one wheel from a Corvette and saying "See it's not so fast with three wheels." Brotli has a dictionary – that's just the way it is. If others don't use dictionaries, that's a flaw of their choosing. I can only use codecs as they exist in the world, compiled for end-users.

  35. #27
    Member SolidComp's Avatar
    Join Date
    Jun 2015
    Location
    USA
    Posts
    353
    Thanks
    131
    Thanked 54 Times in 38 Posts
    I'm not familiar with kzip. When I look for it it seems to be related to KDE or Ken Silverman. I'll try it. I thought Pavlov's was the best gzipper. It was amazing on jQuery, but that might have been a fluke. kzip looks old – I'm surprised that it hasn't been beat by now.

  36. #28
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,982
    Thanks
    298
    Thanked 1,309 Times in 745 Posts
    > How do you generate raw LZMA? Is that an option in the 7-Zip utility?

    No, there's a separately distributed LZMA SDK which includes "lzma" utility.
    But that's not quite enough either, since it encodes to ".lzma" format,
    which has a 14-byte header.
    Here I attached a patched version of raw LZMA codec with minimized overhead.
    Unfortunately it outputs 95 bytes rather than 94, because 94 requires
    also a modified rangecoder flush - I did it with my plzma coder.

    > I don't know why you zero out brotli's dictionary.

    To see how much its result depends on the dictionary.

    > If others don't use dictionaries, that's a flaw of their choosing.

    No, the problem here is that we can use dictionaries with other codecs too,
    but not exactly the same one as brotli's, because brotli dictionary is
    not just a raw block of bytes, but a set of strings sorted by length (afaik).

    So if I'd use a different dictionary with other codec, and it improves
    compression, then it would be also unfair, because brotli utility
    doesn't have an option to use a random file as external dictionary.

    Currently the best approach would be to use zstd --train on your records,
    then zstd -D to compress the records. But that also doesn't guarantee
    that zstd is the best codec for this.
    (The best one would be likely ppmd_dict with zstd's custom dictionary).

    > I can only use codecs as they exist in the world, compiled for end-users.

    Well, your task with random access to compressed small blocks implies
    that there'd be a custom solution based on some codec library.
    Such things are not implemented with standard console utilities.

    > I thought Pavlov's was the best gzipper.

    Its the best practical one - it supports stream processing and MT,
    Kzip and some other optimizers are not very practical - they would
    keep iteratively compressing the file until there's no more improvements,
    which can take a lot of time.

    > It was amazing on jQuery, but that might have been a fluke.
    > kzip looks old – I'm surprised that it hasn't been beat by now.

    There's zopfli: https://github.com/google/zopfli
    Also there're a couple of extra utilities - deflopt and defluff,
    which can sometimes improve the results.
    Attached Files Attached Files

Similar Threads

  1. Replies: 12
    Last Post: 3rd March 2020, 22:29
  2. Replies: 45
    Last Post: 25th November 2016, 03:30
  3. Replies: 69
    Last Post: 16th August 2016, 18:46
  4. paq8px Model For Video Compression
    By davidwjx in forum Data Compression
    Replies: 6
    Last Post: 8th December 2014, 02:32
  5. CM model
    By toffer in forum Forum Archive
    Replies: 40
    Last Post: 26th January 2008, 06:59

Tags for this Thread

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •