Page 2 of 4
Results 31 to 60 of 93

Thread: Data compression explained

  1. #31
    Administrator Shelwien
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    4,133
    Thanks
    320
    Thanked 1,396 Times in 801 Posts
    Here's another DCE diff.
    Major changes:
    - Table of contents added
    - section "6.1.6.3. WinZIP" added (about the WinZip JPEG codec)

    Code:
    17c17
    < </font></p><p><font face="sans-serif">Last update: Mar. 5, 2010.
    ---
    > </font></p><p><font face="sans-serif">Last update: Mar. 10, 2010.
    
    [...index skipped...]
    
    78c133,137
    <  There is no general procedure for finding good models. There is no algorithm
    ---
    >  Specifically, any string x has probability (about) 2<sup>-|M|</sup> where
    >  M is the shortest possible description of x, and |M| is the length of M in bits,
    >  almost independent of the language in which M is written.
    >  However there is no general procedure for finding M or even estimating |M|
    >  in any language. There is no algorithm
    
    
    243c306,307
    < </font></p><p><font face="sans-serif">Because optimal modeling is not computable, neither is optimal compression.
    ---
    > </font></p><p><font face="sans-serif">Because determining the length of the shortest descriptions of strings
    > is not computable, neither is optimal compression.
    
    
    549c623
    < by N. F. Antonio,
    ---
    > by Nania Francesco Antonio,
    
    
    577c653,654
    < that calculates an optimal assignment over an alphabet of n symbols in O(n) time. 
    ---
    > that calculates an optimal assignment over an alphabet of n symbols in
    > <i>O</i>(n log n) time. 
    
    
    647c725
    < This effectively constrains the model to probabilities that are multiples of 1/2.
    ---
    > This effectively constrains the model to probabilities that are powers of 1/2.
    
    
    733a812,816
    > </font></p><p><font face="sans-serif">The assertions are not necessary to make the code work. 
    > Rather, they are useful because
    > they make the programmer's assumptions explicit in the form of self documenting
    > run time tests. The code is compiled with -DNDEBUG to remove them after debugging.
    
    
    3519a3673,3679
    > <p><font face="sans-serif">Even though JPEG is already compressed, there are lossless algorithms
    > that can improve the compression by 20% to 30% and decompress to the original
    > JPEG image.
    
    
    3543,3544c3703,3704
    < is deterministic, so the result is bitwise identical. The patent is
    < owned by Lovato and Yaakov Gringeler, deleloper of the (now unsupported)
    ---
    > is deterministic, so the result is bitwise identical. The patent was issued to
    > Lovato and Yaakov Gringeler. Gringeler was the deleloper of the (now unsupported)
    
    
    3547,3548c3707,3712
    < </font></p><p><font face="sans-serif">PAQ versions beginning with <a href="paq.html#paq7">PAQ7
    < in Dec. 2005 also include a JPEG model. Both programs compress to about the same
    ---
    > 
    > </font></p><h5><font face="sans-serif">6.1.6.2. PAQ</font></h5>
    > 
    > <p><font face="sans-serif">PAQ versions beginning with <a href="paq.html#paq7">PAQ7
    > in Dec. 2005 also include a JPEG model for baseline images (but not the less common
    > progressive mode). Both PAQ and Stuffit compress to about the same
    
    
    3581c3745,3750
    < in the DCT equation above.
    ---
    > in the DCT equation above, for example in the horizontal direction:
    > </font></p><blockquote>
    > <font face="sans-serif">  S<sub>u</sub> = α(u) Σ<sub>x=0..7</sub>
    >   S<sub>x</sub> cos[π/8 (x+1/2) u],<br>
    >   where α(0) = 1/8<sup>1/2</sup>, α(u) = 1/4, u > 0.
    > </font></blockquote>
    
    
    3605a3775,3980
    > </font></p><h5><font face="sans-serif">6.1.6.3. WinZIP</font></h5>
    [...]
    
    4030,4031c4410,4412
    < <p><font face="sans-serif">I thank Sami Runsas, David A. Scott, Christopher Mattern, Jan Ondrus,
    Aki Jäntti, Yan King Yin, and Eugene Shelwien
    ---
    > <p><font face="sans-serif">I thank Szymon Grabowski, Aki Jäntti, 
    > Christopher Mattern, Jan Ondrus, Friedrich Regen, Steve Richfield, Sami Runsas,
    > David A. Scott, Eugene Shelwien, Ali Imran Khan Shirani, Yan King Yin, and Bulat Ziganshin.

  2. #32
    Shelwien

  3. #33
    Programmer Jan Ondrus
    Join Date
    Sep 2008
    Location
    Rychnov nad Kněžnou, Czech Republic
    Posts
    279
    Thanks
    33
    Thanked 138 Times in 50 Posts
    Quote Originally Posted by Matt Mahoney View Post
    Yeah, I'll have to add all that too. Meanwhile here is a description of the WinZIP JPEG algorithm (section 6.1.6.3)

    http://mattmahoney.net/dc/dce.html
    I think there should be:

    sum(5,6) = |S57/Q57| = |S66/Q66| = |S67/Q67| = |S76/Q75| = |S77/Q77| where Quv is the quantization scale factor of Suv.
    ->
    sum(5,6) = |S57/Q57| + |S66/Q66| + |S67/Q67| + |S76/Q75| + |S77/Q77| where Quv is the quantization scale factor of Suv.

    and

    avg(u,v) is the average of |Uu,v|, |Uu,v-1|, |Uu-1,v|, |Uu-1,v-1|, and 4|Luv|.
    ->
    avg(u,v) is the average of |Uu,v|, |Uu,v-1|, |Uu-1,v|, |Uu-1,v-1|, |Lu,v|, |Lu,v-1|, |Lu-1,v| and |Lu-1,v-1|.
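    The two corrections above can be sketched in Python. This is only an illustration of the corrected formulas, not WinZip's code: the names S, Q, U, L follow the post, the 8x8 indexing is an assumption, and the Q75 in the post looks like a typo, so Q[7][6] is assumed for S[7][6].

```python
def sum_5_6(S, Q):
    # Corrected sum(5,6): the scaled coefficient magnitudes are ADDED,
    # not equated. Positions follow the post.
    pos = [(5, 7), (6, 6), (6, 7), (7, 6), (7, 7)]
    return sum(abs(S[u][v] / Q[u][v]) for u, v in pos)

def avg(Ub, Lb, u, v):
    # Corrected avg(u,v): a plain average of eight magnitudes, four from
    # the upper block Ub and four from the left block Lb (not 4*|L[u][v]|).
    pos = [(u, v), (u, v - 1), (u - 1, v), (u - 1, v - 1)]
    vals = [abs(B[a][b]) for B in (Ub, Lb) for a, b in pos]
    return sum(vals) / len(vals)
```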

    Quote Originally Posted by Matt Mahoney View Post
    There are a few other strangenesses, like the avg() function not being symmetric with respect to the 2 neighboring blocks
    I think it is symmetric.

    And you can read more about my modifications to the PAQ JPEG model here (if you understand the Czech language).

  4. #34
    Expert Matt Mahoney
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,257
    Thanks
    307
    Thanked 798 Times in 489 Posts
    Thanks for catching that mistake in sum(). I think that avg() is right even though it looks strange. I think that it is a bug in winzip and they didn't discover it until it was released and too late to change it without breaking compatibility, so they put it in the spec instead. In http://www.winzip.com/wz_jpg_comp.pdf sec. 5.6.2.2 it says

    sum += (abs(Bn[x]) + abs(Bw[k])) * S[x] / S[k]

    which I think ought to be

    sum += abs(Bn[x]) + abs(Bw[x])

    because otherwise abs(Bw[k]) is added 3 times (once for each x). If they do it that way then they should also add

    sum += (abs(Bn[k]) + abs(Bw[x])) * S[x] / S[k]

    to treat the N and W blocks symmetrically. It probably doesn't hurt compression too much though.

    Also, I cited your thesis, but I can't read Czech. If it looks wrong let me know. I wasn't sure if I should include the subtitle. I described your JPEG model from looking at the code.

    Also, I added some new stuff to the benchmark section 2. I added some analysis of the Calgary corpus, especially book1, pic, and geo, and a FV plot of enwik9.

  5. #35
    Jan Ondrus
    Quote Originally Posted by Matt Mahoney View Post
    Thanks for catching that mistake in sum(). I think that avg() is right even though it looks strange. I think that it is a bug in winzip and they didn't discover it until it was released and too late to change it without breaking compatibility, so they put it in the spec instead. In http://www.winzip.com/wz_jpg_comp.pdf sec. 5.6.2.2 it says

    sum += (abs(Bn[x]) + abs(Bw[k])) * S[x] / S[k]

    which I think ought to be

    sum += abs(Bn[x]) + abs(Bw[x])
    Yes, I think they meant
    sum += (abs(Bn[x]) + abs(Bw[x])) * S[x] / S[k]
    instead of
    sum += (abs(Bn[x]) + abs(Bw[k])) * S[x] / S[k]

    It looks like a typo in the specification, because they say above:

    AVG(Bn, Bw, k) is defined as the average of coefficient absolute values at k and the coefficient absolute values at positions directly above, to the left and to the upper-left of k for both the North and West blocks.

  6. #36
    Matt Mahoney
    Without source code we'll never know

  7. #37
    Jan Ondrus
    Quote Originally Posted by Matt Mahoney View Post
    Also, I cited your thesis, but I can't read Czech. If it looks wrong let me know. I wasn't sure if I should include the subtitle.
    It looks wrong.

    Název práce: Bezztrátová komprese JPEG grafiky
    Autor: Jan Ondruš
    Katedra (ústav): Katedra softwarového inženýrství

    translates to

    Title: Lossless compression of JPEG images
    Author: Jan Ondruš
    Department: Department of Software Engineering

  8. #38
    Matt Mahoney
    Thanks for the translation. This should be better, I hope. http://mattmahoney.net/dc/dce.html#ref

    (BTW, I finally uploaded a picture of me. It is on the summit of Mt. La Plata in Colorado at 4300m).

  9. #39
    Programmer osmanturan
    Join Date
    May 2008
    Location
    Mersin, Turkiye
    Posts
    651
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Recently, I checked it with the W3C validator service, and look what I found:

    http://validator.w3.org/check?uri=ht...Inline&group=0

    You may not care whether it's valid or not, but I think 1082 errors and 259 warnings are worth considering, because some search engines may run into trouble while parsing the page.

    So, I've fixed it myself. The changes are too extensive to submit as differences only.

    Frequently repeated mistakes are:
    - Most elements (especially p and li) are not closed.
    - The <pre> element expects only raw text (no other elements within it). So, I added a CSS class (preformatted) as a workaround and replaced <pre> with <p class="preformatted"> where needed.
    - Element attributes are written without quotes.
    - Some deprecated elements (e.g. center) are used.
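    The first kind of mistake (unclosed elements) can be spotted crudely with Python's standard html.parser. This is only a sketch that counts <p> open/close events, not a validator:

```python
from html.parser import HTMLParser

class OpenTagChecker(HTMLParser):
    # Counts <p> start and end tags; a positive balance after feeding the
    # document means at least one <p> was never explicitly closed.
    def __init__(self):
        super().__init__()
        self.open_p = 0

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.open_p += 1

    def handle_endtag(self, tag):
        if tag == "p":
            self.open_p -= 1

checker = OpenTagChecker()
checker.feed("<p>first paragraph <p>second, never closed</p>")
# checker.open_p == 1 here: one <p> without a matching </p>
```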
    Attached Files
    BIT Archiver homepage: www.osmanturan.com

  10. #40
    Matt Mahoney
    Thanks. Here is the new version. http://mattmahoney.net/dc/dce.html

  11. #41
    Member Raymond_NGhM
    Join Date
    Oct 2008
    Location
    UK
    Posts
    51
    Thanks
    0
    Thanked 0 Times in 0 Posts

    Implementation of JPEG2000 LLC based on LZMA

    Hi Mr.Matt Mahoney

    When I was looking at "Data compression explained", one thing came to my mind:
    some years ago I found this reference on the web...
    at least it's a new thing for me... because we can't find this implementation in
    the latest known compressors or archivers...

    It could also be rewritten with a CM method, but I think LZMA is standard and faster
    for compression/decompression, the same as in WinZip JPEG compression.
    Maybe in the future this one will be added to compressors (HOWEVER, in the future).

    It would be good, and interesting, to see this whitepaper in DCE:
    Attached Files

  12. #42
    Matt Mahoney
    Thanks. I need to add a section on JPEG2000 and wavelets and then I can use your reference.

    BTW it looks like this method could be used for regular JPEG too.

  13. #43
    Matt Mahoney
    Added a short section on PackJPG. http://mattmahoney.net/dc/dce.html#Section_6164

  14. #44
    The Founder encode
    Join Date
    May 2006
    Location
    Moscow, Russia
    Posts
    4,023
    Thanks
    415
    Thanked 416 Times in 158 Posts
    HA by Harri Hirvola is an order-4 PPM - check out the source. Also, HA is not PPMC - it has some tricks in its escape calculation, a la the first attempts at SEE.

    In addition, almost all of the examples are based on your software. For example, the BWT coder you provided is BBB, with its results. At the same time we have BCM (0.09, 0.10), which is both faster and provides more compression. Compare the results of BCM 0.09 and BBB on calgary.tar and then tell us about bad binary performance...

  15. #45
    Matt Mahoney
    Thanks. I plan to do some more benchmarks on the Calgary corpus and include most compressors that are currently maintained like BCM. However I can't give detailed explanations of different algorithms unless there is a paper of some kind describing the algorithm. (I prefer not to have to read the code). There are still some papers I need to read on BWT.

  16. #46
    Matt Mahoney
    http://mattmahoney.net/dc/dce.html

    4.2.2. Removed references to HA using PPM model C.

    1.3. Added proofs that there is no test for randomness. I found a wonderfully simple proof in http://www.scholarpedia.org/article/...eteness_result

    It is a stronger variation of Goedel's incompleteness proof. In any formal system there are only a finite number of strings that can be proven to be random, in the sense that K(x) >= |x|, i.e. the shortest description of string x (the shortest program that outputs x) is at least as long as x. Since there are an infinite number of strings, and most of them (also an infinite number) are random, there are an infinite number of true statements that cannot be proven (Goedel only proved there must be at least one).

    Proof: pick a large number n, say one million. Most strings n bits long are random. Assume there is at least one of these that you could prove to be random. Then you could enumerate all proofs in your formal system (a set of axioms and rules of inference) and describe "the first string x proven to be n bits long and random" even though this description is less than n bits long. So the assumption must be wrong.
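    The counting behind "most strings are random" can be checked directly. A sketch of the arithmetic (n and k are chosen arbitrarily here):

```python
# There are 2^n strings of n bits, but only 2^n - 1 binary descriptions
# shorter than n bits (2^0 + 2^1 + ... + 2^(n-1)), so at least one n-bit
# string has no description shorter than itself, i.e. is random.
n = 100
assert sum(2 ** k for k in range(n)) == 2 ** n - 1

# Stronger: strings compressible by at least k bits need descriptions of
# at most n - k bits, so at most 2^(n-k+1) - 1 of them exist -- fewer
# than a 2^(1-k) fraction of all n-bit strings.
k = 10
compressible = 2 ** (n - k + 1) - 1
assert compressible * 2 ** (k - 1) < 2 ** n
```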

  17. #47
    Matt Mahoney
    Added some more Calgary corpus test results.
    http://mattmahoney.net/dc/dce.html#Section_214

  18. #48
    Shelwien
    Btw, here's an alternate scan of the pic image:
    http://nishi.dreamhosters.com/u/pics.rar

    I don't remember where I got it (Vadim, do you remember anything?),
    but it's a cleaner scan which can be compressed better.

    Also a few book1 versions are available too:
    http://nishi.dreamhosters.com/u/book1s.rar
    cleaner and more compressible as well.
    There's even an mp3 version, but we haven't reached that level yet,
    unfortunately

  19. #49
    Matt Mahoney
    Added a section on MSufSort.
    http://mattmahoney.net/dc/dce.html#Section_555

  20. #50
    Member biject.bwts
    Join Date
    Jun 2008
    Location
    texas
    Posts
    449
    Thanks
    23
    Thanked 14 Times in 10 Posts
    Matt, will you mention BWTS, since Yuta has a routine that uses 5n space and is linear in time? It's the BWT without the index. Yes, for long files there is not much difference, but for short files there can be a large difference.

  21. #51
    Matt Mahoney
    I plan to add Yuta Mori's divsufsort after I read about it. I think that mcomp, bcm, and bsc use it. There are a huge number of algorithms for suffix sorting.

    I just now added a section on Itoh-Tanaka used (I think) by dark and bsc. I still need to look at the source code so I hope that is not wrong. http://mattmahoney.net/dc/dce.html#Section_556

    Also I should mention that it is possible to compute a BWT with only a small portion of the suffix array in memory at one time by making multiple passes over the block. I guess that is how bwmonstr uses 0.58n memory to compress enwik9 without temporary files, but Sami hasn't given details.

    BWTS is interesting so I need to add that too.
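    As an illustration of what these suffix sorting routines compute, here is a naive BWT and its inverse in a few lines of Python. This is only a sketch: real implementations like divsufsort build the suffix array in O(n log n) time or better instead of sorting whole rotations.

```python
def bwt(s):
    # Naive BWT: append a unique sentinel, sort all rotations, and keep
    # the last column. Real tools replace this sort with a suffix array.
    s += "\0"
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(r[-1] for r in rotations)

def ibwt(last):
    # Naive inverse: repeatedly prepend the last column and re-sort,
    # rebuilding the sorted rotation table one column at a time.
    table = [""] * len(last)
    for _ in range(len(last)):
        table = sorted(last[i] + table[i] for i in range(len(last)))
    row = next(r for r in table if r.endswith("\0"))
    return row[:-1]
```

    For example, `ibwt(bwt("BANANA"))` returns `"BANANA"`.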

  22. #52
    Programmer Gribok
    Join Date
    Apr 2007
    Location
    USA
    Posts
    162
    Thanks
    0
    Thanked 13 Times in 2 Posts
    Quote Originally Posted by Matt Mahoney View Post
    I plan to add Yuta-Mori (divsufsort) after I read about it. I think that mcomp, bcm, and bsc use it. There are a huge number of algorithms for suffix sorting.
    bsc uses libdivsufsort for the forward BWT transform. I have my own implementation of the reverse BWT and all ST transform functions.
    Enjoy coding, enjoy life!

  23. #53
    encode
    Older versions of BCM make use of divsufsort. Newer versions use their own BWT stage, although some parts of the code are still based on divsufsort. Of course, the UNBWT is my own...

    Why reinvent the wheel(er)?

  24. #54
    Matt Mahoney
    Szymon Grabowski pointed me to this paper by Manfred Kufleitner on BWTS. Apparently you can make the Schindler transform bijective too. http://www.stringology.org/event/2009/p07.html

    The effect on compression is small, though.

    244,106 book1.bwt.fpaq0p
    244,096 book1.bwts.fpaq0p
    985,906 calgary.bwt.fpaq0p
    985,885 calgary.bwts.fpaq0p

    220,338 book1.bwt.fpaq0f2
    220,344 book1.bwts.fpaq0f2
    845,526 calgary.bwt.fpaq0f2
    845,549 calgary.bwts.fpaq0f2

    For bwt I used msufsort. For bwts I used bwts (David A. Scott's mod of Mark Nelson's bwt), recompiled with BLOCKSIZE changed from 1000000 to 4000000 so it is big enough to hold calgary.tar.

  25. #55
    Matt Mahoney
    Added a brief description of divsufsort. Still need to add BWTS. http://mattmahoney.net/dc/dce.html#Section_557

  26. #56
    Matt Mahoney
    Added sections on MSufSort v3 (completely different algorithm than v2) and bijective BWT. http://mattmahoney.net/dc/dce.html#Section_558

  27. #57
    biject.bwts
    Quote Originally Posted by Matt Mahoney View Post
    Added sections on MSufSort v3 (completely different algorithm than v2) and bijective BWT. http://mattmahoney.net/dc/dce.html#Section_558
    Thanks for adding in BWTS. However, there is at least one error
    that all would agree on: fpaq0s was never meant to be bijective.
    In fact it is not bijective for most files. It was just a simple change
    to eliminate some of the waste in the simple version, where one is testing
    for the end of file. A true bijective version does exist, and it is more
    complicated than the one you mention, namely arb255.

    I disagree with some other statements, but not much.
    One is that BWTS does offer a slight savings on average,
    but it's only a couple of bytes. Of course I am only talking
    about the average, where for long files many would have to be
    tested to see any difference at all. In fact, about 50% of big
    files are smaller with BWTS and 50% are smaller using BWT.
    However, when one looks at short files, say only a few hundred
    bytes, the difference starts to add up and BWTS
    wins. Of course one could cherry-pick files to show the
    opposite.

    Your statement:
    "the factorization into Lyndon words has the effect of compressing each word independently. Generally this would hurt compression, but it might help for mixed data types (where BWT does poorly) by finding segment boundaries."

    is not true; the words are not compressed independently. For example, in a regular
    BWT the rotations of "cats" starting at "ats" would sort together, and
    if the file is long and only "c" precedes "ats", then all the c's of "cats" would be
    together in the output file. When doing a BWTS on the same
    long file, if a Lyndon word boundary does not split the word
    "cats", then just as in the BWT version all these c's would appear
    together in the output; the fact that they are in separate Lyndon words is
    of no concern. I don't know the exact number of Lyndon words in a file,
    but it's roughly log(n) Lyndon words per file, where n is the number of bytes.

    But thanks for putting it in.

  28. #58
    Matt Mahoney
    Thanks for clarifying. I think I have it correct now. I also added an example of the steps in BWTS and its inverse. http://mattmahoney.net/dc/dce.html#Section_559

  29. #59
    biject.bwts
    Quote Originally Posted by Matt Mahoney View Post
    Thanks for clarifying. I think I have it correct now. I also added an example of the steps in BWTS and its inverse. http://mattmahoney.net/dc/dce.html#Section_559
    It looks more correct to me. I guess I would have used BANANAS,
    since I cherry-pick simple examples to make BWTS look much
    better than BWT.

    But even in your example I think the BWTS is better than the BWT,
    since the difference is (AAA,2) versus (A,AA), which uses
    less space in my view; but of course you could code it so that
    (AAA,2) uses less space - the fun of compression methods.

    The one point that is not clear to me, and that I have trouble explaining,
    is in your example in the column where the sorts occur. You have

    B
    AN
    NA
    AN
    NA
    A

    sorting to
    A
    AN
    AN
    B
    NA
    NA

    to get
    A
    N
    N
    B
    A
    A

    which is absolutely correct.
    However, someone reading this may get a false
    idea of what is meant by the sorting of strings
    of different lengths.

    In this case you don't have a problem, since
    each row is either the same, or differs
    before the end of the shorter string.

    But suppose you sort the following strings.
    I am not sure what the word for this sorting is, but
    it's the sorting method in BWTS.

    By the way, you don't have to change a thing.
    But I think the sorting of these rows is the key to
    BWTS. I am sure people tried before but did not
    think out of the box enough to put the whole picture
    together.

    Suppose, for the rotations, using your example
    of how to rotate, you start with the string
    BCBAB, which goes to BC B AB


    you get these rows

    BC
    CB
    B
    AB
    BA

    now sort
    AB
    BA
    B
    BC
    CB

    Note that B is greater than BA and less than BC.
    That is because for a short string you just
    repeat the string when you need more
    characters to compare.

    so BWTS is BABCB
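    The procedure described above (Lyndon factorization, then sorting the rotations of each word by their infinite repetition, which is how B lands between BA and BC) can be sketched in Python. This is an illustration only, not the linear-time algorithm:

```python
def lyndon_factorization(s):
    # Duval's algorithm: O(n) factorization into non-increasing Lyndon words
    n, i, out = len(s), 0, []
    while i < n:
        j, k = i + 1, i
        while j < n and s[k] <= s[j]:
            k = i if s[k] < s[j] else k + 1
            j += 1
        while i <= k:
            out.append(s[i:i + j - k])
            i += j - k
    return out

def bwts(s):
    # Take every rotation of every Lyndon word and sort them as if each
    # were repeated forever; the output is the last character of each.
    rots = [w[i:] + w[:i] for w in lyndon_factorization(s)
            for i in range(len(w))]
    L = 2 * len(s)
    rots.sort(key=lambda r: (r * (L // len(r) + 2))[:L])
    return "".join(r[-1] for r in rots)
```

    For example, `lyndon_factorization("BCBAB")` gives `["BC", "B", "AB"]` and `bwts("BCBAB")` gives `"BABCB"`, matching the example above.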

    Doing the inverse as in your example:

    0 A B 1
    1 B A 0
    2 B B 2
    3 B C 4
    4 C B 3

    get cycles (0,1) (2) (3,4)
    Reversing the order of the cycles and concatenating, we get (3,4,2,0,1).
    Using the indices you get BCBAB, the starting string.

    Again, thanks for putting it in. I hope people understand this.
    I truly believe that UNBWTS is what UNBWT should have been,
    but people missed how to handle the multiple cycles. Yet
    even in a string like ABCABCABC, when you do a real BWT with
    no EOS you actually get 3 cycles. They all happen to be the
    same, but the actual BWT is not one cycle for a repeated string.
    The EOS, since it is only used once in most BWT methods, gets a
    single cycle; since the character appears only once, you don't
    get repeated cycles as in a real BWT when strings are made of a
    repeated substring.

    Thank You Matt

  30. #60
    Matt Mahoney
    Thanks for clarifying. I added a couple clarifying remarks.

    I think if a block has only a few large Lyndon words, then the sort order of only a few characters at the ends of the words is much affected, so the effect on compression should be small.

    A small experiment: in BWTS I changed BLOCK_SIZE from 1000000 to 100000000 so the file fits in 1 block. I used the MSufSort demo for BWT, which always uses 1 block but adds a 4-byte pointer at the start.

    Code:
      831,117 a10.jpg.bwt.fpaq0p
    1,874,201 acrord32.exe.bwt.fpaq0p
    1,302,468 english.dic.bwt.fpaq0p
    3,833,138 FlashMX.pdf.bwt.fpaq0p
    1,110,670 fp.log.bwt.fpaq0p
    2,167,070 mso97.dll.bwt.fpaq0p
    1,069,492 ohs.doc.bwt.fpaq0p
      837,939 rafale.bmp.bwt.fpaq0p
      983,418 vcfiu.hlp.bwt.fpaq0p
      654,036 world95.txt.bwt.fpaq0p
    14,663,549 bytes
    
      831,136 a10.jpg.bwts.fpaq0p
    1,874,196 acrord32.exe.bwts.fpaq0p
    1,302,515 english.dic.bwts.fpaq0p
    3,833,106 FlashMX.pdf.bwts.fpaq0p
    1,110,664 fp.log.bwts.fpaq0p
    2,167,073 mso97.dll.bwts.fpaq0p
    1,069,508 ohs.doc.bwts.fpaq0p
      837,962 rafale.bmp.bwts.fpaq0p
      983,422 vcfiu.hlp.bwts.fpaq0p
      654,016 world95.txt.bwts.fpaq0p
    14,663,598 bytes
    
      836,557 a10.jpg.bwt.fpaq0f2
    1,701,417 acrord32.exe.bwt.fpaq0f2
    1,215,827 english.dic.bwt.fpaq0f2
    3,799,957 FlashMX.pdf.bwt.fpaq0f2
      598,266 fp.log.bwt.fpaq0f2
    2,026,964 mso97.dll.bwt.fpaq0f2
      920,418 ohs.doc.bwt.fpaq0f2
      787,640 rafale.bmp.bwt.fpaq0f2
      720,935 vcfiu.hlp.bwt.fpaq0f2
      510,006 world95.txt.bwt.fpaq0f2
    13,117,987 bytes
    
      836,555 a10.jpg.bwts.fpaq0f2
    1,701,418 acrord32.exe.bwts.fpaq0f2
    1,215,827 english.dic.bwts.fpaq0f2
    3,799,929 FlashMX.pdf.bwts.fpaq0f2
      598,303 fp.log.bwts.fpaq0f2
    2,026,970 mso97.dll.bwts.fpaq0f2
      920,443 ohs.doc.bwts.fpaq0f2
      787,627 rafale.bmp.bwts.fpaq0f2
      720,975 vcfiu.hlp.bwts.fpaq0f2
      509,992 world95.txt.bwts.fpaq0f2
    13,118,039 bytes


