
Thread: Tree alpha v0.1 download

  1. #451
    Paul W. (Member)
    nburns: "That looks about right, except that you only need one caret (^) to negate the bracket expression."

    Uh-oh. I guess I've been misreading Matt's code all along. (Like I said, I'm not a whiz at regular expressions.)

    Thanks for pointing that out.

    Now I have to figure out if my code's doing what I think, or something else.
    Last edited by Paul W.; 13th November 2014 at 22:45. Reason: attribution

  2. #452
    Paul W. (Member)
    OK, this seems to work:

    while (s/{[^{}]*}//g) {};

    It gets rid of most of the stuff we were talking about, which seems to be due to nested sidebar boxes and stuff (and failing to delete everything within the outer pair of curlies). It also reduces the file size by a few percent.

    Unfortunately, that's not all of it. There are some tables with embedded formatting where the markup isn't inside single or double curly braces, so it still gets through.

    For example, if you look starting at line 129,497 of enwik8, you find this section with an inline table:

    ===Season-by-season records===
    :''Since 1920''
    {{Start NFL SBS}}
    |-
    | colspan="6" align="center" | '''Chicago Cardinals (APFA)'''
    |-
    |1920 || 3 || 2 || 1 || 6th APFA<sup>1</sup> || rowspan="2" |  
    |-
    |1921 || 3 || 3 || 2 || 9th APFA<sup>1</sup>
    |-
    | colspan="6" align="center" | '''Chicago Cardinals (NFL)'''
    |-
    |1922 || 8 || 3 || 0 || 3rd NFL || valign="middle" rowspan="3" | ''The NFL did not hold playoff games until 1932''
    |-
    |1923 || 8 || 4 || 0 || 6th NFL
    |-
    |1924 || 5 || 4 || 1 || 8th NFL
    |-

    [...]

    That shows up in the output of Matt's script (modified to let newlines through by adding \n to the final tr pattern) like this:


    season by season records
    since one nine two zero


    colspan six align center chicago cardinals apfa

    one nine two zero three two one six th apfa one rowspan two

    one nine two one three three two nine th apfa one

    colspan six align center chicago cardinals nfl

    one nine two two eight three zero three rd nfl valign middle rowspan three the nfl did not hold playoff games until one nine three two

    one nine two three eight four zero six th nfl

    one nine two four five four one eight th nfl


    [...]

    (My loop doesn't change it.)
    Last edited by Paul W.; 14th November 2014 at 02:26. Reason: typo

  3. #453
    nburns (Member)
    Quote Originally Posted by Paul W.
    nburns: "That looks about right, except that you only need one caret (^) to negate the bracket expression."

    Uh-oh. I guess I've been misreading Matt's code all along. (Like I said, I'm not a whiz at regular expressions.)

    Thanks for pointing that out.

    Now I have to figure out if my code's doing what I think, or something else.
    This:

    [^{^}]

    Would probably match any symbol except {, ^, and }. The second caret is probably relatively harmless, unless carets mean something in enwik.
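
    A quick way to see the difference (a throwaway demo, not anything from Matt's script):
    Code:
    #!/usr/bin/perl
    # Demo of the two character classes on a sample string (hypothetical, not from the script).
    use strict; use warnings;

    my $s = "a^b{c}d";
    (my $only_braces = $s) =~ s/[^{}]//g;    # deletes everything that is not { or }   -> "{}"
    (my $with_caret  = $s) =~ s/[^{^}]//g;   # also keeps the literal ^                -> "^{}"
    print "$only_braces\n$with_caret\n";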

  4. #454
    nburns (Member)
    Quote Originally Posted by Paul W.
    Just doing s/{[^{^}]*}//g; several times seems to work, from some initial testing with enwik6

    I noticed something Matt apparently didn't when I was looking at the differences in output for doing it once, twice, and four times.

    Turns out that the double curlies can be nested, too. They are used to delimit several kinds of things, some of which can be nested. (Like the "succession box" for the Abraham Lincoln article, and the translation of Aristotle's name at the beginning of the Aristotle article.)

    The Aristotle thing is weird because it's got five nested curlies, which I'm not sure how to parse---is there a level of single curlies right inside some double curlies, or what?---but as long as we're supposed to throw away anything inside of any curlies, and they're strictly nested as usual for xml
    Curly braces don't have a meaning in xml. I think the curly braces are wiki markup. The combination of xml and wiki markup certainly compounds the challenges.

    Speaking of which: the best way to parse xml is usually with a real xml parser (you can do it with regexes, as this does, but it's harder). Has anyone checked whether there is a tool to automatically convert wiki->text? Getting perfect output might be as simple as running it through an xpath script to remove the xml followed by some wiki-to-text tool.

    Edit:
    Here's a promising lead: http://stackoverflow.com/questions/1...i-page-via-api
    Last edited by nburns; 14th November 2014 at 06:19.

  5. #455
    Kennon Conrad (Member)
    Nburns, you may be right about using a real xml parser.

    Paul, I think this is a little closer to what you are looking for. I have a feeling it's going to be nearly impossible to get it exactly right with Perl. The numbers of {'s and }'s aren't the same: enwik8 has 90906 {'s and 90958 }'s.

    I changed the four lines I had earlier to use the while loop like you did.
    I added another line to get rid of the table you mentioned:
    Code:
    while (s/\x0A\|[^\x0A]*\x0A/\x0A/g) {};
    Lastly, I moved both of the curly bracket while loops toward the top of processing because it looked like some of the earlier statements were removing a few curly brackets. All the colspan's in enwik8 are gone now.
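
    Roughly, the order of the relevant pieces now looks like this (a simplified sketch, not the attached script itself; I've escaped the braces, which newer Perls complain about when left bare):
    Code:
    #!/usr/bin/perl
    # Simplified sketch of the filtering order described above (the attached script is the real thing).
    use strict; use warnings;

    local $/;                                   # slurp the whole file
    my $text = <>;

    # Innermost-first removal of brace groups, repeated until nothing changes.
    while ($text =~ s/\{\{[^{}]*\}\}//g) {}     # {{...}} with no braces inside
    while ($text =~ s/\{[^{}]*\}//g)     {}     # {...} with no braces inside

    # ... the other, earlier substitutions from the script would go here ...

    # Drop wiki table rows, i.e. lines that start with '|'.
    while ($text =~ s/\x0A\|[^\x0A]*\x0A/\x0A/g) {}

    print $text;
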
    Attached Files

  6. The Following User Says Thank You to Kennon Conrad For This Useful Post:

    Paul W. (14th November 2014)

  7. #456
    Paul W. (Member)
    nburns, Kennon:

    I agree that using xml tools to extract the text would probably be a better way to start, but for now I'm just muddling forward a bit, trying to get something "good enough" for some simple, basic experiments I want to do, without too much weird garbage left in there.

    Right now I'm just trying to come up with a filter that makes most of Wikipedia look roughly like lightly-formatted ASCII text from the 1970's, like Book1 and Book2 from the Calgary corpus, or Dickens or whatever. After that I'd like to handle a bit more, e.g., the Basic Multilingual Plane of Unicode, so that it doesn't just drop characters with accents or umlauts from people's names, and keeps short strings of Arabic or Chinese or whatever. I'd like to keep whatever text is common in the English version of Wikipedia, including giving Aristotle's name in Greek script, and a few other things it's trivial to do in Unicode, like a superscript 1 or 2, or a Delta in a simple mathematical formula. That's the kind of text that text compression algorithms should be expected to handle nowadays, even if they're specialized for English---anything a good modern Unicode text editor can easily handle, I guess. I don't know how far I'll get with that, and I'm not sure what the gotchas are.

    Even with all the right tools, there are philosophical issues of what marked-up text should look like when converted to "plain" text. In the limit, it ends up making no sense, but I think it'd often be useful to keep a lot more slightly "weird" stuff.

    Mathematical notation, for example. Up to a point, it's "normal" writing. For many purposes it would be nice to preserve as much of it as we can, in some canonical form, so that you can keep things like integral or summation signs with stuff written under them, and so on. But Wikipedia has several ways of writing mathematical formulas, and it'd be very messy in general, and ends up not really being linear text. It's intrinsically marked-up and laid out.

    For another example, take the table in the Arizona Cardinals article that Kennon got rid of... should we really get rid of it? People do put some simple tables like that in "unformatted" ASCII documents or whatever. Like written numbers, they've been part of normal writing since the beginning of writing, and aren't specific to "marked up" documents on computers, so maybe we should just eliminate the inscrutable colspan/align/center/valign stuff, but leave the stuff it modifies like "Arizona Cardinals (APFA)". But there's a whole lot of tables and boilerplate articles in Wikipedia, many of the bigger ones probably generated automatically, and you end up with some very odd-looking n-grams and statistics. I have no idea how to decide which tables are simple and "normal" and "texty" enough to keep and which are big highly-formatted Wikipedia-specific computer-generated blobs to discard. (For now I'm happy to just toss tables because it's too hard to know which tables to keep and what to do with them.)

    Here are a few other issues I've got about how to compare Wikipedia to anything else, like the Corpus of Contemporary American English, or how to analyze or compress it.

    1. There are very many thousands of boilerplate articles about municipalities in the US, automatically generated from US Geographic Survey and US Census data by template-filling. Any one of them looks like a normal (but very dry) just-the-facts-Ma'am article about a podunk town in Montana or wherever, with normal text, but the particular phrases used in that text are repeated thousands of times, so that they dominate the word n-gram statistics for n above 3, and seriously skew the statistics at 3 words, and even at 2. They're a small minority of all articles, but have a big impact on overall statistics.

    2. Many articles are "list" articles, structured differently from normal articles, and many of them are very long, including many of the longest articles in Wikipedia. (E.g., the list of all the people who've ever clerked for the US Supreme Court.) I'm not sure what implications that has.

    3. Image articles (consisting of little more than an image, a caption, and a credit) also skew the word n-gram statistics a lot. The credit often contains a long boilerplate riff about where the picture came from and what license it's released under; that can be dozens of words long, including the longest repeated word n-grams, and some very long ones (upwards of 64 words) repeated thousands of times. That skews the statistics, too, because those long repeating n-grams include repeating n-grams of every shorter length. A lot of the frequent word 5-grams that aren't from template-filling boilerplate about municipalities are from verbatim repetition of image credits. I think that for a lot of purposes, we might want to just ditch the credits and license goop from images; it does not have the properties of normal text. (Although any appropriate boilerplate remover like an LZP filter will get rid of it. That's not true of the weird stuff you get from template-filling.)

    4. Tons of numbers. Tons and tons of them, a lot of them from the template-filling for the municipality articles, but they're all over. (Wikipedia is full of dates, population sizes, areas, monetary amounts, etc., and ranges and percentages and ratios of such things.) If you don't abstract away from the specific numbers, you get huge bloat of your n-gram model. And numbers come in a lot of formats. (E.g., should you treat $100,000.00 as the five-gram "$<digitstring>,<digitstring>.<digitstring>" or the 2-gram $ <number>, or what? See the little normalization sketch at the end of this post.) It depends a lot on what you're trying to do, or how you expect a compression algorithm to treat such things. Bleah.

    This all makes it hard to even find most of the common English 5-grams like "at the end of the", "in order to accomplish the" or "didn't know how to do". They're pushed way down the frequency list and interspersed with a whole lot of "of the area is water" and "under the Creative Commons License". (And more that are inscrutable because you get the end of one phrase and the beginning of the next. To figure out why they're frequent, you have to find the longer boilerplate that they came from, and why that's frequent.)
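
    As for the number formats I mentioned in point 4, the crudest abstraction is just a one-line substitution (a hypothetical sketch, not something I'm actually running):
    Code:
    #!/usr/bin/perl
    # Hypothetical sketch: collapse every digit run into a <num> token before counting n-grams,
    # so "$100,000.00" becomes "$<num>,<num>.<num>" and "1994" becomes "<num>".
    use strict; use warnings;

    while (my $line = <>) {
        $line =~ s/\d+/<num>/g;
        print $line;
    }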

  8. #457
    nburns (Member)
    Another paper describing a Tree-like algorithm. This one seems to deliberately use a "top-down"-like approach.
    Attached Files

  9. The Following User Says Thank You to nburns For This Useful Post:

    Paul W. (19th November 2014)

  10. #458
    Kennon Conrad (Member)
    Quote Originally Posted by Paul W.
    3. Image articles (consisting of little more than an image, a caption, and a credit) also skew the word n-gram statistics a lot. The credit often contains a long boilerplate riff about where the picture came from and what license it's released under; that can be dozens of words long, including the longest repeated word n-grams, and some very long ones (upwards of 64 words) repeated thousands of times. That skews the statistics, too, because those long repeating n-grams include repeating n-grams of every shorter length.
    I think one of the reasons Tree compresses text well compared to LZ algorithms is because of the way it generally compresses outside-in. The long boilerplate things that occur lots of times get put in the dictionary early, reducing the counts of the phrases in the dictionary entry to one (per time it appears in the "word"). Then the dictionary entry gets compressed along with everything else. The longest dictionary entries are hundreds of symbols long instead of up to something like 23K characters long. By the time compression gets to the shorter phrases, many of the longer phrases have been reduced to a dictionary entry whose subphrases only get counted once.

    Quote Originally Posted by Paul W.
    This all makes it hard to even find most of the common English 5-grams like "at the end of the", "in order to accomplish the" or "didn't know how to do". They're pushed way down the frequency list and interspersed with a whole lot of "of the area is water" and "under the Creative Commons License". (And more that are inscrutable because you get the end of one phrase and the beginning of the next. To figure out why they're frequent, you have to find the longer boilerplate that they came from, and why that's frequent.)
    I view this as an opportunity for improvement via context modeling.

  11. #459
    Kennon Conrad (Member)
    Quote Originally Posted by nburns
    Another paper describing a Tree-like algorithm. This one seems to deliberately use a "top-down"-like approach.


    It's a little over my head in places. They seem to try to have some similar ideas. I consider Tree to not be a smallest grammar compressor, but rather a smallest entropy compressor. The entropy starts going up before the smallest grammar is reached. Maybe entropy doesn't matter for genetics or somewhere else, because the smallest grammar problem does seem to draw a lot of attention.

  12. #460
    Kennon Conrad (Member)

    Tree v0.15

    Tree v0.15 has changes in TreeCompress64, TreeBitEncode and TreeDecode.

    TreeCompress64 uses a slightly different scoring algorithm that generally compresses a little better than before.

    TreeBitEncode now saves range-coded symbols on a stack and clears the stack once enough non-range-coded bits (12) have been seen. This fixes the bug encountered when encoding one of the text9 files. TreeBitEncode and TreeDecode now use a single symbol representing the combined symbol probability and symbol length when new dictionary symbols are defined. This was done because these values are correlated: strings that occur more frequently are more likely to contain two symbols, and strings that occur less frequently are less likely to. The exact probabilities and correlations vary from file to file, but follow the same general pattern.

    Benchmark compressed file size and compression/decompression times:

    enwik8:
    TreeCompress: 21,922,356 bytes, 3661 sec./0.74 sec.
    TreeParagraphs+Words+Compress: 22,023,144 bytes, 1907 sec./0.72 sec.
    TreeCompress64: 22,140,724 bytes, 516 sec./0.75 sec.
    TreeParagraphs+Words+Compress64: 22,155,772 bytes, 340 sec./0.74 sec.

    enwik9:
    TreeCompress: 176,896,672 bytes, 114,733 sec./7.2 sec.
    TreeParagraphs+Words+Compress: 178,321,588 bytes, 36,828 sec./7.0 sec.
    TreeCompress64: 177,974,208 bytes, 9542 sec./7.4 sec.
    TreeParagraphs+Words+Compress64: 178,874,272 bytes, 7269 sec./7.2 sec.

    TreeDecode.zip is 11,212 bytes

    The enwik9 TreeCompress results move the Pareto Frontier for decompression of enwik9 down by about 0.3 seconds and 450KB.
    Attached Files
    Last edited by Kennon Conrad; 21st November 2014 at 20:19.

  13. The Following User Says Thank You to Kennon Conrad For This Useful Post:

    surfersat (21st November 2014)

  14. #461
    nburns (Member)
    Quote Originally Posted by Kennon Conrad
    I think one of the reasons Tree compresses text well compared to LZ algorithms is because of the way it generally compresses outside-in. The long boilerplate things that occur lots of times get put in the dictionary early, reducing the counts of the phrases in the dictionary entry to one (per time it appears in the "word"). Then the dictionary entry gets compressed along with everything else. The longest dictionary entries are hundreds of symbols long instead of up to something like 23K characters long. By the time compression gets to the shorter phrases, many of the longer phrases have been reduced to a dictionary entry whose subphrases only get counted once.
    I think Tree performs well because of the way it recycles identical matches, instead of creating lots of slightly-different versions of a match, which LZ does. (That might be what you're trying to say.)

    Quote Originally Posted by Kennon Conrad
    It's a little over my head in places. They seem to try to have some similar ideas. I consider Tree to not be a smallest grammar compressor, but rather a smallest entropy compressor. The entropy starts going up before the smallest grammar is reached. Maybe entropy doesn't matter for genetics or somewhere else, because the smallest grammar problem does seem to draw a lot of attention.
    I have honestly not given it enough thought, but my hunch is that the problems are very closely related.

  15. #462
    Kennon Conrad (Member)
    Quote Originally Posted by nburns
    I think Tree performs well because of the way it recycles identical matches, instead of creating lots of slightly-different versions of a match, which LZ does. (That might be what you're trying to say.)



    I have honestly not given it enough thought, but my hunch is that the problems are very closely related.
    I was referring to the fact that by the time Tree gets around to figuring out how to compress short phrases, many longer phrases that occur often and contain those shorter phrases have already been deduplicated, so they no longer throw off the counts for the dictionary symbol creation decisions made for the shorter phrases. Going inside-out/bottom-up also recycles a lot of identical matches (and can give decent compression), but doesn't compress as well as outside-in.

    Yes, I think the problems are related. I think Tree would get close to the minimum grammar if I let it run until no more symbols could be combined. Most likely, the scoring would need to be adjusted for best performance, but I think the differences wouldn't be very much.
    Last edited by Kennon Conrad; 22nd November 2014 at 09:18.

  16. #463
    Kennon Conrad (Member)
    I started working on improving the MTF coding of Tree, beginning by making the queues use circular buffers. This version of TreeDecode decodes the same as TreeDecode v0.15, except it is a little faster:

    enwik9:
    TreeCompress: 7.2 sec. -> 6.9 sec.
    TreeParagraphs+Words+Compress: 7.0 sec. -> 6.6 sec.
    TreeCompress64: 7.4 sec. -> 7.0 sec.
    TreeParagraphs+Words+Compress64: 7.2 sec. -> 6.8 sec.

    TreeDecode.zip is 11,203 bytes, so Tree's point on the Pareto Frontier for decompression of enwik9 is now 6.9 seconds to decompress a 176,907,875 byte file.

    The next version will likely trade some speed for better compressed file size.
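
    For anyone curious, the circular-buffer change boils down to something like this (a generic illustration, not Tree's actual code): pushing the newest symbol to the front just moves a head index instead of shifting the whole queue, and the oldest entry falls off the back for free.
    Code:
    #!/usr/bin/perl
    # Generic circular-buffer recency queue (an illustration, not Tree's actual code).
    use strict; use warnings;

    my $SIZE = 8;                        # small queue for the demo
    my @buf  = (undef) x $SIZE;
    my $head = 0;                        # index of the most recent symbol

    sub push_front {                     # O(1): step the head back and overwrite the oldest slot
        my ($sym) = @_;
        $head = ($head - 1) % $SIZE;
        $buf[$head] = $sym;
    }

    sub position {                       # 0 = most recent; undef if the symbol is not in the queue
        my ($sym) = @_;
        for my $i (0 .. $SIZE - 1) {
            my $v = $buf[($head + $i) % $SIZE];
            return $i if defined $v && $v eq $sym;
        }
        return undef;
    }

    push_front($_) for qw(a b c d e);
    printf "position of 'c' = %d\n", position('c');    # prints 2
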
    Attached Files
    Last edited by Kennon Conrad; 22nd November 2014 at 22:52.

  17. The Following User Says Thank You to Kennon Conrad For This Useful Post:

    Sportman (23rd November 2014)

  18. #464
    Matt Mahoney (Expert)
    I updated LTCB. How much memory is used? I guessed 1800 MB for the 32 bit version.

  19. #465
    Kennon Conrad (Member)
    Quote Originally Posted by Matt Mahoney
    I updated LTCB. How much memory is used? I guessed 1800 MB for the 32 bit version.
    Thanks. I didn't watch too carefully, but saw it in the 1700's for the 32-bit version, so 1800 is fine. The others are a bit under 5200 and 3900 for Compress64 and Compress64+Paragraphs+Words. Is there a program like timer that captures peak memory usage?

  20. #466
    nburns (Member)
    Quote Originally Posted by Kennon Conrad
    I was referring to the fact that by the time Tree gets around to figuring out how to compress short phrases, many longer phrases that occur often and contain those shorter phrases have already been deduplicated, so they no longer throw off the counts for the dictionary symbol creation decisions made for the shorter phrases. Going inside-out/bottom-up also recycles a lot of identical matches (and can give decent compression), but doesn't compress as well as outside-in.
    Okay, but that doesn't have anything to do with LZ. LZ is neither top-down nor bottom-up.

    Quote Originally Posted by Kennon Conrad
    Yes, I think the problems are related. I think Tree would get close to the minimum grammar if I let it run until no more symbols could be combined. Most likely, the scoring would need to be adjusted for best performance, but I think the differences wouldn't be very much.
    In that paper, they talk about substituting different score functions:

    Among the good candidates to be applied to biological sequences, Longest-First, RePair and Greedy behave very similarly: they are all off-line, and they iterate over the grammar that is being built, selecting in each iteration a repeat that maximizes a score function. In order to compare in a uniform framework the behavior of the different score functions, we implemented them in a general schema that we called IRR (for Iteratively Repeat Replacement). First, the grammar being built is initialized with a unique initial rule S -> s where s is the input sequence, and then IRR proceeds iteratively. At each iteration, a word w occurring more than once in the grammar is chosen according to a score function f. All the (non-overlapping) occurrences of w in the grammar are replaced by a new non-terminal N and a new rewriting rule N -> w is added to the grammar.

    The choice of the score function instantiates different algorithms. As we mentioned, the most popular ones are maximal length (ML), most frequent (MF), and most compressive (MC) which chose the repeat that is longest, most frequent or compresses best the grammar respectively. The three corresponding algorithms are called IRR-ML, IRR-MF, IRR-MC. See [10] for details.
    English is evidently not the first language of these researchers: Iterative Repeat Replacement probably would have been a better name. But the idea is pretty clear. Tree obviously fits the IRR framework.
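
    To make the framework concrete, here's a toy version of the IRR loop as I read the quoted description (my own sketch, using a most-compressive-style score and rewriting only the start rule; the paper's implementation surely differs):
    Code:
    #!/usr/bin/perl
    # Toy IRR (Iterative Repeat Replacement) sketch: start with S -> input and repeatedly
    # replace the best-scoring repeated substring with a fresh non-terminal.
    use strict; use warnings;

    my $s = "how much wood would a woodchuck chuck if a woodchuck could chuck wood";
    my @rules;
    my $next_id = 1;

    while (1) {
        # Count non-overlapping occurrences of every candidate substring (brute force).
        my %count;
        for my $len (2 .. int(length($s) / 2)) {
            for (my $i = 0; $i + $len <= length($s); $i++) {
                my $w = substr($s, $i, $len);
                next if $w =~ /[\x01-\x1F]/;              # don't span earlier non-terminals
                next if exists $count{$w};
                my $c = () = $s =~ /\Q$w\E/g;             # non-overlapping count
                $count{$w} = $c if $c > 1;
            }
        }
        last unless %count;

        # "Most compressive"-style score: symbols saved by replacing c copies of length L.
        my %score = map { $_ => $count{$_} * (length($_) - 1) - length($_) } keys %count;
        my ($best) = sort { $score{$b} <=> $score{$a} } keys %score;
        last if $score{$best} <= 0;

        my $nt = chr($next_id);                           # control chars stand in for non-terminals
        $s =~ s/\Q$best\E/$nt/g;
        push @rules, [ $next_id++, $best ];
    }

    (my $shown = $s) =~ s/([\x01-\x1F])/"N" . ord($1)/ge;
    print "S -> $shown\n";
    print "N$_->[0] -> $_->[1]\n" for @rules;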

  21. #467
    Sportman (Member)
    Tree v0.15 IIS log file:

    635,346,739 bytes to 16,353,576 bytes in 11,179.476 sec.
    Reverse 505 msec.
    Compare OK.
    Default settings.

  22. #468
    Kennon Conrad (Member)
    Quote Originally Posted by nburns
    Okay, but that doesn't have anything to do with LZ. LZ is neither top-down nor bottom-up.
    I think it depends on which LZ format you are talking about. LZMA is flexible enough (or not restrictive enough, depending on your perspective) that the designer of a specific LZMA algorithm gets/has to choose which symbols go in the dictionary. The designer could choose to make decisions iteratively and could choose a top-down or bottom up approach.

    Basically, I look at choosing the dictionary symbols and encoding the data as two separate problems (when using a dictionary). Even with the described IRR framework, the data still needs to be encoded. Universal LZ encoding is a good fit for what the authors of the paper you posted call IRR since it supports multi-level symbols. In many ways, it is similar to LZMA encoding.

  23. #469
    Matt Mahoney (Expert)
    Thanks. In Windows I use timer32 or watch Task Manager to determine memory usage. In Linux I watch resident memory in top.

  24. #470
    Kennon Conrad (Member)
    Quote Originally Posted by Sportman
    Tree v0.15 IIS log file:

    635,346,739 bytes to 16,353,576 bytes in 11,179.476 sec.
    Reverse 505 msec.
    Compare OK.
    Default settings.
    Thanks, Sportman. Is this the same log file that you posted the following result for in post #254?

    15,851,524 bytes, 11,233.358 sec. - 1.125 sec., TreeComp.bat

    If so, it looks like decoding speed has been improved significantly, but compression is about 0.5 MB worse. If you are willing, I would like to better understand why the file has grown rather than guessing. Is it possible to run Tree v0.10's TreeBitEncode/TreeDecode on the output of TreeCompress v0.15? If the difference is in the encoding, then results from v0.11 of TreeBitEncode/TreeDecode would be helpful too.

  25. #471
    Kennon Conrad (Member)
    Quote Originally Posted by Matt Mahoney
    Thanks. In Windows I use timer32 or watch Task Manager to determine memory usage. In Linux I watch resident memory in top.
    I use timer, but it doesn't provide any memory data. I redownloaded the package your site has a link to and only see Timer.exe, and when run it only shows timing information. Is there a command line flag or different program (timer32?) that provides memory usage data? It's not so convenient to watch Task Manager when compression takes a long time.

  26. #472
    Matt Mahoney (Expert)
    I'm using timer v12.00. There is a version 14 now (7bench1400.7z) that includes timer32.exe and timer64.exe that give time and memory usage.
    http://sourceforge.net/projects/seve...d?source=files

  27. The Following User Says Thank You to Matt Mahoney For This Useful Post:

    Kennon Conrad (25th November 2014)

  28. #473
    nburns (Member)
    Quote Originally Posted by Kennon Conrad
    I think it depends on which LZ format you are talking about. LZMA is flexible enough (or not restrictive enough, depending on your perspective) that the designer of a specific LZMA algorithm gets/has to choose which symbols go in the dictionary. The designer could choose to make decisions iteratively and could choose a top-down or bottom up approach.
    I'm not intimately familiar with LZMA, but I think what LZ77-based compressors typically do is pick matches as (start, length) pairs and then run the result through Huffman (or equivalent). Now, each algorithm has additional details that I'm glossing over, of course. But, unlike Tree or IRR compressors generally, there is no requirement that the LZ matches can be arranged into a neat, concise set of production rules (a grammar). Tree does seek a concise grammar, which -- presumably -- requires an expensive off-line search. LZ algorithms are typically on-line and fairly greedy in how they choose matches -- and many use sliding windows, which make some matches unavailable. Furthermore, the way the LZ format works, even if they reuse the exact same matches, the pairs might look different (because the offsets may count backwards or be relative to the sliding window). So all this conspires against the final Huffman pass and makes the compression achieved fairly haphazard. LZMA is an advanced LZ variant, so it might have made some progress in attacking some of these limitations. But even if they could use Tree-style match selection, current LZ algorithms almost certainly don't. (That's not necessarily a flaw, because they're much faster than Tree.)
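
    To be concrete about what I mean by (start, length) pairs, the core of a greedy LZ77-style parse is roughly this (a toy sketch, nothing like LZMA's actual match finder or optimal parser):
    Code:
    #!/usr/bin/perl
    # Toy greedy LZ77-style parse: emit literals and (offset, length) pairs.
    use strict; use warnings;

    my $s = "abracadabra abracadabra";
    my @tokens;
    my $i = 0;
    while ($i < length $s) {
        my ($best_len, $best_off) = (0, 0);
        # Look for the longest match starting anywhere earlier in the text.
        for (my $j = 0; $j < $i; $j++) {
            my $len = 0;
            $len++ while $i + $len < length($s)
                      && substr($s, $j + $len, 1) eq substr($s, $i + $len, 1);
            ($best_len, $best_off) = ($len, $i - $j) if $len > $best_len;
        }
        if ($best_len >= 3) {                       # only code a pair when it's long enough to pay off
            push @tokens, "($best_off,$best_len)";
            $i += $best_len;
        } else {
            push @tokens, substr($s, $i, 1);        # literal
            $i++;
        }
    }
    print join(' ', @tokens), "\n";   # prints: a b r a c a d (7,4)   (12,11)  (the blank is a literal space)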

  29. #474
    Kennon Conrad (Member)
    I have no experience with LZMA, but it has repeats, unlike many (all?) other LZ algorithms. It seems like repeats would work best with a concise grammar, and I think that's where the Markov part comes in. But I don't know the production rules. It's something I would like to look into because there could be techniques that would be beneficial for Tree.

    It would be interesting to use Tree's encoder on a transformed LZMA stream. The transformation would consist of converting initial literals to definitions, subsequent literals to matches, and creating definitions at the source of matches. I suspect the transformation would generally provide better compression and faster decompression of text-like files than LZMA currently offers.

    On another subject, do you know of any good papers on how best to perform hybrid dictionary/MTF schemes for concise grammars? It has become apparent to me that Tree is missing out on a lot of mtf opportunities for the less rare symbols (>20 occurrences). It looks like changing Tree's mtf1 code to a queue of 200 symbols could reduce enwik9 by an additional 4 MB, but the scheme isn't ideal. The "problem" I am seeing is that MTF symbols with different global probabilities should have their likelihood depreciated at different rates. I suppose I could context model the queue position probabilities based on recent symbol probability history, but that seems overly complicated for my coding skills.

  30. #475
    nburns (Member)
    Quote Originally Posted by Kennon Conrad
    I have no experience with LZMA, but it has repeats, unlike many (all?) other LZ algorithms. It seems like repeats would work best with a concise grammar, and I think that's where the Markov part comes in. But I don't know the production rules. It's something I would like to look into because there could be techniques that would be beneficial for Tree.

    It would be interesting to use Tree's encoder on a transformed LZMA stream. The transformation would consist of converting initial literatals to definitions, subsequent literals to matches, and creating definitions at the source of matches. I suspect the transformation would generally provide better compression and faster decompression of text-like files than LZMA currently offers.
    So, you'd take the matches found by LZMA but change how they're represented? From what I understand, LZMA is pretty highly tuned. But good luck.

    Quote Originally Posted by Kennon Conrad
    On another subject, do you know of any good papers on how best to perform hybrid dictionary/MTF schemes for concise grammars? It has become apparent to me that Tree is missing out on a lot of mtf opportunities for the less rare symbols (>20 occurrences). It looks like changing Tree's mtf1 code to a queue of 200 symbols could reduce enwik9 by an additional 4 MB, but the scheme isn't ideal. The "problem" I am seeing is that MTF symbols with different global probabilities should have their likelihood depreciated at different rates. I suppose I could context model the queue position probabilities based on recent symbol probability history, but that seems overly complicated for my coding skills.
    When MTF appears in the literature, it's usually in relation to the BWT. I can't think of any papers to suggest, and I couldn't find anything with a web search. One thing you could try would be to track down and read all of the grammar compression papers you can find and see what they're doing and maybe something relevant will come up. I have a feeling that that problem is still open and so you'll get to make it up as you go.
    Last edited by nburns; 27th November 2014 at 00:02.

  31. #476
    Bulat Ziganshin (Programmer)
    LZMA search stage just finds all possible matches for each position. It may be used as the basis for any following entropy compression with optimal match choosing.

  32. #477
    Kennon Conrad (Member)
    Quote Originally Posted by Bulat Ziganshin
    LZMA search stage just finds all possible matches for each position. It may be used as the basis for any following entropy compression with optimal match choosing.
    I did a little research. It sounds like optimal match choosing is problematic for LZMA similarly to Tree. It seems to be accepted that global decisions would work better but are not very practical so the entropy coders typically make local decisions. If I understand correctly, more advanced encoding looks at the forward impact of matches, but it's still not making decisions globally.

    So perhaps it's not possible to make significant improvements in LZMA entropy coding, but it is possible to make significant improvements in match selection, with a serious loss of compression practicality.

    @nburns: You are probably right. Some of the details probably make it harder than I initially imagined. Now that I know more about LZMA, I think Tree's better compression ratio for enwik9 probably comes from using a cleaner symbol set derived globally. But I am just guessing, so some sort of test would help clarify why Tree creates a smaller compressed enwik9 than LZMA. I would like to know how much comes from symbol selection and how much comes from having fewer references to things that are not referenced.

  33. #478
    Paul W. (Member)
    Quote Originally Posted by Kennon Conrad
    [...] I think Tree's better compression ratio for enwik9 probably comes from using a cleaner symbol set derived globally. But I am just guessing, so some sort of test would help clarify why Tree creates a smaller compressed enwik9 than LZMA. I would like to know how much comes from symbol selection and how much comes from having fewer references to things that are not referenced.
    I wonder how much of the difference is simply because LZMA uses LZ77-style byte offsets rather than LZ78-style dictionary indexes.

    LZMA compresses the offsets by keeping just a few recently-used offsets and doing something resembling MTF-coding of them, but when that doesn't work, the inefficiency of byte offsets vs. string numbers still matters.
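
    (Roughly this kind of thing, if I understand it right; a toy move-to-front list of recent offsets, not LZMA's actual coder:)
    Code:
    #!/usr/bin/perl
    # Toy move-to-front list of recently-used match offsets (an illustration, not LZMA's coder).
    use strict; use warnings;

    my @rep;                               # most recent offset first, a few slots deep
    for my $off (100, 37, 100, 100, 52, 37) {
        my ($idx) = grep { $rep[$_] == $off } 0 .. $#rep;
        if (defined $idx) {
            print "offset $off -> rep$idx (cheap to code)\n";
            splice(@rep, $idx, 1);         # pull it out of its old slot...
        } else {
            print "offset $off -> new offset (full offset must be coded)\n";
            pop @rep if @rep >= 4;         # drop the oldest if the list is full
        }
        unshift @rep, $off;                # ...and put it at the front
    }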

    I don't really understand all of LZMA's modeling and entropy coding after the match-finding, but I'd guess that it makes a substantial difference.

    I'm wondering how much difference it would make to use a ROLZ-type offset compression scheme in LZMA. I'd guess that would narrow the gap significantly.

    I'm also wondering how much difference it would make to just use a bigger MTF queue of recently-used offsets.

    Does anybody know of any experiments that have been done to quantify the effects of various features of LZMA?

    (Given how well it works and how hard it is to improve it, I assume a lot of informal experimentation was done during development. It'd be nice to know the results of that... e.g., does LZMA use a 4-element MTF queue of offsets because 5 or 16 wouldn't get significantly better compression, or only because it'd make decompression significantly slower?)
    Last edited by Paul W.; 27th November 2014 at 19:24. Reason: corrected LZMA MTF queue size

  34. The Following User Says Thank You to Paul W. For This Useful Post:

    Kennon Conrad (27th November 2014)

  35. #479
    Paul W. (Member)
    Maybe somebody will correct me, but my impression is that LZMA doesn't do anything like Tree's MTF modeling of recently-seen novel strings.

    As I understand it, the only MTF-ing it does is on LZ77-style relative offsets, to detect and exploit strides, and not to model "hotness" (recency) of specific strings as in Tree.

    Those are very different things, and combining them would presumably result in better compression if you have a mix of normally "texty" data and repeating structures.

  36. #480
    Kennon Conrad (Member)
    Gosh, I was completely off base on LZMA. I saw Wikipedia's description that mentions a "dictionary" and started making assumptions without reading enough details. Paul, your comments made me realize I was misunderstanding LZMA. As far as I can tell, there is not a dictionary that can be directly referenced. So I don't think I will learn anything from LZMA about global symbol selection and also think much of the difference is from using dictionary references instead of byte offsets.

    Some statistics from Tree v0.15's (TreeCompress) compressed enwik9 file:
    mtf (2 - 20 instances): 4,197,656 hits, 73,112,243 bits saved
    mtf1 (code lengths 11 - 22): 3,265,078 hits, 26,784,319 bits saved

    86,387,490 symbols coded from code space
    78,924,756 symbols coded from code space except mtf/mtf1
    mtf consumes 216/4096 of code space
    mtf1 consumes 96/4096 of code space
    approximate penalty for non mtf/mtf1 symbols: 9,021,386 bits

    So altogether, it appears that mtf is saving 73,112,243 + 26,784,319 - 9,021,386 = 90,875,176 bits, or 11,359,397 bytes.

  37. The Following User Says Thank You to Kennon Conrad For This Useful Post:

    Paul W. (28th November 2014)
