Activity Stream

  • mitiko's Avatar
    Today, 17:28
    I'm going to try and compete with BWD later this year. I also have some good ideas for modeling after the dictionary transform.
    78 replies | 12194 view(s)
  • mitiko's Avatar
    Today, 17:13
    Thanks for looking through the code. I knew I'd eventually have to remove LINQ, but I don't think I'm using it in any very critical places. Right now, most of the work seems to be in counting the words and in ranking. Thanks for the heads-up on foreach loops, I didn't know they were more expensive than plain for loops. The architecture is pretty OK for now, it's like LINQ but async (using IAsyncEnumerable). I thought it might be slowing me down but it's actually not. It processes data in blocks, and I'm testing it for optimality with the block size equal to the full file size, so all the data only travels the pipeline once. The slow-down comes from calculating the dictionary; once we have it, parsing and serializing the dictionary is pretty fast.
    What the rank (as of the latest version) calculates is the difference in bits used to encode the string. So if the dictionary size is 256, 8 bits would be needed per word and the ranking would be analogous to (n-1)*s for n = word size, s = occurrences. (I'd rather work with l and c for length and count.) In your example, the weights are correct and the savings in bytes are the same as the rank. The whole ranking system is greedy in this way. It will prefer:
    "the" -> rank=2000
    "rabbit" -> rank=175
    "lion" -> rank=105
    "and" -> rank=70
    "fox" -> rank=70
    total = 2420
    The ranks here are only from the top of my head. The non-greedy variation you have in mind looks something like:
    "the fox and the rabbit and the lion" -> rank=1225
    "the" -> rank=2000
    total = 3225
    I have actually thought a lot about this, and it's not a problem of the ranking function itself, at least not its primitive variation. No mathematical formula can solve the greediness; there's always going to be some example where, if we switch the order, we'll get a better result. It's a problem of choosing the best word from the ranking. Instead of greedily choosing the highest ranked word, we can look ahead (in time) and calculate the ranks of future words, then choose the optimal branch. It becomes a look-ahead tree.
    The fragmentation problem by itself can be solved if we consider the dictionary to initially be the alphabet and then morph it into longer words, like the dictionary A, T, C, G morphing into AT, A, T, C, G. Then we can do parsing before each new word calculation and we won't have the constraint of words not spanning over multiple blocks. I've excluded this as a good idea because 1) we have to do parsing at each word and only re-parsing doesn't seem viable, and 2) we won't have this problem for a large enough look-ahead tree. But now that you mention it, maybe the look-ahead tree can't be that big. My hope was that since we're calculating a lot of the words' ranks, they won't change under the tree, and when we choose the optimal branch we'd only have to calculate the ranks of the next layer of words. Basically ranking becomes at most x times slower if we prune and look at the top x words. The tree depth is only a concern for the first word and then updates are pretty fast, so it's pretty inexpensive to have a deeper tree compared to a wider tree. But for the look-ahead tree to solve the fragmentation dilemma, a deeper tree doesn't cut it; we need a wider tree. So maybe you're right, I should reconsider overlapping words and look into re-parsing, although I'm still pretty set on having the look-ahead tree as well.
    A couple of assumptions: Right now, with the naive ranking ((n-1)*s), we don't have to rank all words and can re-rank only some. If we choose to re-rank the words, we have to store all the ranks; we can't store just the top 5, because at some point we won't be able to update the top words 5 times in a row and we won't have a top 5. My current ranking was last updated with a hashtable for ranks, so the actual calculation is only done once per group of similar words. What we can do, for sure, is re-count only some words. I'm still working on that; I have a draft, I just have to fix some minor indexing issues. The reason why I haven't done re-ranking is that the optimal ranking algorithm uses a state. The technique is similar to morphing/updating the dictionary. This way the current rank of a word depends on the other words chosen, and it's closer to the actual optimal order of the dictionary, especially with the look-ahead tree. For sure, re-ranking and re-counting plus the fragmentation fix will create a great compressor with acceptable compression times. But I'd rather improve BWD in the direction of optimality at the cost of compression time than leverage memory for time. When I'm done, I'll definitely try to implement one with this stack!
    15 replies | 715 view(s)
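    A minimal sketch of the greedy rank = (length-1)*count selection discussed in the post above, in Python. This is my own toy illustration, not the actual BWD code; the function name, the overlapping counting, and the placeholder "dictionary index" characters are all simplifying assumptions.

      from collections import Counter

      def greedy_dictionary(text, rounds=5, max_word=8):
          """Toy version of the greedy selection: repeatedly pick the substring
          with the highest rank = (length - 1) * count and replace it.
          (Counts here are overlapping; a real implementation would count
          non-overlapping occurrences and re-count incrementally.)"""
          dictionary = []
          for _ in range(rounds):
              counts = Counter(text[i:i + size]
                               for size in range(2, max_word + 1)
                               for i in range(len(text) - size + 1))
              if not counts:
                  break
              best, c = max(counts.items(), key=lambda wc: (len(wc[0]) - 1) * wc[1])
              if c < 2:          # nothing repeated left that is worth a dictionary slot
                  break
              dictionary.append(best)
              # stand-in for "replace every occurrence with a 1-byte dictionary index"
              text = text.replace(best, chr(0x100 + len(dictionary) - 1))
          return dictionary, text

      print(greedy_dictionary("the fox and the rabbit and the lion and the dog"))

    The look-ahead tree described above would replace the single max() step: expand the top x candidates one more level, re-rank under each branch, and keep the branch with the best combined rank.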
  • CompressMaster's Avatar
    Today, 15:58
    CompressMaster replied to a thread WinRAR in Data Compression
    Yeah, I saw this almost 2 years ago or so... even in Windows there are vulnerabilities like that - one that was only recently patched in Win10 had existed since Windows 3.11.
    192 replies | 137044 view(s)
  • Jarek's Avatar
    Today, 15:40
    I was just pointed to a 2019 Microsoft patent for rANS: https://patents.google.com/patent/US20200413106A1 Can anybody find something novel here?
    120 replies | 91098 view(s)
  • LucaBiondi's Avatar
    Today, 13:30
    LucaBiondi replied to a thread paq8px in Data Compression
    Hi guys, thanks to all for the comments. I run paq8px alongside my normal work. My PC has 16 GB of RAM, and often that is not enough. This happened for 201 -10 vs. 201 -11. Luca
    2337 replies | 624966 view(s)
  • Gotty's Avatar
    Today, 13:11
    Gotty replied to a thread paq8px in Data Compression
    Yes. In this league there is a serious "fight" for each byte gained. Each version is better by around 0.02%-0.10%, and for that tiny improvement there is a lot going on. The result is extremely "tight" and optimized already - it's not really possible to "tune" the existing models and contexts for any more gains; we (usually) need to add more models to see better gains. That costs time and memory. The gain is usually not too great, as the existing models cover the possibilities quite well. We really do sacrifice memory and speed for compression ratio. So in this league a 3% improvement is a lot - and it costs a lot.
    2337 replies | 624966 view(s)
  • hexagone's Avatar
    Today, 07:21
    hexagone replied to a thread paq8px in Data Compression
    You mean 'accurate', I assume. Accurate timings are useful. Or just compare the option with the highest compression for each release. A 3% improvement for a 3x compression time is incredibly expensive. "Worth" is in the eye of the beholder...
    2337 replies | 624966 view(s)
  • Gotty's Avatar
    Today, 04:53
    I have only two points (for performance-critical parts of your code):
    - use for loops instead of foreach loops;
    - avoid using LINQ - convert it to normal loops.
    (These modifications will eventually be needed anyway when you really want to convert your code to C/C++ one day.) I know these will give you only a small boost; you're looking for a faster architecture. Unfortunately I don't know your architecture (only had a peek). Could you try something? What if you rank words simply by their size? That is a stable metric: you don't need to recalculate the ranking after each round, just check whether there is still more than 1 occurrence in the next round. It may sound strange at first, but it could be as good as the current metric that takes the occurrence count into account too. Why? When you rank the words by (n-1)*s there is an issue: long words are polluted when their constituents are replaced. Example:
    the fox and the rabbit and the lion
    the chicken
    the fox and the rabbit and the lion
    the lion
    the fox and the rabbit and the lion
    the horse
    the fox and the rabbit and the lion
    the penguin
    the zebra
    the cat
    the worm
    ... (many more) ...
    the dog
    The weights:
    "the": (n-1)*s = 2*1000 = 2000
    "the fox and the rabbit and the lion": (n-1)*s = 3*35 = 105
    So "the" will win the first round. After the first replacement those 4 long substrings will be unnecessarily polluted by indexes of the "the"s. In this case it would be better to just replace the 4 long substrings first and then replace those "the"s. What do you think? I wonder how big the difference with this metric is, and whether it is better or worse.
    15 replies | 715 view(s)
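    A tiny Python sketch of the two ranking metrics discussed above. The word list and counts are hypothetical, made up only to illustrate how the orderings differ; this is not code from either project.

      # hypothetical (word, count) pairs in the spirit of the example above
      candidates = {
          "the": 1000,
          "the fox and the rabbit and the lion": 4,
          "rabbit": 40,
      }

      def rank_weighted(word, count):      # current metric: (n-1)*s
          return (len(word) - 1) * count

      def rank_length(word, count):        # proposed metric: length only, if it still repeats
          return len(word) if count > 1 else 0

      for name, rank in (("(n-1)*s", rank_weighted), ("length", rank_length)):
          order = sorted(candidates, key=lambda w: rank(w, candidates[w]), reverse=True)
          print(name, "->", order)
      # (n-1)*s ranks "the" first (2000 vs. 200 vs. 136),
      # length ranks the long phrase first, so it is replaced before "the" can fragment it.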
  • JamesWasil's Avatar
    Today, 03:04
    JamesWasil replied to a thread WinRAR in Data Compression
    This may have been missed or not seen a year or two ago, but I found it interesting that a vulnerability like this existed for almost 20 years in the wild. Technically not WinRAR's fault, but support for WinACE and external formats was why and how the exploit came to be: https://www.theverge.com/2019/2/21/18234448/winrar-winace-19-year-old-vulnerability-patched-version-5-70-beta-1
    192 replies | 137044 view(s)
  • Alexander Rhatushnyak's Avatar
    Today, 02:19
    I won't submit anything this year or next year. No later than late 2022, someone else will win a prize. Maybe a simplified version of cmix or nncp, maybe indeed something like paq8px with a better NLP part.
    78 replies | 12194 view(s)
  • Gotty's Avatar
    Today, 00:26
    Gotty replied to a thread paq8px in Data Compression
    It looks like the timing is not reliable - look at the last 2 rows (paq8px_v201 -10 / paq8px_v201 -11). In the xml column they are the same (to the hundredth), which is probably an error. Ignoring this column, the ratio still varies between 1.29 and 4.29. That's a very large spread. Luca was probably running multiple instances at the same time or running other tasks besides paq8px. So it's best to ignore the timings; they are not really useful. If you still would like to compare timings, you may need to clean the results by scaling the outliers, but most importantly compare results obtained with the same command line options. You may compare versions at the same compression level (8 to 8, 9 to 9, etc.), and especially with lstm either used or not used in both. I'd say the "slowdown" from v95 to v201 is probably about 3x. Yes, it's still a lot. But for a 3-4% compression gain it's worth it.
    2337 replies | 624966 view(s)
  • Scope's Avatar
    Yesterday, 22:13
    https://gitlab.com/wg1/jpeg-xl/-/commit/65825c8629f181dbe8bdd432b15576735af29701
    15 replies | 1995 view(s)
  • Fallon's Avatar
    Yesterday, 15:39
    Fallon replied to a thread WinRAR in Data Compression
    WinRAR - What's new in the latest version https://www.rarlab.com/download.htm Version 6.01 beta 1
    192 replies | 137044 view(s)
  • Shelwien's Avatar
    Yesterday, 14:07
    https://stackoverflow.com/questions/45382914/parallel-memory-access-on-modern-processors I made a benchmark which times something like this loop:
      #define rndup(seed) (seed = 214013 * seed + 2531011)
      s0=1*seedinc; s1=2*seedinc; s2=3*seedinc; s3=4*seedinc;
      for( i=0; i<numreads; i++ ) {
        s0 += buf; s1 += buf; s2 += buf; s3 += buf;
        rndup(s0); rndup(s1); rndup(s2); rndup(s3);
      }
    for a number of streams from 1..64 (the loop code is generated by a script) and tested it on two CPUs: i7-7820X @ 4.5Ghz and i7-8650U @ 2.5Ghz (see pictures). For the 7820 there's no clear result, just that timings converge after ~16 streams and there's no further speed improvement (per stream). For the 8650 it seems like there are ~16 streams for L3 and 6-8 for RAM; there's a visible speed difference after that.
    0 replies | 84 view(s)
  • Gotty's Avatar
    Yesterday, 12:29
    First of all, your use case is not clear to me: what is the purpose of hashing in your case? Do you need uniqueness? If yes, why? Do you need characters in the hash, or are bytes also OK?
    >> to convert a street address
    Addresses are terrible regarding their quality. What are you planning to do with addresses that are written differently but represent the same actual address? Do you plan a preprocessing stage to clean the addresses up?
    2 replies | 58 view(s)
  • Shelwien's Avatar
    Yesterday, 06:32
    > So for example, what's the best way to convert a street address to a short fixed-length token?
    Some cryptohash (md5, sha1, sha256, ...). Faster hashes might work too, but are less reliable.
    > Say a six-character string.
    If it's actually a binary blob, you can just take the first 6 bytes of the hash value or something. But if it's a restricted-charset "readable" string, you'd have to use the first 6 symbols of the base64 code of the hash value, or, ideally, a more flexible restricted-charset code like http://nishi.dreamhosters.com/u/marc_v2b.7z
    > If every hash has to be unique, six characters can't do it even if it gives us more than enough range for 10 million addresses.
    There's obviously no perfect solution, but one idea is to modify the actual strings (eg. add spaces) until a unique hash value is found.
    > Unless... there's some way to guide the hash so that there are no collisions
    Hash functions usually contain some magic numbers that can be treated as parameters and adjusted to improve the results. This might let you get rid of a few collisions among known entries... but of course it won't work for unknown future entries.
    2 replies | 58 view(s)
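    A short Python sketch of the "first 6 symbols of the base64 code of a cryptohash" idea above. The function name and the sample address are made up for illustration; it assumes the address string has already been normalized.

      import base64
      import hashlib

      def address_token(address: str, length: int = 6) -> str:
          """Short fixed-length 'readable' token from a cryptographic hash."""
          digest = hashlib.sha256(address.encode("utf-8")).digest()
          # readable variant: first `length` symbols of the base64 code of the hash
          return base64.urlsafe_b64encode(digest)[:length].decode("ascii")

      print(address_token("221B Baker Street, London"))  # prints a 6-character token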
  • SolidComp's Avatar
    Yesterday, 03:29
    Hi all – I'm not too familiar with hashing theory, and had some questions. I want to tokenize various data in social science research. So for example, what's the best way to convert a street address to a short fixed-length token? Say a six-character string. Addresses are of course highly variable strings, and max out at around 100 characters/bytes. For a country with a max of 10 million possible current and future addresses, I assume there's no way to hash an address to a six-character string without collisions, right? Six characters from a 30-character set covers 729 million possibilities, but the inputs can be any of more than 36^100 possibilities (all caps + numbers), and actually even more than that since the input address is variable length. If every hash has to be unique, six characters can't do it even if it gives us more than enough range for 10 million addresses. Unless... there's some way to guide the hash so that there are no collisions. I don't know of anything like that that is still called a "hashing algorithm". It would be some kind of subset hashing that takes inputs that have a huge range if they're arbitrary, where the actual possibilities are a much smaller subset, and the output hash string is sized for the actual possibilities. Would you just randomly assign string tokens instead of hashing? Is that what they do with credit card tokenization? Thanks.
    2 replies | 58 view(s)
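    To put a number on the collision worry in the question above, here is a quick birthday-bound estimate (my own back-of-the-envelope calculation, assuming 10 million addresses hashed into a 30-symbol, 6-character space):

      import math

      n = 10_000_000           # addresses
      space = 30 ** 6          # 6 chars from a 30-symbol alphabet, about 729 million
      # birthday approximation: P(at least one collision) ~= 1 - exp(-n^2 / (2 * space))
      p_collision = 1 - math.exp(-n * n / (2 * space))
      expected_pairs = n * n / (2 * space)    # expected number of colliding pairs
      print(p_collision)       # ~1.0: a collision is essentially certain
      print(expected_pairs)    # ~68,500 colliding pairs expected on average

    So with plain hashing into 6 characters, collisions are not a rare accident but the normal case, which is why the answers above suggest either longer tokens, tweaking inputs until unique, or assigning tokens from a table.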
  • LucaBiondi's Avatar
    Yesterday, 00:26
    LucaBiondi replied to a thread paq8px in Data Compression
    Oh yes, help me @Darek please! Luca
    2337 replies | 624966 view(s)
  • Darek's Avatar
    Yesterday, 00:04
    Darek replied to a thread paq8px in Data Compression
    @LucaBiondi - maybe I'm not the master of Excel, but I could help you to make some changes
    2337 replies | 624966 view(s)
  • hexagone's Avatar
    4th March 2021, 23:45
    hexagone replied to a thread paq8px in Data Compression
    This is brutal. Dividing the last row of numbers by the first shows that the few percent of compression ratio improvement costs 5x to 15x in compression time.
    2337 replies | 624966 view(s)
  • Hacker's Avatar
    4th March 2021, 23:19
    Yup, I am aware of JPEG XL. I hope for some wider adoption, as I have time and time again with each hopeful JPEG successor.
    32 replies | 2377 view(s)
  • Jarek's Avatar
    4th March 2021, 16:44
    Jarek replied to a thread Zstandard in Data Compression
    https://www.phoronix.com/scan.php?page=news_item&px=Zstd-1.4.9-Released https://www.reddit.com/r/linux/comments/lx43x1/zstandard_v149_released_2x_faster_long_distance/
    458 replies | 147477 view(s)
  • SolidComp's Avatar
    4th March 2021, 06:15
    It needs a much higher quality display to be competitive with the major competitors in this space. Also it seems to be missing basic biometrics, like an IR camera for Windows Hello facial recognition. I wouldn't buy a laptop without those features. But I love the microphone cover. That's actually more important than covering a webcam. And the PCIe 4.0 x4 slot for an SSD is excellent; I don't know of another laptop that has 4-lane PCIe 4.0.
    1 replies | 116 view(s)
  • LucaBiondi's Avatar
    3rd March 2021, 23:08
    LucaBiondi replied to a thread paq8px in Data Compression
    Done! I hope you enjoy the graphs of size vs. time.
    2337 replies | 624966 view(s)
  • Gotty's Avatar
    3rd March 2021, 21:56
    Gotty replied to a thread paq8px in Data Compression
    Hi Luca! You can upload the Excel file itself (in a zip or 7zip).
    2337 replies | 624966 view(s)
  • LucaBiondi's Avatar
    3rd March 2021, 18:36
    LucaBiondi replied to a thread paq8px in Data Compression
    Hi guys, I have an Excel file that contains the results of the compression of my dataset starting from PAQ8PX_V95(!). I have also plotted size / time for each datatype. It's not easy to attach all these images to the post. Where could I upload my Excel file? Maybe someone can help plot the data in a better way.. Thank you, Luca
    2337 replies | 624966 view(s)
  • Darek's Avatar
    3rd March 2021, 11:16
    Darek replied to a thread paq8px in Data Compression
    No, there were different moments. It's probably something with my laptop. I suppose it could be an issue of low space on disk C (system). I sometimes need to close the laptop and hibernate the system, and in some cases, after waking the system up, paq8px quits without any message. For now I've started to plan compression runs for times when the system won't need to be hibernated. For enwik9 that is about 4 days of time which must not be disturbed.
    2337 replies | 624966 view(s)
  • Shelwien's Avatar
    3rd March 2021, 07:33
    Which can be interpreted as "Let's wait 3 more years, then buy a bunch of otherwise useless preprocessors from Alex for $25k". 500000*(1-116673681/122000000) = 21829 (122000000-116673681)/5000/365 = 2.92 122MB is the estimated paq8px result, cmix and nncp won't fit due to time limit.
    78 replies | 12194 view(s)
  • mitiko's Avatar
    2nd March 2021, 23:40
    Uhm, Wikipedia says it's a general purpose LZ77 paired with Huffman and second order context mixing, but looking at the code it also includes a small static general purpose dictionary: https://github.com/google/brotli/blob/fcda9db7fd554ffb19c2410b9ada57cdabd19de5/c/common/dictionary.c But I'm probably not the guy to ask. Meanwhile, I made the counting linear but it's way slower. I tried some performance analyzing tools but found nothing conclusive, just a little bit of everything - matching is expensive, sorting is expensive, and looking at the bitvector (which I implemented myself) is inefficient. I'm considering some DP approach where we only count the words once, but then there would be quite a complicated process of updating the counts, involving multiple suffix array searches, and I'm still debating whether it's worth it. Correct me if I'm wrong, but mcm seems to get away with counting easily because it only ranks matches to the words in the lookahead window. Later it also uses a hashtable of strings, which would be quite inefficient in my case.
    15 replies | 715 view(s)
  • SolidComp's Avatar
    2nd March 2021, 21:55
    Mitiko, what about brotli's dictionary transforms? Are they LZ77/78?
    15 replies | 715 view(s)
  • byronknoll's Avatar
    2nd March 2021, 21:47
    A new rule has been added to Hutter Prize to make it easier: http://prize.hutter1.net/hfaq.htm#ddisc
    78 replies | 12194 view(s)
  • Gotty's Avatar
    2nd March 2021, 14:03
    Yes! You have it. Exactly: if you have no further information ("no theory" in your words) you can ask 2 questions, hoping that you will find the number quicker, but if you are unlucky, there is a higher chance that you will need to ask 4 questions. So 3 is the overall optimum: it's always 3, no risk. There is no way to solve it with 2 questions without risking. If you have some information ("theory"), that usually it's AB and rarely it's CDEFGH, you can go for AB first. But in this case the symbols are not equiprobable, so it's not random. And we are going for random content, so you don't have any extra information about what the result usually is (what distribution the symbols usually follow).
    If we do the game with 16 letters, it's the same: the optimum is 4 questions. With 32 it's 5. No matter how big your range is, the optimal number of questions is always log2(n). Going even higher doesn't change the game: if you have 4096x8 random bits, the optimal way to find out those bits is to do it in 4096x8 questions. Can't do it in less.
    Practical compression example: try compressing random files with any compression software. The result will always be a bit bigger. In the case of paq* it will actually try to find patterns (find a "theory" behind the bits and bytes). It really does try finding patterns! The file being random, we know that there are no patterns there. But paq does not know that. So it tries. When it believes it found some theory (that some symbols are more probable than others) then it applies it - and experiences a loss. Ouch! So the result gets a bit bigger. It's like trying to guess the 8-symbol game in 2 questions. Risky, right? Because it is programmed to lower the risk and to find the best theory, it will soon give up trying to solve the 8-symbol game in 2 questions - it will stick with 3 questions. And so it reduces its loss. So this is the reason why there is always a small loss there - because it always starts out a bit riskier, so those losses are inevitable (in the case of paq or any other compression software with similar internal workings).
    21 replies | 607 view(s)
  • Gotty's Avatar
    2nd March 2021, 13:49
    xinix, you are a bit harsh. As Trench said: "upset". I know it's upsetting when people make claims or statements and they don't fully understand the subject (yet), but it's OK. I sense that he's open and he tries to understand it deeper, so let's give it a go. He has no background in programming or information theory - it would be nice to have, but what can you do? We are here to help him (remember: he came with a question). This is what we can do: practical examples, visualizations, games - so even without the full experience or solid background at least he can get a sense of what randomness means. You know how I started in ~2007? Funny: I tried to see if I could compress the already compressed paq8hp result smaller. I could easy-peasy win the contest. Fortunately I have a math and programming background, and I quickly realized that it won't work. You do have to try compressing random files in order to understand what randomness means (not with paq or any other software - they don't give you first-hand experience, but in your own way, creating your own tools). So then I deeply understood the real meaning of randomness. And I can tell: it has its beauty. Whatever you do, however you slice it, you just can't make them smaller. The result will always have the same size (in the optimal case). So I tried it, and I have actual first-hand experience. Equipped with my math background I also know the explanation. Now he can have a tiny little experience by trying to solve the ABCDEFGH puzzle in 2 questions.
    21 replies | 607 view(s)
  • Gotty's Avatar
    2nd March 2021, 13:21
    I'm trying to understand what's happening. So, we had breakthroughs in 2007, 2015, 2020, and now in 2021. The messages switch between significant savings and "a little smaller". At this point this is what I get: no breakthroughs, the previous attempts were false alarms. This time you can probably save a couple of bytes with not completely random files. (Which is believable.) Your request to "be patient" tells me that you are more careful this time. I think it's wise, and I respect that.
    66 replies | 2168 view(s)
  • xinix's Avatar
    2nd March 2021, 06:54
    Ahaaaa!!! There you go! Kudos to RANDOM, you had an epiphany! We also wrote to you that your "random" 4 KB file is much too small for correct tests. You need at least 1 megabyte! That would eliminate some of the overhead, archive format headers and the like. Gotty, don't worry, I'm fine! It doesn't work any other way with him. Why are you blind? You've been told 2 times that PAQ already compresses 2 characters! (Reminder, Gotty, don't worry, I'm fine!) Well, PAQ already does what you need it to do, it already compresses those 2 characters, only it does it better and faster, without bloating the file, and we don't have to convert the file to a bit view because PAQ does that itself! Just read: https://en.wikipedia.org/wiki/PAQ http://mattmahoney.net/dc/paq1.pdf You have a lot of time to think about it :) ______ Tell me, do you program? Have you really learned any programming language?
    21 replies | 607 view(s)
  • Trench's Avatar
    2nd March 2021, 05:05
    I rushed my last reply, which was sloppy. You said it well about how conversion beats compression, but you also answered the question afterwards by stating that no one deals with compressing 2 digits since there is no need for it. Then again, there is no need for me to answer the puzzle, but it was a challenge for others to learn from. And again, if everyone does the same thing, nothing new is being done to get a different perspective. Your reply is right from a conventional standpoint. But why have you not second-guessed yourself? Since it's not needed, as you say. How do you know that you know it all? Based on experience. A lot of things are found by accident, just fooling around and trying silly things, since the obvious is not so obvious.
    Random to you is not random to me. People with super memory have a structure to remember random numbers or phrases due to association. They may be random to others but not to a few. The computer does not know what random is unless a person puts in what is or is not random, which you agree with. So to say it's impossible because it's random is like saying it's impossible because it's being ignored. Ignoring something does not make it truly so, only temporarily, within the limitation of the person. And that person puts it in the code, which also makes it limited, and then it becomes a standard, and then people ignore it, since who will argue with the official stance? That does not seem healthy for innovation, since it puts up boundaries and discourages. So when everyone says you cannot compress random data, it has reached its limit for so many experts, and anyone new feels the experts know best. That is not healthy for any field of work, which is what I am trying to get at. Words make or break things, which has restricted a lot of innovation. Even dictionaries have changed the meanings of words over time. Even something as simple as the word "leisure" now means to relax with free time, but it used to mean using free time to learn, if we go much further back, as a quick example. I am not here to change anyone's mind; I am here to offer another perspective from which to see things. When everyone sees a playing card from one side it may look wide, but from another it is thin. No one has to agree, but just like your tests, it's a way to get used to thinking, and thinking differently, to help things. On a side note, Tesla liked the number 3 and he said the secret to everything is frequency. When he says everything, does he mean everything?
    As for the ABCDEFGH game, which I did not have time to read last time: 8 letters seems too small to answer in under 3. 8 = you choose 4, 4 = 2, 2 = 1. The selection is too small to deal with, and other methods would be needed to get to 2. Sure, your problem can be solved with 2, but again the odds of being wrong are greater. LOL. But if you only had 2 choices then you would have to come up with a theory on why you chose, and understand the context of where it is coming from. Not as practical, but it's a good way to think things through.
    As for the 256 game, which was answered last time, it might be good to take a guess in the first part to eliminate 2/3. I was trying to take that path rather than 1/2, which is the sure and steady way to get a shorter answer. It has a probability of being wrong and causing more delay, but when right it has a probability of being fast, which was the purpose of the game. If I was right all the time, guessing and eliminating 2/3 would be 6 turns; if wrong and I stuck to it, then 13 turns at worst. To be wrong every time the probability would be 50%, so within that 50% it would take 8 turns as well. So 5 at best (not likely), 13 at worst. That is how I came to my answer, but I have thought it through now to understand it better and explain the numbers. Unless your percentage is different? So you are right, but I think I gave a better answer in speed, which was a risk. You try it with 2/3 eliminated and tell me how fast you get it. The answer is 3 and the second one is 200. Use the same method; which order you eliminate first is up to you. So what are your odds? I should have put more effort in last time, but as I said I was in a rush and did not have time to think it through.
    Lotto odds also have a pattern which most don't know, but it looks "random" to most. The point is using probability to get better odds. Just like if you flip a coin it will be 50% heads or tails, and if you flip it 100 times it should be near 50% heads or tails. Sure, it can be heads 100 times, but the odds against that increase, even though every flip always has a 50% chance of being heads, whether it has been heads 4 times in a row or 8 times; there are other probability factors in play that most ignore because of what they see at face value, which tricks everyone, even most mathematicians.
    You say "Also don't expect that one day we will invent a smart method to compress random data." You already do. :) A file is a file. The program that deals with it determines whether it is too "random" for it to handle. So to call a random file not random when you know the code is not fair. It is just like a lock: it is all random, but not to the key you have. All locks are the same; it's the key that determines whether it is in order or not. If anything, the compression program is the random part, in that it uses predetermined patterns. Since some programs compress the same file better than others, the file is not "random"; what is random is how much it gets compressed. It is not the file's fault, it is the program's. Or to be fair it is both, and if it does not work the program is not compatible. It is not the 4K file's fault that the compression programs do not do a good job, since, as you said, there is no need to compress 2 characters, which is based on programmers' preconceived notions. ;) That is what I am trying to say: everyone is stuck to some degree doing the same thing, afraid of taking the next step. Everyone is trying to squeeze a dried-up lemon and putting more memory and processor power into doing so. It's like throwing more money at the problem, needing a bigger processor and more memory. Again, most won't agree, and I see xinix (hello) is upset with me and gives you thumbs up all the time. LOL. This reply is long enough; I will read the other game another time. You can try my method if it applies and see if that works. YEP ;) So try to beat 4.0k.
    21 replies | 607 view(s)
  • LawCounsels's Avatar
    2nd March 2021, 02:34
    Celebrating: for the first time ever, random files compressed a little smaller and reconstructed exactly the same (no further details will be provided until the separate encoder and decoder are done). Subject to eventual final confirmation with a separate encoder and decoder on 2 clean PCs.
    66 replies | 2168 view(s)
  • Gotty's Avatar
    2nd March 2021, 02:26
    Random files? You mean: more? I thought it was just a zstd file. So you have different files, right? That's good! I just wanted to ask you to double check the results with real random files. xinix also posted one. You also referred to random.org files in your earlier posts, so I believe you now have some test files with you. So what are the (preliminary?) results? What are we celebrating?
    66 replies | 2168 view(s)
  • LawCounsels's Avatar
    2nd March 2021, 02:03
    Sportman's provided random files (please leave any further requests for technical details until the separate encoder and decoder are done; none will be responded to at this time).
    66 replies | 2168 view(s)
  • Gotty's Avatar
    2nd March 2021, 01:46
    Are these inputs random files?
    66 replies | 2168 view(s)
  • LawCounsels's Avatar
    2nd March 2021, 01:34
    Hi, I have made good progress: the compressor coding is done, and I have now made the decompressor much simpler. At the moment the decompressor works directly from the compressor's internal variables and decompresses to exactly the same data. The only remaining work is to separate it into an encoder and a decoder. But do not underestimate the amount of 'mundane' tedious work involved. It will likely take a week to be able to do as requested. But we can and should already be celebrating. (It can now handle any size of input as well.) LawCounsels
    66 replies | 2168 view(s)
  • Gotty's Avatar
    1st March 2021, 20:39
    Gotty replied to a thread paq8px in Data Compression
    Does it happen around the same %? How does it crash? Blue screen?
    2337 replies | 624966 view(s)
  • Shelwien's Avatar
    1st March 2021, 02:58
    > am I supposed to remove/ sanction the 2 'dog>ate' in set #2 before blend them? In a similar case in PPM (called masked contexts), symbols seen in higher order contexts before the escape are excluded from the probability estimation. But in most CM implementations, submodels (based on contexts of different orders in this case) are handled independently, thus multiple contexts can provide predictions based on the same data. Overall the first approach is better, but it is also harder to implement in a CM, and CM provides better compression than PPM even without it.
    196 replies | 17159 view(s)
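    A rough Python sketch of that PPM-style exclusion idea, using the dog>ate example from the question. This is my own simplification, not Shelwien's or any PPM implementation's code.

      def lower_order_probs(lower_counts, excluded):
          """Symbols already seen in the higher-order context (before the escape)
          are masked out of the lower-order counts before normalizing."""
          filtered = {s: c for s, c in lower_counts.items() if s not in excluded}
          if not filtered:
              return {}
          total = sum(filtered.values())
          return {s: c / total for s, c in filtered.items()}

      # order-3 context "my dog" predicted "ate" and we escaped,
      # so "ate" is excluded from the order-2 "dog" estimate
      order2_counts = {"ate": 2, "jumped": 1}
      print(lower_order_probs(order2_counts, excluded={"ate"}))  # {'jumped': 1.0}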
  • Gotty's Avatar
    1st March 2021, 01:25
    I agree: unfamiliar. Actually there is a better word. It is called "unpredictable". See the wikipedia definition of randomness: "randomness is the apparent lack of pattern or predictability in events". See? "Apparent lack". It means that the file may have a pattern, you just don't find it or see it. From your viewpoint then the file is practically random. There is no way you can compress it. I also agree: "all files are random", but you need to add "... until you find some useful patterns". Otherwise this statement in itself is not valid.
    21 replies | 607 view(s)
  • Gotty's Avatar
    1st March 2021, 01:11
    Let's do another (different) approach for the second game. Let's say that you record the answers to my questions with ones ("yes") and zeroes ("no"). Here are my questions, and your answers:
    1. Is x >= 128? No
    2. Is x >= 64? No
    3. Is x >= 32? No
    4. Is x >= 16? No
    5. Is x >= 8? Yes (we are in 8..15)
    6. Is x >= 12? Yes (we are in 12..15)
    7. Is x >= 14? No (we are in 12..13)
    8. Is x >= 13? Yes
    Then it is 13. What did you record? No-no-no-no-yes-yes-no-yes -> 00001101. This looks like 8 bits. In other words, a byte... Hmmm... Let's look up this number in decimal. It's 13. Wow! Try it with other numbers. The recorded yes-no answers will always match the number's binary representation. Isn't that cool? Your yes-no answers and the binary format will be the same when you follow the above question structure. If the number is between 1-65536, that's 16 yes-no questions. Can you calculate the required number of questions somehow? Yes. The number of questions needed is always log2(n), where n is the number of possible symbols (the range). So log2(256) = 8 and log2(65536) = 16.
    Now think with me. A bit more concentration... Let's do the opposite. You thought of 8 bits (8 numbers, each either 0 or 1). Let's stay with the same example: let them be 00001101. How can I find out which 8 bits you have? I simply need to ask about the bits one after the other:
    Is the 1st bit 1? No
    Is the 2nd bit 1? No
    Is the 3rd bit 1? No
    Is the 4th bit 1? No
    Is the 5th bit 1? Yes
    Is the 6th bit 1? Yes
    Is the 7th bit 1? No
    Is the 8th bit 1? Yes
    Notice that again it's No-no-no-no-yes-yes-no-yes. It matches exactly with the bits (00001101) again. And that represents 13. You needed 8 yes-no questions, in other words: 8 bits.
    It's even more amazing when you have probabilities for the symbols. (Actually this is the universal case; it covers the above yes-no questions as well.) You can have questions like this (guessing again 13 from 1..256):
    Is it 1? (p=1/256) No
    Is it 2? (p=1/255) No
    Is it 3? (p=1/254) No
    Is it 4? (p=1/253) No
    Is it 5? (p=1/252) No
    Is it 6? (p=1/251) No
    Is it 7? (p=1/250) No
    Is it 8? (p=1/249) No
    Is it 9? (p=1/248) No
    Is it 10? (p=1/247) No
    Is it 11? (p=1/246) No
    Is it 12? (p=1/245) No
    Is it 13? (p=1/244) Yes
    Got it: it's 13. In order to find out the information content, you'll need to sum {-log2(1-p) for the "No"s and -log2(p) for the "Yes"es} over the above answers. Here is the result for each question:
    Is it 1? (p=1/256) No 0.005646563
    Is it 2? (p=1/255) No 0.00566875
    Is it 3? (p=1/254) No 0.005691112
    Is it 4? (p=1/253) No 0.005713651
    Is it 5? (p=1/252) No 0.00573637
    Is it 6? (p=1/251) No 0.005759269
    Is it 7? (p=1/250) No 0.005782353
    Is it 8? (p=1/249) No 0.005805622
    Is it 9? (p=1/248) No 0.005829079
    Is it 10? (p=1/247) No 0.005852726
    Is it 11? (p=1/246) No 0.005876566
    Is it 12? (p=1/245) No 0.005900601
    Is it 13? (p=1/244) Yes 7.930737338
    Sum those numbers and be amazed. It's again 8. (It's 13 questions with skewed probabilities carrying 8 bits of information content.) You can also start from the top: Is it 256?, 255?, ... 13?. Even though you had many more questions, the answer will still be 8. Try it. No matter what order you ask in, no matter how you slice up your range, it will always be 8 bits. See? The method or the number of symbols used is irrelevant: whether it's 8 bits or 1 byte (base 2 or base 256), it's still 256 symbols, and you always need 8 bits to represent a number from this range.
    Actually, whatever format you store anything in, it's always the same: the information content is how many yes-no questions you need to figure out the full content. This is the number of bits. You cannot do it in less - you cannot compress these 8 bits of information to be less. If you understand all of the above (even partially) you will have a better grasp: no matter how you transform your random files, no matter what approach you use, no matter how many symbols you use to represent your data, the information content will always be the same. (4K in your case.)
    21 replies | 607 view(s)
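    The sum asked for above can be checked directly; a few lines of Python, following the exact question sequence from the post, confirm it comes to 8 bits:

      import math

      bits = 0.0
      for k in range(1, 13):            # "Is it 1?" ... "Is it 12?" -> all "No"
          p = 1 / (257 - k)             # p = 1/256, 1/255, ..., 1/245
          bits += -math.log2(1 - p)
      bits += -math.log2(1 / 244)       # "Is it 13?" -> "Yes"
      print(bits)                       # ~8.0: exactly 8 bits, same as the halving questions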
  • Self_Recursive_Data's Avatar
    1st March 2021, 00:26
    Does the following help prediction? How, if so? I tried it one time but it didn't work. If we take 2 sets/orders of predictions to blend: SET1 my dog>ate my dog>ate + SET2 dog>ate dog>ate dog>jumped, am I supposed to remove/sanction the 2 'dog>ate' in set #2 before blending them? They are in the same location in the dataset, because order3 is found in order2 but not always the other way around. The dataset looks like (with to emphasize the matches): "], ], ". As we can see here, some of the order3 is order2.
    196 replies | 17159 view(s)
  • Gotty's Avatar
    1st March 2021, 00:17
    Not bad! So you figured out that you need to do halving to find the number. The correct answer is 8 actually, but your approach is good, so accepted ;-).
    21 replies | 607 view(s)
  • Gotty's Avatar
    1st March 2021, 00:16
    Stuck = no improvements. But we do have improvements over the last 10 years in every field (image, video, general), both practical and experimental. Are you following the news? :-) As time progresses, the leaps will be smaller and smaller - we are slowly but surely approaching the theoretical compression limit that is reachable in reasonable time. So don't expect anything "extraordinary". Also don't expect that one day we will invent a smart method to compress random data. I think that would count as a big leap, right? 2-digit ascii compression would be everywhere if that were a common format for sharing data. But no one shares data like that. So it's not in focus at all. You convert your binary file to a 2-digit file when you want to visualize the bits. Other than that it doesn't seem to be a useful format. It's just a waste of space. Do you use or see 2-digit files? Look, xinix also immediately converted it to binary. Why? That's the natural and compact container for such content.
    21 replies | 607 view(s)
  • Trench's Avatar
    28th February 2021, 23:11
    ---- OK, I just saw post #8. As explained in another post, all files are random: you can take any file and it will look random to you at face value. The program does not see it as random since it looks for certain patterns, and those patterns become "familiar" so it can compress them. Even with a dictionary size of 128 it still has limitations. If the dictionary size were 4 it would miss out on the other 124, which you would call random when it is obviously not; it is just not familiar to the program in terms of what to use. The computer does not know the difference between 1234 and 1z$T. You see the difference, but you are not doing it manually with your own dictionary in your head, and the computer does not unless you tell it. For people who talk technically, "random" should not be in the definition, but "unfamiliar" should be. And I stick by that, even if I can slip up and say it at times, being influenced by others. LOL
    Anyway, are you not stuck? Have you made leaps and bounds on compression, or did we all hit a wall for the past 10 years in seeing major improvements for a standard model of compression? Some programs have improved by a bit, but they are not practical, which is why they are not in general use. You help explain the comprehension barrier, but the issue still remains that I do not see any effort to improve 2-digit compression since everyone is focused on the standard, and it is something worth paying attention to. People can disagree or disregard it, but it's a new door to explore.
    You notice how a computer has memory? Well, it tried to copy the mind, and it is hard to keep track when memory is full of other things. LOL So I lost track of the 33,420 info. lol I disagree that both are not random, and I don't like that word. You cannot beat 4.0 KB; the question is, will you try to? From doubling the numbers 1 2 4 8, 16 32 64, 128 256:
    1. Is it between 1-128? (2/3 off) Yes
    2. Is it 8 to 32? (2/3 off) Yes
    3. Is it 8-16? Yes (8 9 10, 11 12 13, 14 15 16)
    4. Is it 11-13? Yes
    5. Is it 11-12? No
    6. The answer is 13.
    Which would mean 6 questions, maybe with some errors. It was the first thing that came to mind, so I could be wrong, but I did it fast. Faster, but inaccurate to a smaller degree. If you had said 7, I would have lost. LOL Will reply to the rest later.
    21 replies | 607 view(s)
  • hexagone's Avatar
    28th February 2021, 23:02
    Code is here: https://github.com/mathieuchartier/mcm/blob/master/Dict.hpp & https://github.com/mathieuchartier/mcm/blob/master/WordCounter.hpp Look at CodeWordGeneratorFast (Dict.hpp), especially WordCount::CompareSavings & Savings (WordCounter.hpp) for the ranking logic.
    15 replies | 715 view(s)
  • Shelwien's Avatar
    28th February 2021, 22:43
    There was this: https://encode.su/threads/482-Bit-guessing-game Humans are actually pretty bad at generating random numbers.
    21 replies | 607 view(s)
  • mitiko's Avatar
    28th February 2021, 22:40
    Thanks for the tests, but it's not quite optimized yet. I might have forgotten to bump up the block size, so it has done 10 blocks to compress enwik8 and it needs 10 dictionaries instead of 1. The LZ stage probably only generates overhead as well. I'm rewriting the ranking loop now, so it can do ranking with states and also save some CPU on hashtable lookups. I'm pretty busy with school but I'll try to drop it this week. mcm seems interesting; do you know what it uses as a ranking function? I'm a little bit lost in the code.
    15 replies | 715 view(s)
  • Darek's Avatar
    28th February 2021, 22:36
    Darek replied to a thread Paq8sk in Data Compression
    OK, I'll try. Somehow I have had some issues with my laptop recently - I've started enwik9 five times with paq8px v201, and even after 3-4 days my computer crashes...
    228 replies | 23678 view(s)
  • Gotty's Avatar
    28th February 2021, 21:47
    Second game: Let's say that you thought of a number between 1-256. I'd like to figure out that number. It could be any number in that range. Let's say it is 13. What do you think how many yes-no questions do I need to find it out? I encourage you to really try to find out the answer. It's free tutoring :-) Take advantage of the opportunity.
    21 replies | 607 view(s)
  • Gotty's Avatar
    28th February 2021, 21:39
    Please follow the game here: Think of a letter between A-H. That is: A, B, C, D, E, F, G, H (8 different symbols). Let me find out which it is. How many yes-no questions do I need to ask you in the optimal case (= fewest possible questions)? Let's say you think of "F". My approach: - Is it between A-D? - No. - (Then it is between "E"-"H". Let's halve that range then.) Is it between "E"-"F"? - Yes. - (I have two choices now: either "E" or "F".) Is it "E"? - No. - I got it! It's "F"! How many questions did I need? 3. Could I solve it in fewer questions? 2 questions? Is there a smarter way?
    21 replies | 607 view(s)
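    A tiny simulation of the halving strategy above (my own illustration, not from the thread); for 8 symbols it always takes exactly 3 questions, i.e. log2(8):

      import math

      def questions_needed(symbols, target):
          """Count yes/no questions when always halving the remaining range."""
          lo, hi, asked = 0, len(symbols) - 1, 0
          while lo < hi:
              mid = (lo + hi) // 2
              asked += 1                       # "Is it in the lower half?"
              if symbols.index(target) <= mid:
                  hi = mid
              else:
                  lo = mid + 1
          return asked

      letters = list("ABCDEFGH")
      print([questions_needed(letters, x) for x in letters])  # [3, 3, 3, 3, 3, 3, 3, 3]
      print(math.log2(len(letters)))                          # 3.0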
  • hexagone's Avatar
    28th February 2021, 21:31
    You should look at the dict encoding in https://github.com/mathieuchartier/mcm. It dynamically creates a dictionary of ranked words.
      mcm_sk64.exe -filter=dict -store enwik8
      Compressing to enwik8.mcm mode=store mem=6
      Analyzing 97656KB , 46591KB/s
      text : 1(95.3674MB)
      Analyzing took 2.098s
      Compressed metadata 14 -> 18
      Compressing text stream size=100,000,000
      Constructed dict words=32+11904+33948=45884 save=6956257+30898956+3574176=41429389 extra=0 time=0.064s
      Dictionary words=45884 size=376.186KB
      97656KB -> 59990KB 74774KB/s ratio: 0.61431
      Compressed 100,000,000 -> 61,430,639 in 1.38400s
      Compressed 100,000,000->61,430,671 in 3.544s bpc=4.914
      Avg rate: 35.440 ns/B
    15 replies | 715 view(s)
  • Gotty's Avatar
    28th February 2021, 21:25
    I don't feel that I'm stuck. I'm posting improvements quite regularly. Why would you say that we are stuck? Data compression is a technical challenge - not a philosophical one. So we address the issue from a technical point of view. If you think there is a better approach... show us. But it should work ;-)
    Your issue is that you would like to understand why compressing a 2-digit file is worse than simply converting it to binary. This is what you asked. This is what I explained. Regarding what the issue is - I think we are on the same page.
    No, it's not. The 33,421 bytes could be compressed to 4-5K; it's clearly not random. The 4096 bytes are random. From a (general) compression point of view these two files are totally different. In your head (and in my head) the binary file is equivalent to the 2-digit file (and that's absolutely true from our viewpoint), but they are not equal for the compression engines. Please read my explanation again above. It looks like my explanation didn't get through. I tried my best to make it clear - it looks like I was not very successful. Please read my posts again and tell me where it is difficult to understand. I'm ready to try to explain it better.
    No, they will never beat conversion. The entropy is 4096 bytes in each of your cases. That's how low you can go, no matter what. Regarding their information content, they are equal - you are right! But "technically" they are different: the 2-digit file needs to be actually compressed to get to (nearly) 4K, while the 4K file is already 4K - there is nothing to do from a compression point of view. You cannot beat 4K. That's the limit.
    What do you mean by "conventional excuses"? Is it that you hear similar explanations again and again, that random content is not compressible? Because it's not. You experienced it yourself. If you read most of the threads in the "Random Compression" subforum, you'll notice that most people try converting some random content to different formats, hoping that it will become compressible. Like yourself. It's the most natural approach, it's true. So everyone is doing it. But everyone fails. Do you think that there is some law behind it? There is. You still have the same information content (4K) in whatever format you convert the file to. If you convert it to any format that is actually compressible (like the 2-digit format), you just give file compressors a hard time, because they need to do actual compression to get down to near 4K. What they see is not a random file - the 2-digit format is not random.
    You mean 256 symbols and 2 symbols (not digits). Your intuition is correct: it's easier to work with 2 symbols than 256 symbols. This is how the paq* family works. But it does not mean that a 2-symbol file is more compressible than a 256-symbol file. From an information content point of view they are equal. It's surprising, I know. In my next post I'll give you an example; maybe it will help you grasp what information content is.
    21 replies | 607 view(s)
  • suryakandau@yahoo.co.id's Avatar
    28th February 2021, 21:19
    I'm just curious how much Silesia can be compressed using the new mixer context set in each predictor...
    228 replies | 23678 view(s)
  • suryakandau@yahoo.co.id's Avatar
    28th February 2021, 21:09
    The base of paq8sk is paq8pxd. @darek, could you test it on Silesia using the same parameters as you use when you test paq8pxd?
    228 replies | 23678 view(s)
  • xinix's Avatar
    28th February 2021, 20:24
    Does compression seem to be enabled now? I would like to test only the preprocessor.
    enwik8 lzma - 23.6 MB (24,750,737 bytes)
    enwik8.bwd lzma - 34.7 MB (36,438,877 bytes)
    ___
    enwik8 PAQ fp8 - 18.6 MB (19,566,727 bytes)
    enwik8.bwd PAQ fp8 - 29.9 MB (31,357,466 bytes)
    15 replies | 715 view(s)
  • xinix's Avatar
    28th February 2021, 19:37
    I tested this on enwik8: 95.3 MB (100,000,000 bytes).
    Processing lasted 5 hours on an AMD Ryzen 5 3600X.
    enwik8.bwd: 83.7 MB (87,829,234 bytes)
    15 replies | 715 view(s)
  • Darek's Avatar
    28th February 2021, 18:46
    Darek replied to a thread Paq8sk in Data Compression
    @suryakandau - which version is the base of paq8sk48 - paq8px or paq8pxd? OK, I know: paq8pxd is the base.
    228 replies | 23678 view(s)
  • Trench's Avatar
    28th February 2021, 18:26
    4096 in its converted state, which yes, I agree with. But the unconverted 33,420-byte file is also a mess, and compression fails to beat conversion. It does get much better results without the newlines. ;) But compression still loses to conversion from what I see, which is the main issue of this post: from 31.9 it compressed to 4.29 with bzip2, 4.11 with lzma2, 4.39 with p12, and 4.00 converted. I assume better results will show when the file is much bigger, since metadata is stored by the compressing programs, but the issue is: can it beat conversion? Maybe not. But it should be technically equal. So overall all your comments are right, but again that's not the issue. Can you beat 4.0 KB in theory, without conventional excuses? You cannot compress 256 digits as easily as 2 digits. Sure, 256 is faster, but 2 is more orderly. Take food for example: try chewing an entire apple in your mouth versus a fraction of it; the person with a fraction of an apple will do better. The topic was not about compressing files from binary format but binary digits. Everyone in this forum thinks too technically, which seems to be stuck in the Chinese finger trap, which is not helping matters and is progressing nowhere fast. But if everyone feels I am wrong... OK. Thanks again.
    21 replies | 607 view(s)
  • Gotty's Avatar
    28th February 2021, 17:09
    I'm not sure what you mean. You don't like the word "Random" when used for incompressible data?
    21 replies | 607 view(s)
  • mitiko's Avatar
    28th February 2021, 16:52
    Yes, it's a transform. The new ranking function is based on it being followed by entropy coding.
    15 replies | 715 view(s)
  • Gotty's Avatar
    28th February 2021, 16:21
    Additional explanation: Unfortunately most compressors will try finding longer patterns in the ascii files. And indeed there are patterns and they seem to be useful. They think that it's a text file and that looking for string matches is a good strategy. Unfortunately it's not. The best strategy would be to simply convert the file to binary. But examining this possibility is not programmed into them. So they go for the string matches... Why is looking for patterns not a good strategy in this case? (A simplified example follows.) Imagine that you are a compressor and you are given the binary file (4096 bytes). There are *very* short matches (max. 2 bytes only) and you'll see that these short matches won't help compressing the file. So you won't use them. Good. Or you can use them, but it still won't help compression. Now you are given the 2-digit ascii file. And you'll see that you have, for example, quite a few 8-character matches (since the bits are bytes now). Oh, that's very good - you say. Instead of storing those 8 bytes I will just encode the match position and length and I'm good - encoding the position and length is cheaper (let's say it's 2 bytes). So you win 6 bytes. And this is where you were led astray. In reality you encoded 1 byte of information in 2 bytes. That's a huge loss. And you thought you did well... So an additional problem is that when seeing the ascii files many compressors are led astray. They think finding string matches is a good strategy. They will not find, or not easily find, the "optimal" strategy (which is an order-0 model with equiprobable symbols).
    21 replies | 607 view(s)
  • Gotty's Avatar
    28th February 2021, 15:04
    And here is the answer: The original binary file (4096 bytes) is a random file; you can't compress it further. The entropy of the file is 4096 bytes. Any solid compressor will be around that 4096 bytes, plus they will of course add some bytes for the filename, file size and some more structural info. The winners here are the compressors that realize that the file is incompressible and add the smallest possible metadata to it. On the other hand, an ascii file consisting of 2 digits (representing bits) or 16 hexadecimal characters (representing nibbles) is not random. They need to be compressed down to 4096 bytes (from 32768 bytes (when bits) and 8192 bytes (when nibbles)). Their file size (32768 and 8192 bytes) is far from the entropy (4096 bytes). We actually need compression now, so file compressors will do their magic: they need to do prediction, create a dictionary, etc. So actual compression takes place. They need to get the file size down from 32768/8192 to 4096. So even if the content represents a random file, the actual content of these ascii files is not random: the first bit, for example, is always zero (being an ascii file). In the case of a 2-digit file the ideal probability of a '0' character is 50%, the ideal probability of a '1' is 50%, and the ideal probability of any of the remaining 254 characters is 0%. (That's why it's important to remove those newlines from xorig.txt - otherwise a compressor needs to deal with them.) A compressor needs to store all this information (counts, probabilities or anything that helps reconstruct the ascii file) somehow in compressed format, in addition to the actual entropy of the original file (which is 4096 bytes). That's why you experienced that the results in this case are farther from the ideal 4096 bytes. The winning compressor will be the one that realizes that the best model is an order-0 model with equiprobable symbols. The sooner a compressor gets there, the better its result will be - meaning the closer it will get to the desired 4096 bytes. So always keep a (random) file in its original form; translate the file to 2-digit or 16-char format only for visualization, and don't try compressing it.
    21 replies | 607 view(s)
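    A quick way to see the numbers in the explanation above (a sketch of my own, not from the thread; it generates its own random 4096-byte sample rather than using the files discussed):

      import math
      import os
      from collections import Counter

      def order0_entropy_bits(data: bytes) -> float:
          """Shannon order-0 entropy of a byte string, in bits."""
          counts = Counter(data)
          n = len(data)
          return -sum(c * math.log2(c / n) for c in counts.values())

      binary = os.urandom(4096)                                    # a 4096-byte random file
      ascii2 = "".join(format(b, "08b") for b in binary).encode()  # its '0'/'1' ASCII form, 32768 bytes

      print(order0_entropy_bits(binary) / 8)   # a little under 4096: ~8 bits per byte, nothing to gain
      print(order0_entropy_bits(ascii2) / 8)   # ~4096: 32768 characters at ~1 bit each

    In other words, the binary file is already at the 4096-byte limit, while the 2-digit ASCII file has to be actually compressed (ideally with an order-0, equiprobable-bit model) just to get back to it.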
  • xinix's Avatar
    28th February 2021, 14:12
    Thank you! Can this be used as a preprocessor?
    15 replies | 715 view(s)
  • mitiko's Avatar
    28th February 2021, 12:58
    Of course. There are lots of good ideas, differing in various ways. What I meant by "Since the creation of the LZ77 algorithm..." is that most programs out there use LZ77/78; they dominate the space of dictionary transforms. I'm only trying to explain my motivation behind the idea. I might be wrong in a lot of my statements, but it's intuition that drives me. I didn't know there were that many papers on the topic; I couldn't find many myself. I especially like section 4.8.1 of Przemysław Skibiński's PhD thesis, as it's the closest to what I'm trying to do. It makes me very happy that we've independently come to the same ranking equation. But I've improved upon it and I'm trying to develop the idea for patterns as well. I'm trying to stay within some reasonable time to compress larger files, so that it becomes feasible to use in the real world. Here's an exe, although I'm not sure all of the files are needed. Usage:
    BWDPerf.Release.exe -c fileName.txt -> creates fileName.txt.bwd
    BWDPerf.Release.exe -d fileName.txt.bwd -> creates the decompressed file
    I've hardcoded a max word size of 32 and a dictionary size of 256 for now.
    15 replies | 715 view(s)
  • Mauro Vezzosi's Avatar
    28th February 2021, 12:32
    A choice that I have made implicitly, which is not clearly visible in the source, is this: Suppose we have the input data "abcdefghi" and, across its positions, the following matching strings in the dictionary: abc, bcde, cde, d, efg, fghi, gh. There are 2 solutions with a two-step lookahead: (a)(bcde)(fghi) and (ab)(cde)(fghi). For simplicity I chose the first one encountered, but we could choose based on the length or age of the strings (bcde) and (cde), the number of children they have, the subsequent data, ... More generally, whenever we have a choice we have to ask ourselves which one to take.
    41 replies | 7635 view(s)
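    A small Python sketch of that two-step lookahead (my own toy code; treating single characters as fallback literals is an assumption not stated in the post). On the example above it picks the first solution, (a)(bcde)(fghi):

      def parse_two_step(data, dictionary):
          """Greedy parse with a two-step lookahead: at each position pick the
          token whose length plus the best next token's length is maximal.
          Single characters act as fallback literals (an assumption here)."""
          def matches_at(i):
              found = [w for w in dictionary if data.startswith(w, i)]
              return found + [data[i]]          # literal fallback

          def best_len(i):
              return max(len(w) for w in matches_at(i)) if i < len(data) else 0

          out, i = [], 0
          while i < len(data):
              token = max(matches_at(i), key=lambda w: len(w) + best_len(i + len(w)))
              out.append(token)
              i += len(token)
          return out

      dictionary = ["abc", "bcde", "cde", "d", "efg", "fghi", "gh"]
      print(parse_two_step("abcdefghi", dictionary))  # ['a', 'bcde', 'fghi']

    Ties (the "which one to choose" question above) would show up here as equal lookahead scores, where max() simply keeps the first candidate encountered.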