
Thread: A pattern in random data

  1. #31
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,257
    Thanks
    307
    Thanked 798 Times in 489 Posts
    What do you mean you encode random data as speech? Do you mean like saying "one zero zero one zero..."?

    Also, web crawl data is not random. It is highly compressible with ordinary compressors.

  2. #32
    Member
    Join Date
    Jul 2014
    Location
    Kenya
    Posts
    59
    Thanks
    0
    Thanked 1 Time in 1 Post
    To clarify what I was writing:

    I acknowledge that it is not possible to compress every file, because every compressed file must map to a distinct output sequence.
    On top of that, the output still has to carry all of the original information, so there is a mathematical lower bound on the length and content a file needs in order to hold it.


    By 'random/any' I mean the theoretical notion that any file you throw at the software gets the best possible chance to compress, and that if it does not compress, it is because compression is not mathematically possible for that file.

    To compress data, you give the CPU work to rebuild the file from a shorter description of its repetition.
    To compress a file at all, you first have to detect compressible structure - you need methods capable of finding it.

    My interest is to determine all methods of detecting compressible structure. For what is commonly called 'random' data, looking only for repeated 8-bit strings is very weak. You can try to fake-decompress what you can, look for variable-length strings, transform sections according to their content to try to break even by rewriting them in terms of information already prevalent in the file, or even build a custom hash with extra identification data, and more.

    Common 'random' data is well balanced, with no prevalent strings (for example, a roughly even percentage of every byte value from 00 to FF). That is exactly the situation the techniques above are meant for, as a prepping process before actual compression.
    They are suited by nature to attempting compression, but success is still not guaranteed.


    My point was to combine all methods, each applied to the data arrangements it can break even on, so that any file in the world dropped into the software has the highest opportunity to compress if compression is possible at all (taking as long as it takes, though with as many shortcuts as possible). Certain data arrangements require methods that common compression tools do not implement; they can certainly compress if those methods are programmed, and cannot without them.

  3. #33
    Member
    Join Date
    Jan 2013
    Location
    USA
    Posts
    28
    Thanks
    0
    Thanked 2 Times in 2 Posts
    You could have an encoder which uses 1-bit to say if the file begins with a 00, stripping the 00 if detected ... would compress 25% of all files, while increasing 75% of them by 1 bit.

    What I've found is some random files can be compressed, but by the time you really get down to very random data, it pretty much becomes `random chance` as to whether or not your algorithm can compress. Then you `accidentally` compress some files that happen to match the algorithm, while the rest don't. And you can't predict it any better than that.

    So `by chance` random compression is possible, but not controllable/predictable... making it not as useful.
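
    A minimal sketch of that flag scheme on bit strings (an illustration only; as the later replies clarify, the "00" refers to the first two bits):
    Code:
    from itertools import product

    def encode(bits: str) -> str:
        # 1-bit flag: '1' + rest if the input starts with '00' (net -1 bit),
        # otherwise '0' + input (net +1 bit).
        return "1" + bits[2:] if bits.startswith("00") else "0" + bits

    def decode(bits: str) -> str:
        return "00" + bits[1:] if bits[0] == "1" else bits[1:]

    # Over all 8-bit strings: every string round-trips, 25% shrink, 75% grow.
    strings = ["".join(p) for p in product("01", repeat=8)]
    assert all(decode(encode(s)) == s for s in strings)
    print(sum(len(encode(s)) < len(s) for s in strings) / len(strings))  # 0.25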

  4. #34
    Member
    Join Date
    Apr 2012
    Location
    Stuttgart
    Posts
    448
    Thanks
    1
    Thanked 101 Times in 61 Posts
    Quote Originally Posted by ImaginaryHuman View Post
    So `by chance` random compression is possible, but not controllable/predictable... making it not as useful.
    I'm really scratching my head at statements like this; what you say does not really make much sense. A "file" is never random. A "random source" is, but that is only a model, an abstraction, to allow discussion of the subject matter. Thus one clearly cannot compress "random sources" (with a uniform zero-order distribution, to be precise). A specific file is only a realization of such a source (hopefully, depending on how well that model fits). Among such files you certainly have files that you can compress under the model assumptions (the famous 1000 monkeys that generated Shakespeare by pure chance), while on others you have bad luck.

    But that is pretty much what the Shannon source coding theorem is about: it does not speak about "specific files", but about "compression on average". The "monkey drama" generated by pure chance is compensated by all the other junk the monkeys tend to produce, so you gain nothing.

    IOW, if you transform your data first, you are only using a different model assumption, making a different class of files "nicely compressible", but that is of course overcompensated by other files that are not compressible. If your transformation is invertible, it will turn some "bad files" (incompressible before) into "good files" (compressible), but at the same time turn some "good files" under the previous model into "bad files" under the new one. There is nothing to gain.

    However, that does not make such transformations useless if they fit the source data well. Think about the DWT or DCT transformations for imaging. They take some "bad files" (natural images) that are not very compressible as zero-order entropy data and make them compressible afterwards. What is usually ignored is that they also make a lot of data that was compressible before less compressible afterwards. Nobody cares about such failure cases because they do not represent the typical cases a human observer cares about. "A file compressor is a file expander with interesting failure cases."

  5. #35
    Member
    Join Date
    Jul 2014
    Location
    Kenya
    Posts
    59
    Thanks
    0
    Thanked 1 Time in 1 Post
    Quote Originally Posted by ImaginaryHuman View Post
    You could have an encoder which uses 1-bit to say if the file begins with a 00, stripping the 00 if detected ... would compress 25% of all files, while increasing 75% of them by 1 bit.
    A file starting with 00 is not 25% of all files, because there are 256 combinations of 00-FF, which makes it about 0.4%.

    Quote Originally Posted by ImaginaryHuman
    What I've found is some random files can be compressed, but by the time you really get down to very random data, it pretty much becomes `random chance` as to whether or not your algorithm can compress. Then you `accidentally` compress some files that happen to match the algorithm, while the rest don't. And you can't predict it any better than that.
    I thought of using a random file generator to try to produce the file you need, and there is no point unless you have a way to identify the exact file you are looking for. The data you give to identify that exact file - ideally as little as possible - IS your compressed file.
    Also, there is really no 'random chance'. Mathematics is bound to a definite structure, and for whatever your compression algorithm can do there is an exact formula for the exact percentage of all files it can target. If it is targeting the 1 bit at the beginning like you mentioned, you can determine that it will compress approximately 0.4% of files, for example.


    Quote Originally Posted by ImaginaryHuman
    So `by chance` random compression is possible, but not controllable/predictable... making it not as useful.
    I know that some methods of compression I have worked out, properly applied, can compress very specific strings of data to the smallest they can ever possibly get. This includes a math approach flexible enough to cover addition/subtraction of surrounding data (and ultimately to shorten to multiplication/division and more where possible), which suits certain types of data.
    You need to have an approach which suits the data you want to compress.

    If you wanted to make 'macros/templates' of situations, even based on 32 KB chunks/blocks - templates saying, for example, 'spot 4 is 00, spot 8 is 01, spot 12 is 02' - and had a whole collection of varying macros/templates to match against 'random data', you could do some compression.
    The number of macros/templates covering a 32 KB chunk must be small enough that the number/ID of the matching macro/template takes less than 32 KB to write down; otherwise, consider that a very, very large (exponential) number of macros/templates end up with IDs in the 32 KB length bracket - more than in all the brackets of 31 KB and below combined.
    Also, if you did try to build such an approach, a vast list of macros/templates for 32 KB chunks would make your compressor's EXE file enormous - generally very impractical, and unable even to load into RAM. If you keep the enormous macro/template list as a separate file, it will not compress well either, because that data will be balanced and show little prevalence. As appealing as it looks if you have many PB+ of hard drive space and want to go to larger chunk sizes in order to 'compress the random data' more successfully, it simply becomes very impractical.

    EDIT:
    Not to mention the amount of computation required, even optimized, for a 32 KB chunk - whether an enormous set of 'ifs' or continuous scanning against a versatile set of macros/templates - which again makes it impractical.

    EDIT:
    With 128-bit encryption it already takes an immense amount of time to produce every possible 16 bytes of data - fixing the IV to a certain key and incrementing the private key upwards - just for a single 16-byte block (even petabytes per second is minuscule against that).
    The 32 KB chunk example I wrote above is being very moderate in the macros/templates you would set up.
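
    A quick counting sketch of that last point (an illustration, not part of the original post): an ID able to name every possible 32 KB chunk is itself 32 KB long, so a template set with short IDs can only ever cover a tiny, structured subset of chunks.
    Code:
    CHUNK_BYTES = 32 * 1024
    distinct_chunks = 256 ** CHUNK_BYTES             # every possible 32 KB chunk
    id_bits = distinct_chunks.bit_length() - 1       # bits needed to index them all
    print(id_bits, "bits =", id_bits // 8, "bytes")  # 262144 bits = 32768 bytes = 32 KB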
    Last edited by SoraK05; 19th July 2015 at 20:46.

  6. #36
    Member biject.bwts's Avatar
    Join Date
    Jun 2008
    Location
    texas
    Posts
    449
    Thanks
    23
    Thanked 14 Times in 10 Posts

    compression of so called random data

    Here is way to compress some bit strings one bit or expand one bit.
    1 bit 2 strings down 0 up 2 2-bit strings
    2 bit 4 strings down 2 1-bit strings up 2 3-bit strings
    3 bit 8 strings down 2 2-bit strings up 6 4-bit strings
    4 bit 16 strings down 6 3-bit strings up 10 5-bit strings
    5 bit 32 strings down 10 4-bit strings up 22 6-bit strings
    6 bit 64 strings down 22 5-bit strings up 42 7-bit strings
    7 bits 128 strings down 42 6-bit strings up 86 8-bit strings
    8 bits 256 strings down 86 7-bit strings up 170 9-bit strings

    and so on: at 9 bits, 170 strings go down one bit and 342 go up. As more lengths are added, it stays close to 33% of strings compressed by 1 bit and 66% expanded by 1 bit. This beats the 25% compression of the 00 method.

    Note: in the sequence 0 2 2 6 10 22 42 86 170, every two consecutive terms add up to the next power of 2; that way you can make more terms.
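
    To make those counts easy to reproduce, here is a small sketch (an illustration, not biject.bwts's code) that regenerates the 'down' column from the bijectivity constraint and shows the compressed fraction settling near 1/3:
    Code:
    # down(n) = number of n-bit strings mapped one bit shorter.
    # Bijectivity forces down(n) = 2^(n-2) + down(n-2), with down(1)=0, down(2)=2.
    def down_counts(max_bits):
        d = [0, 0, 2]                       # index 0 unused
        for n in range(3, max_bits + 1):
            d.append(2 ** (n - 2) + d[n - 2])
        return d[1:]

    counts = down_counts(20)
    print(counts[:9])              # [0, 2, 2, 6, 10, 22, 42, 86, 170]
    print(counts[-1] / 2 ** 20)    # ~0.333 of 20-bit strings compress by one bit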
    Last edited by biject.bwts; 20th July 2015 at 05:57. Reason: got some beer so here goes

  7. #37
    Member
    Join Date
    Jul 2014
    Location
    Kenya
    Posts
    59
    Thanks
    0
    Thanked 1 Time in 1 Post
    BTW, one last note: from my research into methods of compression and how they compress specific data to the smallest, and in light of what biject.bwts has said about trading one file for another, the aim of the research is to get files down to the smallest possible size and representation the methods allow (losslessly). I know one can scale for space/time/bandwidth, but starting from the 'smallest possible output' is my approach, so I can learn to scale more intelligently from that point.

  8. #38
    Member
    Join Date
    Jan 2013
    Location
    USA
    Posts
    28
    Thanks
    0
    Thanked 2 Times in 2 Posts
    Quote Originally Posted by SoraK05 View Post
    A file starting with 00 is not 25% of all files, because there are 256 combinations of 00-FF, which makes it about 0.4%.

    You misunderstood, I was referring to BITS not bytes. 00 or 01 or 10 or 11 is 4 combinations, thus 25% of them are 00.

  9. #39
    Member
    Join Date
    Jan 2013
    Location
    USA
    Posts
    28
    Thanks
    0
    Thanked 2 Times in 2 Posts
    Quote Originally Posted by biject.bwts View Post
    Here is way to compress some bit strings one bit or expand one bit.
    1 bit 2 strings down 0 up 2 2-bit strings
    2 bit 4 strings down 2 1-bit strings up 2 3-bit strings
    3 bit 8 strings down 2 2-bit strings up 6 4-bit strings
    4 bit 16 strings down 6 3-bit strings up 10 5-bit strings
    5 bit 32 strings down 10 4-bit strings up 22 6-bit strings
    6 bit 64 strings down 22 5-bit strings up 42 7-bit strings
    7 bits 128 strings down 42 6-bit strings up 86 8-bit strings
    8 bits 256 strings down 86 7-bit strings up 170 9-bit strings

    and so on: at 9 bits, 170 strings go down one bit and 342 go up. As more lengths are added, it stays close to 33% of strings compressed by 1 bit and 66% expanded by 1 bit. This beats the 25% compression of the 00 method.

    Note: in the sequence 0 2 2 6 10 22 42 86 170, every two consecutive terms add up to the next power of 2; that way you can make more terms.
    For this to compress, would I be right in saying the data has to be somewhat skewed toward more occurrences of the shorter strings, like with Huffman?

  10. #40
    Member
    Join Date
    Jan 2013
    Location
    USA
    Posts
    28
    Thanks
    0
    Thanked 2 Times in 2 Posts
    One interesting thing you can do with random data that I'm looking at at the moment, is that you can pretty much ignore what the input data is and just randomize it. E.g. add a predictable/re-generatable pseudo-random value 0..255 to whatever the original byte was, modulo 256. Now you have a whole new set of data. You can now use the `generator` in multiple attempts to `create` data that is `more compressible`. For example you can make it increase the number of duplicate values by making multiple passes of randomizing until you get the best version. Or maybe create data that has as many duplicate values as possible, or as many 0's, or as many values below 16, or whatever. I've found this can help a lot, but still is generally not enough to `always compress every file`. So now the challenge is to create a random pattern generator which happens to create data which is more easily compressed.
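
    A minimal sketch of that idea as described (an illustration; zlib stands in for whatever back-end compressor is used, and the winning seed has to be stored with the output so the transform can be undone):
    Code:
    import random, zlib

    def best_mix(data: bytes, tries: int = 256):
        # Try several PRNG seeds; keep the transform whose output zlib likes best.
        best_size, best_seed, best_out = len(zlib.compress(data)), None, data
        for seed in range(tries):
            rng = random.Random(seed)
            mixed = bytes((b + rng.randrange(256)) % 256 for b in data)
            size = len(zlib.compress(mixed))
            if size < best_size:
                best_size, best_seed, best_out = size, seed, mixed
        return best_seed, best_out          # seed is None if no mixing helped

    def unmix(mixed: bytes, seed: int) -> bytes:
        rng = random.Random(seed)
        return bytes((b - rng.randrange(256)) % 256 for b in mixed)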
    Last edited by ImaginaryHuman; 21st July 2015 at 19:42.

  11. #41
    Member biject.bwts's Avatar
    Join Date
    Jun 2008
    Location
    texas
    Posts
    449
    Thanks
    23
    Thanked 14 Times in 10 Posts
    Quote Originally Posted by ImaginaryHuman View Post
    For this to compress, would I be right in saying the data has to be somewhat skewed toward more occurrences of the shorter strings, like with Huffman?
    All it shows is that about 1/3 of all bit files, no matter how long, can be compressed by one bit, while 2/3 of all files expand by one bit with this method. However, looking at the above method, the two possible 1-bit files double in length to 2 bits. The longer the input bit files, the closer a file picked at random comes to a 1/3 chance of being compressible.

    Obviously this is a lousy compressor. I think it's safe to say any general compressor that is good on real files would suck on files from random sources of data. Since, again, compression is nothing but trading one file for another: if you are trading a lot of random long files for shorter files, you are reducing how well you can compress useful data.

  12. #42
    Member
    Join Date
    Jan 2013
    Location
    USA
    Posts
    28
    Thanks
    0
    Thanked 2 Times in 2 Posts
    I presume then that it would take more than 1 bit of data to randomize the file's byte values in order to `turn it into` one of the files that can be compressed, given you probably need to do that 3 times (on average) to come up with a compressible file.

  13. #43
    Member biject.bwts's Avatar
    Join Date
    Jun 2008
    Location
    texas
    Posts
    449
    Thanks
    23
    Thanked 14 Times in 10 Posts
    Quote Originally Posted by ImaginaryHuman View Post
    I presume then that it would take more than 1 bit of data to randomize the file's byte values in order to `turn it into` one of the files that can be compressed, given you probably need to do that 3 times (on average) to come up with a compressible file.
    The example was for bit files, where you compress 1/3 of files and expand 2/3 of files by one bit. Following the same procedure, if you allow one bit of compression and several bits of expansion, you can compress very close to 50% of files by one bit.

    Sadly, when you move to byte files it is much worse. With bit files, the number of new files at each length is twice the number of files that are exactly one bit shorter, which is what makes 50% of files compressed the limit. With byte files there are exactly 256 times as many files N+1 bytes long as there are N bytes long. That puts a hard limit of 0.390625%, much less than one percent.

    Of course you could get lucky and compress a so-called random file on the first try. If you have time you could also tune it to compress almost any particular so-called random file. The chance is not zero, but so what! The compressor you make would likely be useless for real files of value.
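
    A small counting sketch of that byte-file bound (an illustration): the fraction of N-byte files that any invertible scheme could map to a strictly shorter byte file comes out just under 0.4%, in line with the figure above.
    Code:
    # (number of byte files shorter than N bytes) / (number of N-byte files)
    def shorter_fraction(N: int) -> float:
        shorter = (256 ** N - 1) // 255     # count of all files 0 .. N-1 bytes long
        return shorter / 256 ** N

    print(shorter_fraction(4))              # ~0.00392, i.e. just under 0.4%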

  14. #44
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,257
    Thanks
    307
    Thanked 798 Times in 489 Posts
    Now you're talking about the BARF universal recursive compressor.
    http://mattmahoney.net/dc/barf.html

  15. #45
    Member
    Join Date
    Apr 2009
    Location
    The Netherlands
    Posts
    85
    Thanks
    7
    Thanked 18 Times in 11 Posts
    I think there is a way to compress most random data. And I know my entropy

    A while ago I read a book about data compression stating that in theory most of the data that exists in the universe is not compressible. That may be so, but why do I rarely see real random data? Why is it so hard to create real random data?

    Let's take phi as an example: if you try to compress the number phi in a classic way, you will find it is incompressible. If the algorithm is smart enough to see the pattern behind the pattern, it will be able to describe phi and therefore compress the data.

    Another example: the million digits file. I have read somewhere that the set was made by mixing together 10,000 different simple algorithms. If an algorithm is smart enough to detect the subproblems and solve the equations, it might be able to produce a program that reproduces the file while the description of that program is smaller than the resulting dataset.

    Just like Bulat already said: it's not about the output, it's about the source. How much information is needed to describe an algorithm/program reproducing the data you see?

    The current AI algorithms are not capable of solving such problems. Maybe by accident, but not by design. They all dive into the nearest valley to satisfy their loss functions while not caring about the reasoning behind the data.
    Universal intelligence as described by Marcus Hutter would solve this AI problem, but unfortunately it's just trying an infinite number of valleys.

    So what's next?
    I personally believe in genetic programming as a way to solve these problems. It should not have a classic loss function, but something new that can tell the algorithm that a newly found pattern does not solve the problem itself but has interesting properties. That might lead to the conclusion that an unknown subproblem has been solved. A group of solved subproblems eventually leads to a better subproblem, and a tree structure of those subproblems eventually leads to a solved problem. The trick is to find a solution that involves the smallest tree of subproblems. The smallest description of the problem is the most likely solution.

    And that again is very close to what Marcus Hutter tells us. The difference is that we don't have infinite time and space on our hands, so we have to find a way to divide the big problem into smaller problems. But how do we detect the subproblems?

  16. #46
    Member RichSelian's Avatar
    Join Date
    Aug 2011
    Location
    Shenzhen, China
    Posts
    174
    Thanks
    20
    Thanked 62 Times in 31 Posts
    different people/programs have different views of "random".
    for example:
    Code:
    00000000: e6 95 b0 e6 8d ae e5 8e 8b e7 bc a9 0a
    You may think this is a random sequence and cannot be compressed. In fact these bytes are UTF-8 encoded Chinese characters meaning "data compression"; we can compress them by converting to the GBK encoding:
    Code:
    00000000: ca fd be dd d1 b9 cb f5 0a
    So there is no truly "random" data in the world. If you have enough knowledge, you see that everything has its rule/pattern.
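
    For anyone who wants to reproduce the example, the recode is a couple of lines of Python (a sketch, not RichSelian's tool):
    Code:
    utf8 = bytes.fromhex("e695b0e68daee58e8be7bca90a")
    text = utf8.decode("utf-8")          # the four characters meaning "data compression", plus a newline
    gbk = text.encode("gbk")             # 13 bytes -> 9 bytes
    print(len(utf8), len(gbk))           # 13 9
    print(gbk.hex(" "))                  # ca fd be dd d1 b9 cb f5 0a
    assert gbk.decode("gbk") == text     # losslessly reversible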

  17. #47
    Member
    Join Date
    Apr 2009
    Location
    The Netherlands
    Posts
    85
    Thanks
    7
    Thanked 18 Times in 11 Posts
    Quote Originally Posted by RichSelian View Post
    different people/programs have different views of "random".
    for example:
    Code:
    00000000: e6 95 b0 e6 8d ae e5 8e 8b e7 bc a9 0a
    you may think this is a random sequence and cannot be compressed.
    In your example I see a lot of hexadecimal 'e' values and also the leading zeros. It is already compressible.

    But if you 'deflate' it to a version without the obvious quick wins, the smallest program to describe the outcome might well be the outcome itself. So in theory there are many examples of uncompressible data in the world. They are in fact not random at all, and may even be quite readable. But they are just a description of something with no known way to describe it with less information. (And there we have Kolmogorov complexity.)

  18. #48
    Member RichSelian's Avatar
    Join Date
    Aug 2011
    Location
    Shenzhen, China
    Posts
    174
    Thanks
    20
    Thanked 62 Times in 31 Posts
    Quote Originally Posted by Kaw View Post
    In your example I see a lot of hexadecimal 'e' values and also the leading zeros. It is already compressible.

    But if you 'deflate' it to a version without the obvious quick wins, the smallest program to describe the outcome might well be the outcome itself. So in theory there are many examples of uncompressible data in the world. They are in fact not random at all, and may even be quite readable. But they are just a description of something with no known way to describe it with less information. (And there we have Kolmogorov complexity.)
    We say something is "random" because we cannot predict it. If we were smart enough, could we generate the whole world from the first atom of the big bang, or compress the whole world into an atom?

  19. #49
    Member
    Join Date
    Jan 2013
    Location
    USA
    Posts
    28
    Thanks
    0
    Thanked 2 Times in 2 Posts
    I agree that there is no such thing as `completely random` data. There is always some meaning, although it might be very subtle and very hard to identify. Maybe we should just call it `highly random` or `highly unpredictable`.

    That said, good luck coming up with an algorithm that actually works, especially one that always works for all files.

  20. #50
    Member
    Join Date
    Jul 2014
    Location
    Kenya
    Posts
    59
    Thanks
    0
    Thanked 1 Time in 1 Post
    From my research on collecting methods for compression, there are certain elements one can use. This is by no means all, but this is general.

    *This is to reiterate for ImaginaryHuman that you can only compress what a computer is able to compress with its available elements, and that the output must still be a data sequence that mathematically retains every part of the original (meaning 'any/all' is impossible).*

    Data consists of 0 and 1 bits and can be of any length. A computer is capable of mathematics (addition/subtraction and other bit manipulation) and able to read/write at specific locations. This includes all the logical and arithmetic operations the CPU provides.
    Based on this, and on the concept of looking for some form of repetition/prevalence to record in fewer bits and have the CPU transform back, you are able to use these elements.


    There is a hierarchy of elements and methods which are interlinked.

    You can look for strings as is (variable/static), measure the gaps between 0 and 1 bits, and nCr ('ID' of the data) as a base of reading data. You can also look to seek specific areas based on 2 bit patterns of 00/01/10/11 and shift left/right depending on the string sequence, and other seeking methods depending.

    From here, you can apply mathematical formulas (addition/subtraction) in conjunction with distance, and then also check for elements regarding distance (things in a row, or a variable/static length apart). You can apply what I call a 'circumstantial' approach, using processed data to find a pattern that can be sought like an emulator with a database of 'ifs' on each coded result. You can apply static/variable approaches to reading the data in different orders (right to left, every odd position, custom, etc.). It is possible to expand the nCr ('ID') approach, for example by creating a custom hash with additional information from which the file can be rebuilt (with minimal brute-force determination). It is also possible to 'implant' data, like a PPF patcher which can overwrite/implant data, and to combine that with the other elements such as distance.
    All of the above I consider one 'realm' of the process.

    The data must then be coded, using bits of 0 and 1. This coding process is used in conjunction with the above as the data required for further processing, and is interlinked as another 'realm'. You can use static-length codes, variable-length codes, or 'incremental' codes from 1 to n bits together with a log of each code length, compressed as a companion. You can also apply mathematical formulas during the encoding, or the circumstantial approach, or distance-based elements, to create more complex coding, as well as add multiple code trees, each identified and suited to specific sections of the data. There is also the notion of compressing the headering where possible.

    Another 'realm' is the process of 'reversing the coded/compressed data'. This is linked to the notion of scaling for space/time/bandwidth. The 'random'-looking data which is generally balanced across all byte values 00-FF and well dispersed belongs to a percentage of the file spectrum. I like to think, tentatively, that 50% of the spectrum consists of all iterations of data using 1, 2 or 3 of the four 2-bit values 00/01/10/11, plus 'all compressible versions using all 4 values'. Compressing data with a weaker or a stronger compression scheme both produce output that looks 'uncompressible' and 'random'. With the weaker one, however, there is still enough repetition to latch onto: you can expand it and recompress it to something smaller, since its coding and spacing are not as tight as they could be.
    These kinds of arrangements can be searched for patterns, to attempt to fake-extract and rewrite sections <pseudo encryption> (dynamically, to accommodate what is in the file).


    All the basic elements enabled by the computer and its mathematical capability can be taken advantage of and interlinked dynamically, in an exact way whose boundaries can be predefined to keep scans minimal. You cannot compress data unless the computer and its elements allow a method for it, and unless you can break even after creating the header and coding (which requires a mathematical percentage of repetition/prevalence in the data).

    *
    Every method the computer can calculate has file sequences which it will compress to the smallest form possible (in the world). By using all methods together, as dynamically as possible and with predetermined boundaries to make the process more efficient, you give a file the highest possible chance of being squeezed as small as it possibly can be.
    *


    The point here is that some data arrangements will not compress unless a method that suits them is available. By having 'ALL' methods, applied as dynamically as possible, you (practically) squeeze whatever is available.
    I am not saying this is every method, and I have not determined all boundaries, although I know the number of bits in a section, its ratio and its length can play a part, as can placement and advance calculation of clashing methods, for example. I am working through all methods, and have in mind a theoretical formula which could confirm that all methods are accounted for (i.e. a percentage of data and a compression ratio, achieved by multiple methods together, which cohesively confirm that all methods are covered).
    I also theoretically allow for a 'fourth realm', which involves identifying every 'uncompressible' file left after all the above within the 'nCr ID' of the file spectrum, which could be taken advantage of for a final squeeze.


    Data has meaning - it has 0 and 1 bits, and a length. The file spectrum is entirely 'random', however, in that it accounts for every possible iteration of any file in the world of any length.
    Depending on the prevalence/repetition the computer and its elements can capture, that is what is available to be found, squeezed and 'rewritten' into an appropriate 'slot' of the file spectrum that does not itself compress - you need to transform a compressible string into an uncompressible string, and I theoretically presume there is a perfect formula for this correlation when all elements are accounted for.

    It is less about randomness or prediction, and more about what can be detected in the data you present once every element the computer can calculate with has been accounted for.

    Once all the above is collected, it becomes much simpler to determine how to scale for space/time/bandwidth constraints (in the world), and even, potentially, to recreate the compression results of other algorithms by entering custom values for all the possible elements/variables, since the software would have all methods available and dynamic.



    EDIT:
    Also to RichSelian.
    Comparing character encodings such as Chinese UTF-8/GBK requires some form of database for reference. The database takes space, which is in a way self-defeating.
    Compression without some kind of database to map very specific cases to something else relies on the mathematical approaches above.
    There is a way to generate a somewhat generic database (beyond what a code tree header does), but having something specific that maps a large set of files to something small defeats the purpose, since you need that data in the database.
    A database is not for general files; it is for specific, custom-tailored data, and it can be self-defeating given the size of the database relative to the usage of the data it references.

    There is no general algorithm for the Chinese/GBK transform; it is custom-tailored, using entirely custom-written correlations. Creating a 'database' of such custom correlations to accompany a code tree header, as in traditional compression, is self-defeating. Otherwise you need the methods (and elements) mentioned above to identify the Chinese data.
    Last edited by SoraK05; 22nd August 2015 at 15:36.

  21. #51
    Member
    Join Date
    Jan 2013
    Location
    USA
    Posts
    28
    Thanks
    0
    Thanked 2 Times in 2 Posts
    I did notice that SOME aspects of random data would be compressible IF other aspects were not present. Sometimes there are short repetitions. Sometimes values that ascend in order are interrupted by values which descend, but the ascending groups alone do compress. As always, though, representing the parts that don't compress typically takes more data than is saved. It's interesting, though: maybe there's a way to isolate or locate the 'compressible' parts and only deal with those. After all, just because we have a big file we don't have to treat it as one file; we can pretend it is 100 files, ignore the ones that won't compress, and work on compressing the ones that do.
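
    Something like the following sketch (an illustration; zlib as the stand-in compressor, 4 KB chunks chosen arbitrarily). The per-chunk flag and length fields are exactly the bookkeeping overhead Matt's reply below is talking about:
    Code:
    import zlib

    CHUNK = 4096  # arbitrary chunk size for the sketch

    def pack(data: bytes) -> bytes:
        # Store each chunk either raw or compressed, whichever is smaller,
        # behind a 1-byte flag and a 2-byte length.
        out = bytearray()
        for i in range(0, len(data), CHUNK):
            chunk = data[i:i + CHUNK]
            comp = zlib.compress(chunk, 9)
            flag, body = (1, comp) if len(comp) < len(chunk) else (0, chunk)
            out += bytes([flag]) + len(body).to_bytes(2, "big") + body
        return bytes(out)

    def unpack(blob: bytes) -> bytes:
        out, i = bytearray(), 0
        while i < len(blob):
            flag, n = blob[i], int.from_bytes(blob[i + 1:i + 3], "big")
            body = blob[i + 3:i + 3 + n]
            out += zlib.decompress(body) if flag else body
            i += 3 + n
        return bytes(out)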

  22. #52
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,257
    Thanks
    307
    Thanked 798 Times in 489 Posts
    The human brain is good at spotting patterns in random data. It doesn't mean anything. The overhead of representing which parts are compressible will cost at least as much as you save. You will save time if you work on problems that haven't been proven to be impossible.

  23. #53
    Member
    Join Date
    Jul 2014
    Location
    Kenya
    Posts
    59
    Thanks
    0
    Thanked 1 Time in 1 Post
    One method I found is to search for ranges of data that use 1, 2, 3 or all 4 of the 2-bit combinations.
    For a '2-combination' setup, for example, one can search for all data consisting only of 01 or 10, and if a string continuously has only 01 or 10 in a row, it is marked.
    Provided the runs are long enough to pay for recording them in the overhead, this can pick up a minimal amount of data to compress. All data belonging to a 1-, 2- or 3-combination pattern with enough minimal length can always compress, for example. For the 4-combination case it is more hit and miss, depending on enough prevalence in a minor percentage of it; being exponentially bigger, the 4-combination case generally contains the 'uncompressible spectrum'.

    However, knowing the different methods to compress, one can go further than the 'in a row' scan and implement more approaches that very specifically isolate only known-compressible data. This is a technique I do consider: for example, finding a lot of repetition at a fixed distance and in a row (like every 20 spots, 200 times) when there is no other generally compressible area. One can use offset markings in the code tree, or even specific codes in the code tree, so that when reading left to right it switches between 'raw' and 'compressed data', with an associated length telling it when the next switch occurs.

    There are definitely ways to be more specific and isolate what you can (already in my notes).
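
    A rough sketch of how such a scan could look (an illustration of one reading of the method, greedy and simplified): walk the file as 2-bit symbols and report long runs that use at most three of the four possible values. A run over k symbols can be coded in about log2(k) bits per symbol instead of 2, so it only pays off once it is long enough to cover the markers for where it starts and ends.
    Code:
    def dibits(data: bytes):
        for b in data:
            for shift in (6, 4, 2, 0):
                yield (b >> shift) & 0b11

    def limited_runs(data: bytes, max_symbols: int = 3, min_dibits: int = 64):
        # Greedy scan for maximal runs whose 2-bit alphabet has at most
        # `max_symbols` distinct values; a real scanner could back up at breaks.
        stream = list(dibits(data))
        runs, start, seen = [], 0, set()
        for i, s in enumerate(stream):
            if s not in seen and len(seen) == max_symbols:
                if i - start >= min_dibits:
                    runs.append((start, i - start, sorted(seen)))
                start, seen = i, set()
            seen.add(s)
        if len(stream) - start >= min_dibits:
            runs.append((start, len(stream) - start, sorted(seen)))
        return runs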


    The thing is, ImaginaryHuman, even though with my notes you can compress files much more than is currently done (even with basic elements like variable strings, clones, in-a-row stuff), doing so requires a lot of computer resources. It may seem redundant to have a tool that compresses a common 500 MB file, which currently ends up as 200 MB, down to 30 MB if it takes much longer to do and needs a lot of free storage space to assemble the result perfectly.
    I am one for having a large storage buffer, doing what is required to get a compact result, and leaving it running as long as it takes (with no counters/GUI in the software, to reduce processor time by a large factor). Video and audio, for example, can compress much smaller using some of the material in my notes, but this is considered impractical (even though I tend to disagree, given the one-off 'initial compress time' for an easily decompressible result that is a lot smaller).

    If one were to go to a lower level like ASM to make a genuinely fast compression algorithm using some of the material from the notes, there is already enough there to compress data a lot more than common compressors do (to degrees that may come off as surprising).
    That is a must to get the most out of the limited benchmarks out there (and out of what is mathematically and scientifically possible).

    The overheads must be considered in all possible techniques and methods to get what you possibly and technically are able from the general file spectrum. Expectations from common everyday files however is something else, and may be considered redundant to go to such lengths (unless it is generally encrypted or somewhat obscure data).

  24. #54
    Member
    Join Date
    Nov 2014
    Location
    California
    Posts
    179
    Thanks
    61
    Thanked 51 Times in 40 Posts
    SoraK05,

    Matt gave a definitive answer: "The overhead of representing which parts are compressible will cost at least as much as you save."
    Let that sink in before you spend any more time on this.


  25. #55
    Member
    Join Date
    Jul 2014
    Location
    Kenya
    Posts
    59
    Thanks
    0
    Thanked 1 Time in 1 Post
    I have been spending time on approaching all methods there are.

    I am aware of including the overheads. I am even aware of looking to minimize the overheads by compressing the overheads themselves where possible, and in multiple ways.
    I am aware of this, and my response already included all these things as 'the best you're going to get' - and in day-to-day files, the only data of this type you are likely to encounter is encrypted or otherwise obscure data anyhow.

    For other types of data, all this 'squeezing where possible' just gives a file that can already compress an extra squeeze. The other files in the file spectrum are 'uncompressible' precisely because these methods don't work on them - they are defined as 'uncompressible' because none of the compression methods used to identify structure in the data works in the first place.


    It is a fundamental rule of compression. You have multiple approaches to compressing data based on the multiple operations of logic/arithmetic a computer has to offer, and based on the context of the data some will work, some will not even be able to be used from lack of detection/application, and some will absolutely not work. 'Uncompressible data' is technically the data that these methods will not work on (and is a part of the pie relating to what Matt wrote). It doesn't mean that the methods will not work on other data where it does actually suit, in the 'compressible range'.


    There is data that can benefit from methods that find 'what they can' where it is possible to break even, even in everyday data. I also mentioned efficiency in time/space and general day-to-day use as considerations. This can work on particular EXE files which as a whole are fairly 'uncompressible' with conventional methods, but where localizing specific sections can compress further and break even, and likewise on some DLL/DAT data, as examples.

    For someone looking to compress things to the smallest, all these things will add up.
    Last edited by SoraK05; 6th October 2015 at 16:44.

  26. #56
    Member
    Join Date
    Jan 2014
    Location
    Bothell, Washington, USA
    Posts
    697
    Thanks
    154
    Thanked 186 Times in 109 Posts
    Quote Originally Posted by SoraK05 View Post
    It is a fundamental rule of compression. You have multiple approaches to compressing data based on the multiple operations of logic/arithmetic a computer has to offer, and based on the context of the data some will work, some will not even be able to be used from lack of detection/application, and some will absolutely not work. 'Uncompressible data' is technically the data that these methods will not work on (and is a part of the pie relating to what Matt wrote). It doesn't mean that the methods will not work on other data where it does actually suit, in the 'compressible range'.
    It is a fundamental rule of compression that if you find a method that works (on average) for a type of data then the data is not random.

  27. #57
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,593
    Thanks
    801
    Thanked 698 Times in 378 Posts
    Quote Originally Posted by SoraK05 View Post
    For someone looking to compress things to the smallest, all these things will add up.
    You are right: your method will compress every 256th file by 1 byte, and there are a lot of people who will be glad to spend hours waiting for such a result.

  28. #58
    Member
    Join Date
    Jul 2014
    Location
    Kenya
    Posts
    59
    Thanks
    0
    Thanked 1 Time in 1 Post
    It is about the unbiased science of what compression can do, the methods available, and what can be done with them (ideally as a contribution).

    EDIT:
    Also, the method being targeted recently is that of 'compressing portions' and having a mix of compressed/raw data. This is something to consider along with the other methods. I am just thinking about what can be done with data in different arrangements given the instruction set of a PC, as well as time/space/bandwidth, etc.

    It is easier to see what goes on when all is written.

  29. #59
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,593
    Thanks
    801
    Thanked 698 Times in 378 Posts
    The result is easy to predict using simple math: if the compression method doesn't use specific data properties, it can compress by N bits with probability 2^-N. Unfortunately you know neither math nor programming, so you can't check it yourself in any way.

  30. #60
    Member
    Join Date
    Jul 2014
    Location
    Kenya
    Posts
    59
    Thanks
    0
    Thanked 1 Time in 1 Post
    What I know is this: just as there is a percentage of the file spectrum that compresses as a whole file when you target strings and data in a row, or a larger percentage when you also target data in chunk lengths, if you add the ability to intentionally search for 'compressible segments' containing these data forms and leave the rest 'raw', that has a percentage as well. If you allow the coding to mix 'static' and 'dynamic' length codes within a file to suit the data patterns, you also increase the total percentage of the spectrum you can compress.

    I know that by increasing the number of methods, in varying combinations with each other, you ultimately increase the percentage of data you can compress overall, with all the methods complementing one another. Each method and each combination contributes its own percentage, with some relation to CPU steps and its nature/complexity.

    I am familiar with pseudocode and basic programming, and I have had some experience with math just so you know.
    With all elements written, it wouldn't be a very strenuous task to determine percentages, or even determine general boundaries/priorities where they meet (from what I can see).

    I think I've written quite enough.
    Last edited by SoraK05; 7th October 2015 at 09:10.


