
Thread: Ultimate one-click compressor ideas

  1. #1
    Member
    Join Date
    Jul 2014
    Location
    Kenya
    Posts
    59
    Thanks
    0
    Thanked 1 Time in 1 Post

    Ultimate one-click compressor ideas

I wrote up a concept for a one-click, systematic and automatic approach that analyses a file's contents and the prevalence available in it, then transforms the file into the smallest possible form by applying multiple algorithmic approaches concurrently. Something like an 'ultimate compression tool' whose options can also scale. It is based on the observation that different approaches to detecting and recording prevalence are more suitable for a 'smallest result possible' on some data streams than on others, and that they can work concurrently.
Still a WIP, with some work to go into the finer 'heavy' details.


There is also something I wrote earlier that can serve as a head start / background read.
http://encode.su/threads/2006-Hello-...e-an-algorithm
I have yet to include a means to scan for the 'heaviest' chunks/patterns that repeat 'in a row' at set gaps, along with the layered counting for that. However, it covers most of the information about the strings, including minimal variable-length string scans (as opposed to 8 bits plus frequency) for the smallest result.



    EDIT:
Regarding the compressible stream mentioned in that thread: I was working on a coded structure for 3 patterns that was inefficient, and this writeup has a coded structure that always compresses, with one pattern coded as 1 bit and the other two as 2 bits. It also mentions some general expectations for patterns and slots.
Attached Files
    Last edited by SoraK05; 14th September 2014 at 21:57.

  2. #2
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,257
    Thanks
    307
    Thanked 797 Times in 489 Posts
    I suggest you analyze some actual benchmark files like enwik9 or Silesia and estimate what the compression will be using your ideas. It probably won't be as good as existing algorithms like zip, but I would be happy if you proved me wrong.

  3. #3
    Member
    Join Date
    Jul 2014
    Location
    Kenya
    Posts
    59
    Thanks
    0
    Thanked 1 Time in 1 Post
I made a sort of test file: a BMP of 8 colours that is mostly repetition, with its 56B header removed, to get an idea of what is possible with current tools like 7z/RAR/PNG. Of the 2.25MB image, PNG came to 5KB, RAR to 2KB, and 7z to 1KB (pretty much the same with and without the header). I've mentioned this in the past. These tools are known to trade compression for space/time; zip falls in the same bracket. They work in chunks and use variations on looking at patterns, such as each byte/8 bits plus frequency, and minimal checks for repetition of patterns in sequence, such as a sliding window.

The string method in the writeup looks for the minimal pattern via a constrained scan, unlike 8 bits plus frequency, and treats repetition in a row as its own category, with specific attention to capturing it as a number, among other things like internal repeats. It should also be specific to areas and process them individually, being careful about areas that are more compressible than others, as opposed to taking a global frequency result that can lose compression (some areas can compress where the file overall cannot; this is generally not how 7z/RAR/zip work, and they also process in chunks that can build up header data and grow the file, rather than using generous temporary storage).
The result without the header, using an internal repeat, should be about 50B, since a lot of repetition in a row can be recorded and encapsulated for the 8 colours repeating in the same pattern. The example image should also be a decent benchmark for catching repetition that occurs in chunks spaced a set distance apart rather than directly in sequence.


The idea behind the string approach was to get the smallest/best possible result from looking at strings and their representation (which can stream in time), and it is meant to be more specific than 7z/RAR/zip (as specific as it can be). There is nothing yet for patterns that repeat in succession a specific gap length apart, but with that added it should be fairly complete as a string-specific method (though with the counter layers it will generally not be streamable for video/audio on a lower-end CPU).
Coupled with looking at the gaps of the less frequent polarity, so that only a few entries record the distances between occurrences, this should cut a lot of data from files with a strong prevalence of 0s over 1s (or vice versa), as should nCr for the cases with more balanced prevalence, plus a means to decide which of these to use in which parts; together it should get a file down close to the bone. These are methods that each cater for the smallest result on specific 0/1 arrangements, and using them together concurrently, in a more optimal approach targeting the smallest possible result, should do a lot for files with prevalence of various kinds and arrangements.



Perhaps once the outline and general details are written, some attempt at coding can be made. I'm not much of a programmer, but on paper I can write decent theory.


    EDIT:
Another relatively crucial thing is finding clones: the equivalent of, after doing 8-bit/1-byte patterns plus frequency, moving on to any of those patterns that appear together, such as 16-bit/2-byte patterns plus frequency, and adjusting the coding to accommodate the clones that are worthwhile. This lets common patterns that appear in the file as clusters share one entry that represents the clone. It can also suit a more streamable solution for cluster-heavy data like video/audio, by capturing the clones before the 'check for common patterns each frame and use counters' stage.


Zip etc. does not look for clusters of common patterns to reduce them to a clone identity with a single representation per occurrence, which would make it more effective on things like video.


The approach is to put 'heavy constraints' on the elements so the result is the 'smallest possible written file all round' for the prevalent data in the original: minimal variable-length string patterns rather than 8 bits plus frequency, repetition in a row treated specifically and stored as a plain number, etc. This is in mind alongside multiple other approaches, each more suitable for smaller results on specific arrangements of 0/1.
    Last edited by SoraK05; 15th September 2014 at 21:18.

  4. #4
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,257
    Thanks
    307
    Thanked 797 Times in 489 Posts
An order-0 model on pairs of bits or even on bytes will not compress as well as zip. Theories mean nothing without experimental evidence to back them up.

  5. #5
    Member
    Join Date
    Jul 2014
    Location
    Kenya
    Posts
    59
    Thanks
    0
    Thanked 1 Time in 1 Post
Well, lacking programming skills at the moment (and perhaps some specifics on what to do technically), what I have is an outline for chunk-length scans (common patterns that appear a specific length apart and keep repeating at that set length, which can be common in video/audio frames) with scaling options for space/time, plus some information on how to decide which method (string / polarity+gap / nCr, e.g.) to use based on the ratio of 0/1 prevalence in the stretches identified by the 2-bit pattern area scans.

I'm going to take a break before compiling something more uniform into a document along with any extra details. This should generally be enough for a skilled programmer to work from.


Also, Matt Mahoney: I'm not a programmer, but calculating some basic examples by hand, like the repetitive constructed BMP I mentioned, which should compress to very little while 7z/RAR (and zip) and PNG leave it far larger, shows there is quite a lot more to capture than what these tools do; even strings alone with a variable code tree can produce fairly compact data. Multiple approaches are applied in a specific priority over the whole data, with properties that inherit from one another so the process can be automated, with a generally compact result as the target.
One can do these kinds of examples by hand and compare the result against these tools to see.

I would personally prefer to have a sound theoretical writeup together with some hand examples before approaching the programming. Anyone is welcome to code it.



    EDIT:
You can take a 2.25MB 1024x768 BMP of 8 colours, with 4 lines of each colour and that 32-line cluster repeating 24 times. It can reduce to the ~50B it can (and should) be from its strings and a variable code tree, as opposed to 1KB for 7z, 2KB for RAR and 5KB for PNG (the former two on ultra, and pretty much the same with and without the 56B BMP header; the ~50B figure I mention is without the header, making use of repetition 'in a row' as a number <the same one each time> and encapsulating it very simply with a repeat).
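As a quick way to sanity-check those ballpark figures, the pixel data described can be rebuilt and fed through stock DEFLATE/LZMA (a sketch only; the palette values are arbitrary assumptions, and only the repetition pattern matters):

Code:
import zlib, lzma

# Rebuild the 1024x768 24bpp pixel data described above: 8 colours, 4 scanlines
# per colour, the resulting 32-line cluster repeated 24 times (no BMP header).
colours = [(i * 32, i * 16, 255 - i * 32) for i in range(8)]  # arbitrary palette
scanline = lambda c: bytes(c) * 1024          # one 1024-pixel line, 3072 bytes
cluster = b"".join(scanline(c) * 4 for c in colours)
pixels = cluster * 24

print(len(pixels))                            # 2359296 bytes (~2.25MB)
print(len(zlib.compress(pixels, 9)))          # DEFLATE, the zip/PNG family
print(len(lzma.compress(pixels)))             # LZMA, the 7z family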


Apparently DEFLATE can get it to around 50B when the second pass compresses the initial DEFLATE result into patterns that fit into 5 bits each plus a header.
Considering the repetition in the colour makeup (for a minimal code-tree database), getting things like repetition in a row as a direct number, and how simple the result is to layer/encapsulate, it should be doable.
Attached Files
    Last edited by SoraK05; 16th September 2014 at 17:11.

  6. #6
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,257
    Thanks
    307
    Thanked 797 Times in 489 Posts
    Compressing artificial, highly repetitive data is not very interesting. Maybe you could calculate the compression ratio of calgary/book1 or enwik9 from the distribution of 2 bit sequences. I think it will be pretty bad, but you are welcome to calculate it yourself to prove me wrong. I'm not sure which would be easier, either counting the bits by hand or learning to code and writing a program to do it. But that's up to you. I don't think anyone else will be interested in trying out your untested ideas when there are already far better methods that have been implemented and tested.
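A minimal sketch of that order-0 calculation (assumptions: Python, a local copy of book1 or enwik9, and an illustrative function name; pure Python will be slow on enwik9 but is fine for book1):

Code:
import math
from collections import Counter

def order0_bytes(path, symbol_bits=2):
    # Ideal order-0 coded size, in bytes, of a file read as fixed-width symbols
    # (symbol_bits must divide 8, e.g. 2-bit pairs or 8-bit bytes). Model and
    # header overhead are ignored, so this is a lower bound for order-0 coding.
    counts, per_byte, mask = Counter(), 8 // symbol_bits, (1 << symbol_bits) - 1
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            for byte in chunk:
                for i in range(per_byte):
                    counts[(byte >> (i * symbol_bits)) & mask] += 1
    n = sum(counts.values())
    entropy = -sum(c / n * math.log2(c / n) for c in counts.values())  # bits/symbol
    return entropy * n / 8

# e.g. compare order0_bytes("book1", 2) and order0_bytes("book1", 8)
# against the actual gzip/zip size of book1.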

  7. Thanks (2):

Bulat Ziganshin (16th September 2014), SvenBent (18th September 2014)

  8. #7
    Member
    Join Date
    Jul 2014
    Location
    Kenya
    Posts
    59
    Thanks
    0
    Thanked 1 Time in 1 Post
The point is to target specific data and compress it using the methods most suitable for the smallest result on that data. The 2-bit scan is there to determine areas to process individually; methods such as strings/polarity+gap/nCr/math, plus the position-based ones like clones / in a row / chunk lengths, are then applied.
The data is not processed as 2-bit symbols; the 2-bit scan only determines the areas.

I know there is still math recursion to add, namely ADD/SUB and MUL/DIV (for shifting bits left/right and giving a polarity); having math recursion can help in scenarios such as chunk lengths where the length decrements by a stable equation like length--.


The point is to find prevalence wherever it exists, Matt Mahoney, and not just on artificial repetitive data. It is to do thorough scans to determine the prevalence in the file and its positions, compress each segment in the most suitable manner based on its makeup, and put the result together.

The example above of highly repetitive data is there to illustrate how common compression tools handle such data. This tool should effectively reduce repetition in a row, such as a 1GB file of 0 bits, to a string plus a run count, which is generally the smallest form. If there is a stray 1 bit in there, it should use nCr to get it to the generally smallest form, the nCr iteration ID.
In a text file it can develop the minimal strings to represent things (variable bit lengths), find clones of clustered common letters/patterns, find any repetition in a row (unlikely there), find chunk-length repetition (more likely, if words/letters appear at common distances), and fall back to other methods based on the makeup and prevalence of the 0/1 data.

A 1GB file of 0 bits is generally not going to compress to less than a '0' plus a count of '1GB in a row'. The same generally goes for nCr when a single 1 bit is in there. These are approaches suitable for a smallest result, and the purpose is to have the file scanned and these separate algorithms used concurrently in a one-click approach.
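As a rough check on the 'nCr iteration ID' idea: identifying one arrangement of k ones among n bit positions takes about log2 C(n,k) bits, before the cost of storing n and k themselves. A small sketch (the function name is just for illustration):

Code:
import math

def ncr_id_bits(n_bits, k_ones):
    # Bits needed to name one arrangement of k_ones set bits among n_bits
    # positions (the "nCr iteration ID"); n and k still have to be stored.
    return math.ceil(math.log2(math.comb(n_bits, k_ones)))

print(ncr_id_bits(8 * 2**30, 1))   # 33 bits: the position of a single 1 in a 1GB file
print(ncr_id_bits(4096, 1))        # 12 bits
print(ncr_id_bits(4096, 2048))     # ~4090 bits: balanced data barely compresses this way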


There are no 'schemes' for text/images involved; it looks strictly for prevalence and treats it in the most suitable manner for the smallest result.

I'm not going to repeat myself. I don't plan to do this by hand, which would take long even for a small segment given all the possible methods, and there is more to add to the compression model in general; I am aware that it aims to be thorough about the prevalence it finds.

The 2-bit area scan determines stretches containing 1, 2, 3 or 4 of the possible patterns. The purpose is to separate the 1-3-pattern stretches from the 4-pattern ones, leaving the latter as potentially raw areas, and to be specific to offsets rather than fixed chunk lengths when marking data as compressible/raw; the process also allows an overall tree to suit the file and the contents of those areas. If a specific area is prevalent in 00 and you only look at frequency over the whole file, that is detrimental when 00 appears less often overall, including in the uncompressible areas. The file may not look compressible, yet if the area is targeted specifically it can be handled appropriately for its offset range. The same goes for looking at common 00 in other areas, so that the overall code-tree assignment suits its distribution and appearance in the compressible areas of the file rather than just a grand frequency.


If a single stretch in the file has a decent amount of repetition in a row (common in data files with runs of 00), possibly in multiple locations with the same run length sharing a common number entry for 'in a row', then looking only at frequency will miss it; it will not capture repetition in a row as a number as effectively as isolating that specific area with the 2-bit pattern scan, treating it for its overall 00 representation, and recording the run as a number. That is generally the smallest capture. These kinds of things add up to a 'smallest expected file'.


These are things one can read and discern without coding. With a piece of paper holding 0/1 entries and some room to work, one can see that multiple approaches lead to smaller results: repetition in a row as a number, minimal variable string lengths rather than 8 bits plus frequency, being specific to areas and handling them individually on the side, e.g.
This includes using a generous temporary mapping file to keep up with all the changes and then finally piece the results of all the components together into the final packed form.
    Last edited by SoraK05; 17th September 2014 at 15:11.

  9. #8
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,257
    Thanks
    307
    Thanked 797 Times in 489 Posts
    I suggest you read http://mattmahoney.net/dc/dce.html before inventing any more stuff that's already been done.

  10. #9
    Member
    Join Date
    Jul 2014
    Location
    Kenya
    Posts
    59
    Thanks
    0
    Thanked 1 Time in 1 Post
    Thanks.
The point is to look at prevalence wherever it is possible and treat it appropriately for the smallest result.
With this approach, things that repeat in a row, for example, get their representation as a number, which is the smallest form. Different prevalence suits different methods/approaches for a smaller result: strings are more suitable for some data, while other data ends up smaller with nCr, for example, even though both still compress and reduce the file.
The idea is to target prevalent data with the approach suited to its smallest form.

For example, the minimal variable-string process in conjunction with the area scan gives the smallest possible code tree for plain strings and their written form, and it complements the position-based parts like clone/cluster and the others for a more compact form overall.

I wrote a WIP text file of the whole thing, with details of what is already done.
Math needs some work, as do the parts at the bottom of this list.

DEVELOP SPECIFIC TARGET RANGES RATHER THAN PROCESS THE ENTIRE FILE
    MODE 0: AREA SCAN

    DEVELOP AN INITIAL PROPORTIONAL SHRINKING OF THE FILE
    MODE 1: STRINGS
    MODE 1: POLARITY GAPS
    MODE 1: NCR
    MODE 1: MATH

    TARGET PREVALENCE FROM POSITION, AND CAN INTERNALLY REPEAT
    MODE 2: CLONE/CLUSTER
    MODE 2: CHUNK LENGTH (INC. MATH)
    MODE 2: IN A ROW


    TO CONSIDER: [TO DO]
    MOVING/SHUFFLING DATA
    [LIMITED]
FAKE EXTRACTING, TREATING AS POORLY COMPRESSED AND RECOMPRESSING
FAKE ENCRYPTION TARGETING KEYS WHICH CAN PRODUCE PREVALENT RESULTS

    All have in mind the smallest representation suitable for the data layout.



    EDIT:
From my reading, something like the text preprocessing in PAQ doesn't make much sense to me with regard to minimal storage.
If the text file were processed for its string makeup (minimal strings, associated clones, any chunk lengths, and in-a-row runs found), that should technically be more specific to the file and its exact content, given this advance (lengthy/constrained) scanning. Any repeated dictionary words or phrases can be detected by the constrained minimal-string scanning and the clones; otherwise, the initial storage of a specific word or phrase would already require some minimal form, which would involve minimal string/clone/etc. anyway?

The same goes for the models I have seen for BMP/images, where the colours/data can technically be any file, since any combination of colours can technically make any possible file (beyond some things to decompress and recompress, or predefined chunk data <which the chunk-length scan can detect>; similarly for WAV/EXE and so forth?).

I figure it is more ideal to fit the file to the prevalence and data it actually has, over a lengthy and storage-generous pass, with the raw tools like the minimal string scan, chunk-length scan and so on.


Pi (approximated as 22/7) also has the prevalence of a repeated division process, and while it can be considered somewhat of an exception, if the content starts from the beginning and the math part is working (determining patterns in bit polarity/position that resemble ADD/SUB/MUL/DIV, and changes for odd/even), it should be able to pick up something of pi, among other kinds of math prevalence.

I also noticed, for the chunk-length scan, that there is a clash to iron out with entries that appear in a row and could be eligible for 'in a row' instead...

    EDIT:
There is also the notion of confirming the merging of acknowledged scanned areas where their type is common.
Still a WIP.



    EDIT:
Those two things I noticed have been included, and I added a couple of words.
When more is done, like the math, I can update.
Attached Files
    Last edited by SoraK05; 1st October 2014 at 20:23.

  11. #10
    Member
    Join Date
    Jul 2014
    Location
    Kenya
    Posts
    59
    Thanks
    0
    Thanked 1 Time in 1 Post
I noticed that while the initial temporary mapping file and methods look to minimize total entries and are effective in general, the coding of the entries was not so effective.
After looking at arithmetic coding, which saves bits by predicting the occurrence of an entry, and at the notion of data being suitable for multiple static code trees concurrently, for example, I concluded that given the layout of the data it is better to have multiple concurrent code-tree configurations throughout the file, matched to the placement of the data.

Multiple code-tree configurations suited to their areas, while keeping the entries to an appropriate minimum, will more than break even through shorter codes all round.
Using the static example: one area may suit static coding at 4 bits, and the area next to it may suit 3 bits. Giving each its own code-tree configuration, concurrently, reduces the size all round by being appropriate to the areas.

    TARGET APPROPRIATE CODING OF DATA
    MODE 3: STATIC / VARIABLE / INCREMENTAL CODE TREES
While I don't have a writeup for something truly dynamic at the moment, I have something relatively close.
Using the FRAME SCAN gives a shortcut breakdown of the general layout of the data and the configurations suited to each code type.
Something truly dynamic would require checking specific portions, entry by entry, to be more precise.
However, being much more specific to the file, with multiple code-tree configurations and multiple code-tree types, allows for more appropriate bit-saving coding.
This includes compressing the code-tree configurations and their placements, and merging configurations with common entries where they would otherwise be heavier (or else leaving them to their own distribution arrangement).

A more precise dynamic method remains to be done. This, however, should already be quite specific to the data layout, with multiple code-tree configurations throughout, including methods to merge and compress the configurations, which brings it closer to a precise dynamic result.


Math is not done, and the text above the coding part at the bottom has not been updated much besides a few words.
However, with a coding solution that is more specific and 'pseudo-dynamic', it should be able to code the entries of the minimal approach more suitably and efficiently.
Attached Files
    Last edited by SoraK05; 5th October 2014 at 19:50.

  12. #11
    Member
    Join Date
    Jul 2014
    Location
    Kenya
    Posts
    59
    Thanks
    0
    Thanked 1 Time in 1 Post
I did something more dynamic than using frames for the multiple code-tree configurations that suit the data arrangement, rather than a global code tree (whether forced or determined from criteria) for static/variable/incremental code strings.
This should suit the data layout by using multiple code-tree configurations of different types throughout the file, and be more specific to the data than frames are.


The point is to look for the content that any compression tool can target: detection of prevalence (in multiple ways) and its written representation (in multiple ways), so as to cover as much compressible data as possible and ultimately write it in the fewest bits, providing the means to detect and write with multiple approaches and to treat the file as it is.
Time and temporary storage are no object (though ASM is suggested); minimal bits is the objective. Variations to scale the approach, such as a sliding window, are optional toggles that can be added to the detection/writing components.


Arithmetic coding generally compresses data further than a global static/variable code tree for the file, since it predicts data placement and adapts to it. This approach should likewise be specific to the data arrangements, with multiple configurations throughout the file according to the different code types and their criteria, and with the configurations themselves reduced.
Attached Files
    Last edited by SoraK05; 27th October 2014 at 23:47.

  13. #12
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,257
    Thanks
    307
    Thanked 797 Times in 489 Posts
    Let us know when you have some benchmark results.

  14. #13
    Member
    Join Date
    Jul 2014
    Location
    Kenya
    Posts
    59
    Thanks
    0
    Thanked 1 Time in 1 Post

First I'm putting all the theoretical expectations and a functional process down on paper: approaches to capture prevalence and write it, with weights so it automatically stays efficient for a data stream all round.
There is still more to do, like the math, and shuffling data to suit priorities such as grouping in-a-row data, e.g.

In the meantime, before everything is on paper per se, if anyone wants to code the current selection of methods working concurrently, feel free.
I can look at programming once it is all done, and can perhaps tidy up loose ends here or there.

Most compression tools use components like the ones written here, although these aim to push each type to its limit.
For example, I don't know of any compression tool that specifically captures repetition in a row as a plain number, which can be substantial.

Having minimal strings, clones and in-a-row, plus a dynamic method to code with multiple code setups throughout, should already beat things like 7z. There is a lot more beyond this already written, including strings+gaps+nCr to pick at the bit layout of the data for a smaller written file.

Once chunk length gets its math update, I believe the setup should beat the majority of video codecs that use something like it, simply because the method captures whatever is available, with precision down to bits and offsets rather than bytes.

    It is available to give others ideas for possible approaches and to suit their requirements.

    I'll update when I add more to the (possible) multiple approach text.

Also, for something like a Hutter Prize test, which requires a specific timeframe and memory budget, where the most intense form for the smallest bits could fall outside the bracket of the PC used, some kind of scan could pick the most suitable processes to run based on the data and on the steps/room available, to give the smallest result for those specs.
That is perhaps also what the project is looking for: suiting certain specs, even though the result could perhaps be smaller if one went beyond them.
    Last edited by SoraK05; 31st October 2014 at 12:24.

  15. #14
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,572
    Thanks
    780
    Thanked 687 Times in 372 Posts
The simplest compression method, RLE, encodes the number of repeated items. So it seems that you don't know even the simplest compression method.

  16. #15
    Member
    Join Date
    Jul 2014
    Location
    Kenya
    Posts
    59
    Thanks
    0
    Thanked 1 Time in 1 Post
    http://en.wikipedia.org/wiki/Run-length_encoding
From the example, "WWWWWWWWWWWWBWWWWWWWWWWWWBBBWWWWWWWWWWWWWWWWWWWWWWWWBWWWWWWWWWWWWWW" becoming "12W1B12W3B24W1B14W" looks very much like getting repetition in a row as a number.

    Thanks.
This is not so common in compression tools, and I prefer to have repetition in a row as its own specific component alongside the others.
I hadn't looked at RLE specifically.

I'm looking at the specific components and methods for finding prevalence and recording it in the way that suits the fewest bits, which any compression software would target one way or another. For example, I would get a minimal representation of suitable strings from all the Ws and Bs (most likely a single W and a single B), each as its own code, then the run counts for wherever the Ws repeat in a row; the 3 Bs would probably get their own code.
There is also layering of components with priorities, to give something like repetition in a row as a number more opportunity to be effective in the temporary mapping data.

Anyway, first I look at the components, and then I can look at different ways to write them in the fewest bits possible. Some version of the RLE above could be more suitable for some data arrangements, and ultimately it could come down to incorporating multiple elements in the coding tree to merge their data and reduce the tree data and the entries in the file.
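For reference, a minimal RLE round-trip over the Wikipedia example quoted above (a generic sketch, not the coding proposed in the writeup):

Code:
import re
from itertools import groupby

def rle_encode(s):
    # Collapse each run of identical characters into "<count><char>".
    return "".join(f"{len(list(g))}{ch}" for ch, g in groupby(s))

def rle_decode(s):
    # Expand each "<count><char>" token back into a run.
    return "".join(ch * int(n) for n, ch in re.findall(r"(\d+)(\D)", s))

src = "W" * 12 + "B" + "W" * 12 + "BBB" + "W" * 24 + "B" + "W" * 14
enc = rle_encode(src)
print(enc)                    # 12W1B12W3B24W1B14W
assert rle_decode(enc) == src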

    EDIT:
Also, components such as repetition in a row should only be applied if an advance weight calculation shows the entries will reduce the file size.
The entry for the 3 Bs, beyond probably having its own string code, wouldn't get its own in-a-row count if treating it that way were less effective once the extra bits in the code tree / file are accounted for, for example.
    Last edited by SoraK05; 31st October 2014 at 23:53.

  17. #16
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,572
    Thanks
    780
    Thanked 687 Times in 372 Posts
My point is that you don't know anything about data compression and are trying to reinvent the wheel instead of learning the theory. Your idea isn't new, but it's useless on real data. That is why Matt proposed you try to implement it yourself if you don't believe us. Don't expect anyone else to implement it for you; the forum is full of much more practical ideas but short of people wanting to implement them.

  18. #17
    Member
    Join Date
    Jan 2014
    Location
    Bothell, Washington, USA
    Posts
    695
    Thanks
    153
    Thanked 183 Times in 108 Posts
    SoraK05,

While Bulat makes some important points, he is perhaps a little hard on you. I think some of that is a result of some rather bold statements you have made that are not consistent with the experiences and knowledge of people that are experts on the subject, much more so than you or me. In my experience as a newbie and outsider, it is best to provide working proofs of concept along with ideas. That way, key members will often be more receptive to helping teach the existing terminology, providing alternate ideas, or explaining concepts. I do see some similarities in our ideas, as Tree could be described (using your terminology) as a compressor that recursively looks for the most prevalent patterns and makes appropriate substitutions, so I think not all is lost if you are willing to take the time to learn the theory and terminology.

When Bulat says "real" data, he is not talking about uncorrelated binary data that is barely compressible, which seems to be something you are keying in on. He is talking about data that can be significantly compressed in a reasonable amount of time. So nobody cares about being able to compress a file that is generated by, let's say, flipping a coin a trillion times and writing the heads/tails results as 0/1. Most people will know the result is not very compressible (at best), so they are not going to even bother trying.

  19. #18
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,572
    Thanks
    780
    Thanked 687 Times in 372 Posts
Sorry, but there is nothing in common between an unusual compression idea and a lack of ideas. SK's work is like saying "you can make money by selling something". RLE, delta compression and entropy coding are a few very simple basic blocks of real compression, so attempts to reinvent them just demonstrate the level of SK's ignorance.

  20. #19
    Member
    Join Date
    Jul 2014
    Location
    Kenya
    Posts
    59
    Thanks
    0
    Thanked 1 Time in 1 Post
Prevalence in files fluctuates, and it can only be captured with a specific set of approaches.
Based on position, you have general patterns you can reduce in a proportional shrink, like a direct Huffman 'look at each byte and its frequency'; beyond that there are 'clusters', which are dispersed around the file, 'chunk length', where occurrences sit specific offset lengths apart (including math, where applicable, to add/subtract to the lengths), and 'in a row', where they sit next to each other.
Each one can be represented in the smallest form suitable for its type.

A Huffman 'each byte and frequency' pass will, for example, leave entire areas of the file that are 0s in a row as repeating proportionally coded symbols. Clusters/clones, where 'copies' of coded entries appear together around the file, will all be written out in full everywhere.

Recording repetition in a row as a number is the smallest possible way to capture that repetition.
Detecting all clones/clusters so there is one single 'version/copy' in the header, with a code to represent it, is the smallest possible way to capture that repetition, giving each occurrence a shorter code.


Where data entries are dispersed in a balanced way, say 10 entries of each symbol, you can use a static code tree of 4 bits per entry rather than something variable, which relies on frequencies falling off steeply, like 100, 70, 50, 30, 10.
This requires multiple code-tree approaches suited to the dispersion, including multiple code trees of the same type around the file: you can have hundreds of variable code trees in a file, all calculated to have the fewest and shortest codes for the fluctuations in each stretch, and then compress the code-tree information to reduce the header overhead.
Fitting the placement to each area, such as one stretch suited to 4-bit static, the stretch next to it to a variable tree of x entries by weight/frequency, the stretch after that to 2-bit static, etc., is something to be calculated precisely from the data arrangement rather than from a formula that predicts possibilities; ideally the multiple code trees fit the file and its stretches exactly.
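To make the static-vs-variable comparison concrete, here is a small Huffman code-length sketch (the helper name and sample frequencies are illustrative, not from the writeup): for the balanced case of 16 symbols at 10 occurrences each it lands on the same 4 bits per symbol as a static tree, while the skewed 100/70/50/30/10 case gives the frequent symbols shorter codes.

Code:
import heapq

def huffman_code_lengths(freqs):
    # Code length per symbol for a Huffman code built from a symbol->frequency
    # map, via the classic merge-two-smallest construction.
    heap = [(f, i, {s: 0}) for i, (s, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    tick = len(heap)  # tie-breaker so tuples never compare the dicts
    while len(heap) > 1:
        f1, _, a = heapq.heappop(heap)
        f2, _, b = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, tick, {s: d + 1 for s, d in {**a, **b}.items()}))
        tick += 1
    return heap[0][2]

balanced = {i: 10 for i in range(16)}     # the balanced case from the post
lengths = huffman_code_lengths(balanced)
print(sum(lengths[s] * balanced[s] for s in balanced))   # 640 bits, same as 4-bit static
skewed = {"a": 100, "b": 70, "c": 50, "d": 30, "e": 10}  # the skewed case from the post
print(huffman_code_lengths(skewed))       # frequent symbols get the short codes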


My point is that for any file fed to something with these capabilities, all these forms of data and their arrangements can be treated suitably for their prevalence, with the fewest recorded bits for each component in mind, following the priority process.
There are a limited number of ways to detect prevalence and to record it.

My approach is to consider any file and compress it in any way possible, where even 1 bit saved, overheads included, is worth taking. If a file has only one stretch where a 0 bit repeats in a row for 4096 bits and the rest of the file is non-prevalent/uncompressible, it should detect that single part, compress it, and mark the rest as 'raw' in the coding with the original form written out, down to the exact offset of its appearance.

The approach covers the general ways to compress a file by its prevalence when looking at the 0 and 1 bits in their different arrangements. Time/space is no object. In general, counting the computer steps when optimized for compression, there is a limit to what is technically possible to compress to the smallest within given steps and temporary space.
Where space/time matters, for bandwidth or media, you can pick the specific methods that suit those constraints.


Any file fed in can potentially be compressible, depending on the prevalence to capture and the method used to write it. You require repetition/prevalence. A math formula such as 1+2+3+4+5+6, or pi, or 1*3*5 repeated over and over constitutes prevalence, since it is a base that is repeated and can be recognised.
From the bit makeup, strings may be suitable for compression, or instead you can look at gap placement: where there are many 0s and few 1s, for example, all you do is measure the distances between the 1s, record those lengths, and compress them further where applicable. Both strings and gap placement can compress the file, but gap placement will usually make it smaller, by mathematical constraint, because there is more prevalence and the method is more suitable for that case.
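A tiny sketch of that gap-placement idea (the helper name is illustrative): for sparse data, the distances between the 1 bits stand in for the bits themselves.

Code:
def one_bit_gaps(bits):
    # Record the number of zeros before each 1 bit, plus the trailing
    # run of zeros, instead of the bits themselves.
    out, run = [], 0
    for b in bits:
        if b:
            out.append(run)
            run = 0
        else:
            run += 1
    out.append(run)
    return out

bits = [0] * 500 + [1] + [0] * 200 + [1] + [0] * 300
print(one_bit_gaps(bits))   # [500, 200, 300]: three small numbers stand in for 1002 bits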


I may not know the technical terms or names of compression algorithms, but starting from the makeup of bits and their arrangement, and aiming to determine prevalence and record it in the smallest way possible so that any file can be scanned exhaustively for whatever compression is possible at all, I write up some approaches in the text.
Time/space is not an object, only the smallest possible file, down to saving even 1 bit from a mostly non-prevalent file; scaling and more targeted tweaks are still possible. By design, everything is calculated in advance to confirm whether compression is possible from all the components, and processing is rejected if the file cannot be reduced by even 1 bit.



Any compression algorithm ultimately looks for the same things. You cannot compress a file without prevalence, even after using some method to transform the data and break even. It is the same prevalence existing tools look for, perhaps scaled down and handled in chunks.

7z ultimately looks for patterns, 'clones/copies' and repetition in a row when it searches for offset+length matches.
Video codecs that exploit what is common between frames are ultimately looking for chunk-length data.

Here, however, these components are pushed to their maximum with thorough scans, so that for something like chunk-length data any occurrence anywhere in the entire file can be recognized, rather than only on a frame-to-frame basis.
Together they constitute maximum compression, with each component targeting its maximum potential.

It looks at the bit makeup; it doesn't rely on any kind of scheme, only on scans that determine exactly what the file contains and what it is capable of on its own.



Even though 7z looks at byte patterns, clones/clusters and in-a-row repetition through its offset+length matching, and can even work in chunks on 'ultra', the test file of 1024x768 24bpp with 8 colours, where each colour repeats for 4 lines (4096 pixels) and that chunk repeats 24 times, comes out at '1KB'.
This method should get it to ~50B or less.

A 1GB file of 0s in a row should be able to compress to a few bytes, yet 7z produces far more than that.
General tools take space/time shortcuts.

This may be on paper, but once programmed it should capture the data patterns for the components effectively, and get something like the 1GB file down to a few bytes, or that BMP example to ~50B or less, as it is capable of.



Looking at the 100MB text file for the Hutter Prize, for example, there is a lot of prevalence visible by hand and by eye.
A versatile, thorough tool that finds chunk lengths anywhere in the file can help a lot, as can clones/clusters matched accurately to the file and weighted for the smallest result.
From the bit makeup, using the minimal possible strings, and even gap/nCr where appropriate, should get it smaller still. Having multiple code-tree configurations over the exact offset ranges that suit the data should also bring the file down.


Considering the text has a generally narrower range of possibilities and many of them occur frequently, I believe that by looking at prevalence with the multiple approaches it should reduce to quite a bit less than its current size.
Without 'schemes/models' it should treat any file in the same manner.


.kkrieger, a 96KB one-level game, expands to a few hundred MB through its procedural approach to compression, using a very small colour palette of I believe 16 colours, very repetitive textures, a generally compacted map and so on. Compared with something like 7z, one may find 96KB a feat. Looking at the multiple components in this text and the amount of common prevalence, it should be possible to get it smaller than 96KB.



If you aim to compress a file to the bone wherever possible, you will be looking at every possible approach to determine and capture prevalence, and those are the same methods any compression can possibly use.


There were some (over)estimations in the past and in previous threads; however, the concept comes down to a mathematical constraint/limit on what is possible to compress in a file and on how many steps / how much temporary storage it requires.
    Last edited by SoraK05; 1st November 2014 at 10:46.

  21. #20
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,572
    Thanks
    780
    Thanked 687 Times in 372 Posts
Prevalence in files fluctuates, and it can only be captured with a specific set of approaches.
What do you mean here? Is BWT unable to help compress the data?

like a direct Huffman 'look at each byte and its frequency'
You don't know that compression is divided into modelling and entropy-coding stages, and that Huffman or arithmetic coding is used to encode the frequencies provided by models rather than the input bytes directly. Sorry, I don't have time to read the rest of your long message. I can only repeat that you haven't described any specific compression methods, just a set of naive "bricks", so there is nothing concrete to implement. Real compression methods involve a smart combination of well-developed compression ideas, and the chances that you can compete with them without learning first are the same as the chances of reinventing Mathematics without learning Arithmetic.

  22. #21
    Member
    Join Date
    Jul 2014
    Location
    Kenya
    Posts
    59
    Thanks
    0
    Thanked 1 Time in 1 Post
Huffman is traditionally described plainly as byte + frequency.
Prevalence appearing in multiple ways, and different approaches to it, is what is discussed.

If you decide to read the post, you can see for yourself.
If you read the text file, you see how the multiple components sit in a specific priority so they inherit from and complement one another.

If you read the post, you see I mention that there is a mathematical constraint on how much a file can possibly compress via the different avenues any compression algorithm will inevitably target, as well as on temporary storage / computer steps. Scaling for time/space is something else.

And there is more refinement to add to the text, as it is a WIP.

    EDIT:
Regarding BWT: moving/shuffling data according to specific priorities for compression is also mentioned, as well as encrypting to try to encourage more compression, and fake extracting/compressing.
    These are all things mentioned above:

DEVELOP SPECIFIC TARGET RANGES RATHER THAN PROCESS THE ENTIRE FILE
    MODE 0: AREA SCAN

    DEVELOP AN INITIAL PROPORTIONAL SHRINKING OF THE FILE
    MODE 1: STRINGS
    MODE 1: POLARITY GAPS
    MODE 1: NCR
    MODE 1: MATH

    TARGET PREVALENCE FROM POSITION, AND CAN INTERNALLY REPEAT
    MODE 2: CLONE/CLUSTER
    MODE 2: CHUNK LENGTH (INC. MATH)
    MODE 2: IN A ROW


    TO CONSIDER: [TO DO]
    MOVING/SHUFFLING DATA
    [LIMITED]
FAKE EXTRACTING, TREATING AS POORLY COMPRESSED AND RECOMPRESSING
FAKE ENCRYPTION TARGETING KEYS WHICH CAN PRODUCE PREVALENT RESULTS
    TARGET APPROPRIATE CODING OF DATA
    MODE 3: STATIC / VARIABLE / INCREMENTAL CODE TREES
    Quote Originally Posted by Bulat Ziganshin
real compression methods involve a smart combination of well-developed compression ideas
It is written how to use a temporary mapping file and work on it so that the result is inevitably tailored to the file.
It seems you haven't read it.
    Last edited by SoraK05; 1st November 2014 at 11:06.

  23. #22
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,572
    Thanks
    780
    Thanked 687 Times in 372 Posts
Do you have a concrete algorithm consisting of well-described steps? How much time will the algorithm need to compress a 1 KB file?

Huffman is traditionally described plainly as byte + frequency.
Yes, that is because descriptions should be simple. But it doesn't mean that real compression algorithms are limited to that usage. Once again, the problem is that you know too little about compression and are reinventing the wheel instead of learning the existing design space.

I read the first version of the file and, sorry, will not read newer versions. Even the brief contents list shows that you lack even basic math knowledge - the methods you propose would take billions of years to compress a few bytes.

PS: I'm not fighting with you. All that Matt and I are trying to do is help you start learning math, information theory and data compression instead of reinventing the wheel.
    Last edited by Bulat Ziganshin; 1st November 2014 at 12:36.

  24. #23
    Member
    Join Date
    Jul 2014
    Location
    Kenya
    Posts
    59
    Thanks
    0
    Thanked 1 Time in 1 Post
Time is no object; the principle is working towards the smallest result.
I don't have the exact math steps and pseudocode, only an overview to implement.

How long compressing a 1KB file takes depends on its content and on which components get used.
If there is anything in a row worth doing, it will do it. If there are chunk lengths anywhere in the file, it will do those. Depending on the file, it can build multiple appropriate code trees to use throughout. It is up to the file.

Scaling for space/time can be done as well, but an option for the smallest (possible) result is the point.


Also, each method does have some idea in mind of doing only what is necessary.
Minimal variable strings (instead of bytes and frequency) aim to do minimal brute scans. Any other brute scans are also constrained by checking whether prefixes gain/lose enough weight to be worth continuing. There are things in place to keep the scans to only what is necessary.


    EDIT:
I use Huffman as the example for strings, taking byte + frequency.
There are minimal variable strings instead of a fixed 8 bits, there is looking at gap lengths, there is using nCr, and there is math (to do). It is not only (string) byte + frequency but other approaches to code and use in a code tree for the file.

  25. #24
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,572
    Thanks
    780
    Thanked 687 Times in 372 Posts
There is a well-known algorithm that reaches the smallest possible size - Kolmogorov complexity. The problem with such algorithms, including yours, is that they take billions of years to compress a few bytes. You just never thought about that.

  26. #25
    Member
    Join Date
    Jul 2014
    Location
    Kenya
    Posts
    59
    Thanks
    0
    Thanked 1 Time in 1 Post
I'm looking at Kolmogorov complexity and don't necessarily see what goes on to reduce the size for prevalence.

I do not think it will take billions of years to compress a few bytes.
A few bytes can be, say, 128 bytes / 1024 bits. If you do multiple constrained scans you might end up scanning it a few thousand times at the high end. In general, a few thousand passes is the expected maximum for a file.

It does require multiple scans through to develop the information, but with ASM, which is far faster than higher-level languages, with the shortest steps in mind for the architecture, and without any GUI information holding the process back (not even printing a progress bar), it should manage in reasonable time.

Those few bytes should be doable, even with a few thousand left-to-right scans, in far less than billions of years.


    EDIT:
It depends on the content of the file. If the file has data that compresses quickly, with fewer entries to rescan, it can even be just a handful of scans. At the high end, a few thousand left-to-right scans is expected for the more complex files.
Files that cannot compress will give out after even a hundred or so, having found nothing at all to progress on.


    EDIT:
Also, the thousands of scans mentioned are not only over the original file but over the developing temporary mapping file, which ends up holding less and less data after the initial MODE 1 part is done. So the scanning continues on smaller and smaller temporary mapping files.
    Last edited by SoraK05; 1st November 2014 at 13:39.

  27. #26
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,572
    Thanks
    780
    Thanked 687 Times in 372 Posts
FAKE ENCRYPTION TARGETING KEYS WHICH CAN PRODUCE PREVALENT RESULTS
With a 128-bit key (i.e. just 16 bytes) this requires 2^128 ≈ 3×10^38 attempts. Even if each attempt takes 10^-9 seconds, overall it will require about 3×10^29 seconds. And that's just one example. By checking only a few thousand variants, you will get much less compression than gzip or so. Your idea (which is not original) is to randomly test various ways to represent the data, and it will take billions of years once you try to check more than just a few thousand variants.
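A back-of-the-envelope check of that arithmetic (assuming one attempt per nanosecond, as above):

Code:
attempts = 2 ** 128                      # every possible 128-bit key
seconds = attempts * 1e-9                # one attempt per nanosecond
years = seconds / (3600 * 24 * 365)
print(f"{attempts:.2e} attempts  {seconds:.2e} s  {years:.2e} years")
# ~3.40e+38 attempts, ~3.40e+29 s, ~1.08e+22 years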

But it seems you aren't going to learn the math, at least. I will not answer you anymore.
    Last edited by Bulat Ziganshin; 1st November 2014 at 15:45.

  28. #27
    Member biject.bwts's Avatar
    Join Date
    Jun 2008
    Location
    texas
    Posts
    449
    Thanks
    23
    Thanked 14 Times in 10 Posts
The problem with one-click compression is that it will never work for all files.

If you make a general compressor, no matter how perfect, you might as well use the copy command, since that is the optimal compressor for the set of all files.

Any time you attempt to write a general compressor for all files, if it compresses some files smaller to save a few bytes, then there are other files that get longer when put through the compressor. In fact it's not a fair trade. If you test with random files you will see that the average random file is expanded, so it is best to stick with copying (the identity transform); even something like BWTS would have the same optimal compression over the set of all files, if you prefer something more exotic than a straight copy.

All so-called general compressors, to be of any use at all, target specific file structures that occur in the real world. You must test your coders on some set of files, or on files with specific properties. The fact is that the types of files in common use change with time. Something that may have been almost ideal for the general files in use 20 years ago would likely be of little use today. It's a constantly changing target.

Look, you could have a one-click compressor that tests 2**800 ways to compress, say, a 100-byte file. The problem is, suppose one of those ways gets the particular input file down to 5 bytes. Now when you write the output file, besides the 5 bytes you also need 100 bytes to tell the decompressor which of the 2**800 methods was used. So in this case you have really increased the output by 5 bytes, because of the control information.

  29. #28
    Member
    Join Date
    Jul 2014
    Location
    Kenya
    Posts
    59
    Thanks
    0
    Thanked 1 Time in 1 Post
    Quote Originally Posted by Bulat Ziganshin View Post
With a 128-bit key (i.e. just 16 bytes) this requires 2^128 ≈ 3×10^38 attempts. Even if each attempt takes 10^-9 seconds, overall it will require about 3×10^29 seconds. And that's just one example. By checking only a few thousand variants, you will get much less compression than gzip or so. Your idea (which is not original) is to randomly test various ways to represent the data, and it will take billions of years once you try to check more than just a few thousand variants.

But it seems you aren't going to learn the math, at least. I will not answer you anymore.

I am familiar with the concept that encryption consists of an algorithm that modifies the data plus a key that drives the algorithm, and that this can be attacked.
If you know what the encrypted and decrypted data look like, and you know the algorithm, it is definitely possible to isolate the key much more quickly than by brute force.

At the most basic level, encryption can involve incrementing all the data by the number the key represents in hex - i.e. all bytes increment by 9 when the key is 09.

Considering you can have an idea in advance of what the data can become, you can intentionally design an algorithm that looks at uncompressible data and applies something that deliberately produces a result which looks prevalent instead. This is limited, but possible.

This does not involve constant brute forcing. It would target specific data only, with a model prepared in advance, to do minimal scans and force a prevalent result where able.


If you look at hashes, something like MD5 has a certain hash length in bytes.
If you know, say, the length of the file it represents, then you know there are x possible files of that length which could match that hash.
If you know details about the contents of the file, you cut the possibilities down from x to y possible files.


Some data is prevalent as it is, like many 0 bits or common patterns all around the file. It would be detrimental to try converting this kind of data, since it is already ideal to compress, and it is what you would expect compressible / encryption-modified information to resemble.

One would target only the less prevalent data, and strictly the less prevalent data whose makeup shows in advance that it could be transformed, with appropriate math values applied, into something more compressible that breaks even.

It is a very specific targeting process, with a basic algorithm to minimize and localize the transformation scanning to only what is necessary.


I did not put this in the writeup, but it is the plan.

  30. #29
    Member
    Join Date
    Jul 2014
    Location
    Kenya
    Posts
    59
    Thanks
    0
    Thanked 1 Time in 1 Post
    Quote Originally Posted by biject.bwts View Post
    The problem with one click compression is that it will never work for all files.
From the examples: by having all the components available separately, a data file like 1GB of 0 bits has the potential to reach the few bytes it is capable of being.
Where a file has the same 10-bit pattern throughout most of its length, you can use the minimal variable-string scan to find that pattern across the entire file and represent it optimally.
Where there is a lot of repetition occurring specific lengths apart, you can catch it anywhere in the file.

Having the components gives the potential to reduce the file to the smallest it is capable of.
1GB of 0 bits cannot compress to less than a '0' and '1GB' as a number, which is catered for by repetition in a row, for example.

By providing components for multiple approaches to capture prevalence and record it suitably for its type, plus any other tweaks to merge or represent it, you give files the potential to reach their smallest form in general.

The point is to have all the possible components for detecting prevalence and writing it work together with weights, each suited to its task.

  31. #30
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,257
    Thanks
    307
    Thanked 797 Times in 489 Posts
    SoraK05, please read the following and come back in a year when you have a working program. Really, I'm trying to help.
    http://cs.fit.edu/~mmahoney/cse2050/introcpp.html
    http://mattmahoney.net/dc/dce.html


