Thread: Detecting non-immediate data correlations

  1. #1
    Member
    Join Date
    May 2009
    Location
    CA USA
    Posts
    22
    Thanks
    0
    Thanked 0 Times in 0 Posts

    Detecting non-immediate data correlations

    In a lot of files, there are local but not immediate neighbor correlations between bytes or words. The obvious example is an image file format where a pixel sample from the previous row is very similar to the current pixel, but that pixel is not immediately adjacent in the file.. it's lagged at a distance equal to the width of the image.

    A more subtle example is encoding a file of 4-byte floating point numbers. If the FP numbers are weakly correlated, then each byte is still nearly uncorrelated with the previous byte. But the MSB of the word (which holds the exponent of the floating point number) is extremely likely to be similar to the MSB of the previous word (located 4 bytes back). The mantissa bytes may also have correlations, but they tend to be noisier and more weakly related. All of this depends on the exact file (and there are also strategies specifically for FP compression), but I'm just talking about detection of correlations in general.

    Or for that matter, imagine an RGB format where each value is unrelated to the value before it, but strongly related to the value three words back. Classic PPM/BWT/LZ77 will miss any of these opportunities.

    My question.. assuming you have an input file with no information or hints from the user, what's a good way to detect these kinds of correlations? If you can parse files to recognize an image file format, you may get a clue, but surely there's some approach that can build up statistics to find correlations automatically even on a novel file. (I know that many compressors (PAQ included) have filters that try to detect known formats like JPG, but let's assume we have no such custom detectors.)

    I could imagine building a correlation detector which makes a table of, say, 1024 "history" correlations, just relating byte x with each of the 1024 bytes before it.. this is basically a limited-window autocorrelation. But that may not work well with, for example, the floating point case, where the LSB of each word tends to be uncorrelated with anything, and therefore its noise will considerably dilute any correlation for the MSB byte.
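
    In rough C++, the naive version of what I'm picturing would be something like this (the window size and the exact-match scoring are just placeholders, not a worked-out design):
    Code:
    // Rough sketch only: score each lag 1..window by how often a byte exactly
    // matches the byte that many positions back. A strongly periodic file
    // (stride = image width, record size, etc.) shows up as a spike at that lag.
    // Cost is O(size*window), which is fine if you only sample the first few KB.
    #include <cstdint>
    #include <cstddef>
    #include <vector>

    std::vector<uint64_t> lag_match_counts(const uint8_t* src, size_t size, size_t window = 1024)
    {
        std::vector<uint64_t> hits(window + 1, 0);   // hits[d] = exact matches at lag d
        for (size_t i = window; i < size; i++)
            for (size_t d = 1; d <= window; d++)
                hits[d] += (src[i] == src[i - d]);
        return hits;
    }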


    Are there strategies for searching for these kinds of correlations in a file? I keep thinking of the windowed autocorrelation table, or an integer FFT or something, but I don't know.

    I'm not even worried about trying to use any correlations in compression yet; it's just a topic that keeps bouncing around in my head.. some kind of strategy where, if compression on part of a file is not working well, you look for a transformation of the file that does compress better.

    I deal a lot with scientific data types which are full of repetitive data stored as packed binary structures, and I know that eventually there's a way to find and exploit that structure.

    Any brainstorms?

  2. #2
    Member Fu Siyuan's Avatar
    Join Date
    Apr 2009
    Location
    Mountain View, CA, US
    Posts
    176
    Thanks
    10
    Thanked 17 Times in 2 Posts
    Code:
    // Count, for a handful of fixed distances, how often a byte exactly matches
    // the byte that far ahead (sameDist[]) and how far apart the two values are
    // in total (succValue[]).  src, size, progress, entropy, diffNum and DT_DLT
    // come from the surrounding detector code.
    while (progress < size - 32)          // 32 is the largest distance tested below
    {
        sameDist[0] += (src[progress] == src[progress + 1]);
        sameDist[1] += (src[progress] == src[progress + 2]);
        sameDist[2] += (src[progress] == src[progress + 3]);
        sameDist[3] += (src[progress] == src[progress + 4]);
        sameDist[4] += (src[progress] == src[progress + 8]);
        sameDist[5] += (src[progress] == src[progress + 12]);
        sameDist[6] += (src[progress] == src[progress + 16]);
        sameDist[7] += (src[progress] == src[progress + 32]);
        succValue[0] += abs((signed)src[progress] - (signed)src[progress + 1]);
        succValue[1] += abs((signed)src[progress] - (signed)src[progress + 2]);
        succValue[2] += abs((signed)src[progress] - (signed)src[progress + 3]);
        succValue[3] += abs((signed)src[progress] - (signed)src[progress + 4]);
        succValue[4] += abs((signed)src[progress] - (signed)src[progress + 8]);
        succValue[5] += abs((signed)src[progress] - (signed)src[progress + 12]);
        succValue[6] += abs((signed)src[progress] - (signed)src[progress + 16]);
        succValue[7] += abs((signed)src[progress] - (signed)src[progress + 32]);
        progress++;
    }

    // Find the distance with the smallest total absolute difference (bestChnNum)
    // and the extremes used by the acceptance test below.
    maxSame = minSame = sameDist[0];
    maxSucc = minSucc = succValue[0];
    bestChnNum = 0;

    for (i = 0; i < 8; i++)
    {
        if (sameDist[i] < minSame) minSame = sameDist[i];
        if (sameDist[i] > maxSame) maxSame = sameDist[i];
        if (succValue[i] > maxSucc) maxSucc = succValue[i];
        if (succValue[i] < minSucc)
        {
            minSucc = succValue[i];
            bestChnNum = i;
        }
    }

    // Accept delta filtering at the best distance only if it is clearly better
    // than the worst distance and the plain order-0 statistics look incompressible.
    if (((maxSucc > succValue[bestChnNum] * 4) || (maxSucc > succValue[bestChnNum] + 40 * size))
        && (sameDist[bestChnNum] > minSame * 3)
        && (entropy > 700 * size || diffNum > 245)
        && (sameDist[0] < 0.3 * size))
    {
        //printf("delta:%d %d %d %d %d %d r:%d\n", succValue[0], succValue[1], succValue[2], succValue[3], succValue[4], sameDist[5], bestChnNum);
        return DT_DLT + bestChnNum;
    }
    I made code like this; it sometimes works, but it's not perfect.

  3. #3
    Member
    Join Date
    May 2009
    Location
    CA USA
    Posts
    22
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Hi, Fu. Looks like you are using a windowed autocorrelation to brute-force look for duplicates and/or small integer differences.
    This is something similar to what I'm thinking too, but I do know it's flawed. (Still, it's SOMETHING.)

    Like you, I thought of making a set of fixed "likely" correlation distances. I like how you include both 3 and 12, which feel like likely strides for data triples like xyz or RGB in byte or word steps.
    I'm also thinking of going further and recording such autocorrelations at different PHASES, too, to get around the "correlation between bytes at addresses 0 mod 4 is high" interleaving typical for FP numbers. What I mean by storing phases is making arrays similar to yours, but probably with 12 or 24 independent copies. The byte's address mod the number of copies decides which table the data gets recorded in.

    Thus, the 0 mod 4 case will show up as a strong outlier, since you'll see it affect the 0, 4, 8, 12, 16, and 20 mod 24 tables. 24 is chosen because it's 8*3.. the 3 is for RGB/XYZ and the 8 is for 64-bit integers or double floats. The 24 tables would let you find byte correlations with periods of 1, 2, 3, 4, 6, 8, 12, and 24 bytes (all the factors of 24). The correlation DISTANCES could be a table with perhaps all the distances 1-16, 24, 32, 36, 48, 64, 100, 128, 256, 300, 400, 512, 640, 1000, 1024, 1280, 640*3, 640*4, 3*1024, and a few more. (Those strides are picked ad-hoc as likely image strides which might be hidden in opaque binary data.)
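
    To make the phase idea concrete, here's a rough sketch of the bookkeeping I have in mind (the phase count, the lag set, and the struct layout are all ad-hoc choices, nothing more):
    Code:
    // Rough sketch: accumulate match counts and absolute differences per
    // (phase, lag) pair, where phase = byte address mod 24.  A lag that is
    // strong only at certain phases (e.g. the exponent byte of packed floats)
    // then stands out instead of being diluted by the noisy phases.
    #include <cstdint>
    #include <cstddef>

    const int NPHASE = 24;                                       // 24 = 8*3, as above
    const int LAGS[] = { 1, 2, 3, 4, 6, 8, 12, 16, 24, 32, 48 };
    const int NLAG   = sizeof(LAGS) / sizeof(LAGS[0]);

    struct PhaseStats {
        uint64_t matches[NPHASE][NLAG];   // exact byte matches per (phase, lag)
        uint64_t absdiff[NPHASE][NLAG];   // sum of |a-b| per (phase, lag)
        uint64_t samples[NPHASE];         // samples per phase, for normalization
    };

    void gather_phase_stats(const uint8_t* src, size_t size, PhaseStats& st)
    {
        st = PhaseStats();                // zero all counters
        for (size_t i = (size_t)LAGS[NLAG - 1]; i < size; i++) {
            int p = (int)(i % NPHASE);
            st.samples[p]++;
            for (int j = 0; j < NLAG; j++) {
                uint8_t a = src[i], b = src[i - LAGS[j]];
                st.matches[p][j] += (a == b);
                st.absdiff[p][j] += (a > b) ? (a - b) : (b - a);
            }
        }
    }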

    Of course it's all ad-hoc, and most of the reason I posted was to see if there were other tricks to try. Maybe some fancy mathematical transform that could somehow magically say "oh, look at the byte 4380 bytes previously in the stream! It's usually correlated, at least enough to give you an encoding clue!" Obviously fixed tables can't find such distant correlations without using thousands and thousands of tables.

  4. #4
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,366
    Thanks
    212
    Thanked 1,018 Times in 540 Posts
    > I deal a lot with scientific data types which are full of repetitive
    > data stored as packed binary structures, and I know that eventually
    > there's a way to find and exploit that structure.

    The best way is just to have custom models for all relevant structures.
    Actually its the only way to avoid misdetections etc - you won't know
    until you actually compress the data.
    For example, see http://encode.su/threads/1088-Someti...ok-like-random

    So imho its not really a problem of runtime detectors, and for offline
    structure analysis you can usually afford limited bruteforce search.

    Also note that you'd have a huge redundancy if you'd try compressing text
    and image pixels with the same probability estimators, even with known
    image width. There's no such thing as a perfect memoryless model either,
    so context selection is not the only problem.

    But then, there're still lots of different file formats and structures,
    so its best to simplify the manual model definition and compressor
    construction as much as possible... imho that's the purpose of zpaq project:
    http://mattmahoney.net/dc/zpaq.html ... although unfortunately for now
    it only defines a modelling VM, but doesn't simplify custom model construction
    at all, comparing to C++ cut-and-paste.

    Still, there're surely lots of attempts to do what you ask - sometimes
    (ie to detect raw audio structure) its enough to try all the common possibilities
    (16bit/8bit stereo/mono) and estimate data entropy with a simplified model.
    For example of such model you can see http://ctxmodel.net/files/PPMd/segfile_sh2.rar
    (which implements another important stage of structure detection).

    Another example is http://freearc.org/download/research/delta10.zip
    (there's a description in the source)

    Also http://libbsc.com/Documents/bsc-2.4.0-src.zip
    (look for bsc_detect_recordsize)

    Anyway, common methods:
    1. (best compression) Compress the data with the main model using a reasonable
    set of parameter values and choose the best result.
    2. Same, but estimate entropy with a simplified model and without actual coding
    (see seg_file)
    3. http://en.wikipedia.org/wiki/Mean_square_error =>
    Sum (a-b)^2 = Sum a^2 + Sum b^2 - 2* Sum a*b =>
    http://en.wikipedia.org/wiki/Autocor...nt_computation
    like you mentioned (a rough sketch of this one follows the list)
    4. LZ match analysis:
    http://www.fantascienza.net/leonardo...tatistics.html
    With more different match lengths allowed it would be also reasonable to use FFT.
    Also its the only way to detect the syntax of structured files with no fixed-size records,
    like webserver log files.
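
    To illustrate method 3, a direct (non-FFT) version might look like the sketch below; the lag range, the normalization and the "pick the minimum" rule are arbitrary choices for illustration:
    Code:
    // Rough sketch of method 3: score each candidate lag by the mean squared
    // difference between src[i] and src[i-lag] and pick the smallest.  By the
    // identity Sum(a-b)^2 = Sum a^2 + Sum b^2 - 2*Sum a*b, only the cross term
    // Sum a*b varies with the lag, and computing it for every lag at once is an
    // autocorrelation, which an FFT does in O(n log n) instead of O(n*maxLag).
    #include <cstdint>
    #include <cstddef>

    size_t best_lag(const uint8_t* src, size_t size, size_t maxLag)
    {
        size_t best = 1;
        double bestScore = 1e300;
        for (size_t lag = 1; lag <= maxLag && lag < size; lag++) {
            uint64_t sum = 0;
            for (size_t i = lag; i < size; i++) {
                int d = (int)src[i] - (int)src[i - lag];
                sum += (uint64_t)(d * d);
            }
            double score = (double)sum / (double)(size - lag);   // per-sample MSE
            if (score < bestScore) { bestScore = score; best = lag; }
        }
        return best;
    }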

    Still, imho its more of a technical problem comparing to implementing a _good_ model for
    data with known structure. For example, afaik we don't have a public implementation of
    a proper model for 32-bit floats - its clearly not enough as "geopack" shows:
    http://nishi.dreamhosters.com/u/FLTgeo3b1.rar
    Code:
    102400 geo     // file from Calgary corpus with weird ancient floats
     43869 geo.paq // paq8px_v69 -7
     97472 geodump // parsed floats converted to int32s and w/o headers
     43762 geodump.paq // paq8px_v69 -7
     39925 1       // geodump compressed with geopack
       200 2       // header template (uncompressed)
     40047 geodump.ha // geopack (.ha is used to concatenate files)
    Note that its from 2003, and there's been a lot of improvements since then.
    And paq8px _can_ detect the 32-bit record size here, but there's still
    like 20% redundancy in its results - do you think its ok if you're able
    to detect the structure, compress it with the same model, and show better
    compression than lzma or something?

  5. #5
    Member
    Join Date
    Oct 2010
    Location
    Germany
    Posts
    286
    Thanks
    9
    Thanked 33 Times in 21 Posts
    A simple way for small record sizes would be de-interleaving the file and determining the order-1 entropy. Summing absolute differences (or the "energy", if you square them) often does not work very well.
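
    A rough sketch of what I mean; the candidate record sizes and the plain empirical estimate are only for illustration (in practice larger record sizes get more contexts, so a small model-cost penalty helps):
    Code:
    // Rough sketch: for each candidate record size, de-interleave the data into
    // that many byte streams and sum the order-1 empirical entropy of the streams
    // (context = previous byte of the same stream); pick the minimum.
    #include <cmath>
    #include <cstdint>
    #include <cstddef>
    #include <vector>

    static double order1_bits(const uint8_t* src, size_t size, size_t stride, size_t phase)
    {
        std::vector<uint32_t> freq(256 * 256, 0), total(256, 0);
        for (size_t i = phase + stride; i < size; i += stride) {
            uint8_t ctx = src[i - stride];
            freq[ctx * 256 + src[i]]++;
            total[ctx]++;
        }
        double bits = 0.0;
        for (int ctx = 0; ctx < 256; ctx++)
            for (int c = 0; c < 256; c++)
                if (freq[ctx * 256 + c])
                    bits -= freq[ctx * 256 + c] *
                            std::log2((double)freq[ctx * 256 + c] / total[ctx]);
        return bits;
    }

    size_t best_record_size(const uint8_t* src, size_t size)
    {
        const size_t candidates[] = { 1, 2, 3, 4, 6, 8, 12, 16, 24, 32, 48 };
        size_t best = 1;
        double bestBits = 1e300;
        for (size_t k = 0; k < sizeof(candidates) / sizeof(candidates[0]); k++) {
            double bits = 0.0;
            for (size_t p = 0; p < candidates[k]; p++)
                bits += order1_bits(src, size, candidates[k], p);
            if (bits < bestBits) { bestBits = bits; best = candidates[k]; }
        }
        return best;
    }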

  6. #6
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,507
    Thanks
    742
    Thanked 665 Times in 359 Posts
    Last edited by Bulat Ziganshin; 30th November 2010 at 12:44.

  7. #7
    Member
    Join Date
    May 2009
    Location
    CA USA
    Posts
    22
    Thanks
    0
    Thanked 0 Times in 0 Posts


    Quote Originally Posted by Shelwien View Post
    > I deal a lot with scientific data types which are full of repetitive
    > data stored as packed binary structures, and I know that eventually
    > there's a way to find and exploit that structure.

    The best way is just to have custom models for all relevant structures.
    Actually its the only way to avoid misdetections etc - you won't know
    until you actually compress the data.
    For example, see http://encode.su/threads/1088-Someti...ok-like-random

    So imho its not really a problem of runtime detectors, and for offline
    structure analysis you can usually afford limited bruteforce search.
    Ah, excellent link, that's the kind of "puzzle" data I was thinking might be detectable. That thread is also fun because it's manual analysis and shows some of the techniques that you could try to attack it with, and I love the different approaches you guys explored with it.


    Also note that you'd have a huge redundancy if you'd try compressing text
    and image pixels with the same probability estimators, even with known
    image width. There's no such thing as a perfect memoryless model either,
    so context selection is not the only problem.
    True, it gets expensive and impossible to find all such correlations in a file. But even a crude test may pay off. If you take 100MB and brute-force some correlations using just the first 10K of data, it's very possible that those correlations extend across the whole 100MB.
    An ultimate research compressor would try to peel out even dynamically changing structure, but sometimes that low-hanging fruit would be enough. Think of my example of a scientific app that is saving a big array as a giant dumb binary dump. There are 1M copies of a 48-byte structure. If you can find that period of 48 (and there is correlation between adjacent records) you might do well.



    Still, there're surely lots of attempts to do what you ask - sometimes
    (ie to detect raw audio structure) its enough to try all the common possibilities
    (16bit/8bit stereo/mono) and estimate data entropy with a simplified model.
    For example of such model you can see http://ctxmodel.net/files/PPMd/segfile_sh2.rar
    (which implements another important stage of structure detection).
    Yes, filters that look for known filetypes are clearly a big win.. not just in theory but in practice. PAQ is an extreme example, but if I recall correctly, Stuffit does really well because they have a lot of format detectors and therefore can do a good job with a lot of common data like MP3s and JPGs.


    Still, imho its more of a technical problem comparing to implementing a _good_ model for
    data with known structure. For example, afaik we don't have a public implementation of
    a proper model for 32-bit floats - its clearly not enough as "geopack" shows:
    Yep, FP data is a special breed, and common. Unless it's packed with repeated values, it tends not to compress very well. We get spoiled by the endless clever techniques text gives us, especially in context modelling, but raw numbers just don't have the same richness.


    One of the real apps I'm experimenting with now is very fast compression of raw binary memory just to speed up GPU computes! A GPU has huge memory bandwidth (150GB/sec) but communicating between GPU and CPU is only about 5GB/sec. And sometimes you have to communicate GPU to GPU which uses TWO transfers. The idea is that if you can quickly compress data (in an unknown format!) you get a very useful boost in communications bandwidth. You take your 100MB of binary data, spend a few milliseconds compressing it, getting say a simple 25% savings, send the compressed data over the PCIe bus, and decompress it. The win is the increase in effective PCIe bandwidth. (This is a common situation on the GPU: compute is free, but transfers are not.)
    Since the data being transmitted is opaque binary, finding any correlations is useful. And seemingly wasteful brute force searches are often easy on the GPU. Large caveats apply of course.

  8. #8
    Member
    Join Date
    May 2009
    Location
    CA USA
    Posts
    22
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Quote Originally Posted by Bulat Ziganshin View Post
    Now that experiment is so interesting! I love the simplicity of the model.. find correlations at a certain offset, forming virtual columns of data, then shuffle the (correlated) columns into consecutive runs where the correlations can actually be exploited.

    That's a really straightforward way to attack the blind structure-seeking problem.. its simplicity means it may be pretty robust to all kinds of data.
    Bulat, thanks for that work (and the source code), since it definitely meshes with some of the approaches I've been thinking of.

  9. #9
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,366
    Thanks
    212
    Thanked 1,018 Times in 540 Posts
    1. I tried to point that your "scientific app" doesn't normally change
    the output format all the time, so you don't have to find structure
    in its data each time. Instead, you have to build a custom coder
    (or a preprocessor for common coder) and then just use it for your data.
    In other words, I think that file formats are fixed in most cases
    (=known or have to be analyzed once), and actual structure analysis
    is only necessary when a new filetype appears, and thus its speed
    doesn't matter - it can be even done manually for now, with tools
    like
    http://www.sweetscape.com/010editor/templates.html
    or http://flavor.sourceforge.net/

    Also check out http://encode.su/threads/271-mp3dump
    The structure is known there, but it doesn't really help.

    2. To clarify things, seg_file is a file segmentation utility,
    it splits given data into blocks by trigram statistics, ie
    separates text from code in executables.
    And I think that its entropy model (a hashed bytewise o2)
    can be as well used for other types of analysis.

    3.
    > Yep, FP data is a special breed, and common. And unless it's packed
    > with repeated values, it tends not to compress very well.

    I disagree here. Floating-point numbers are usually pretty compressible
    (especially doubles) - all the techniques from lossless audio coders
    apply, etc.
    The main problem there is that all the LPC/delta techniques produce
    a noticeable redundancy when dumbly applied to floats, because the precision
    required to encode a float delta is more than the precision of the float itself.
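
    For illustration, a minimal sketch of one common workaround, in the spirit of lossless float coders such as FPC (which XOR the actual bits against a predicted value; here the "prediction" is simply the previous float, and the function name is made up): delta the raw bit pattern instead of the float value, so the residual never needs more precision than the float itself.
    Code:
    // Rough sketch: reinterpret each 32-bit float as an integer and XOR it with
    // the previous value's bit pattern.  The residual is always exactly 32 bits,
    // and for slowly varying data the high (sign/exponent) bits of the residual
    // are mostly zero, which a later entropy coder can exploit.  Fully reversible.
    #include <cstdint>
    #include <cstddef>
    #include <cstring>

    void xor_delta_floats(const float* in, uint32_t* out, size_t n)
    {
        uint32_t prev = 0;
        for (size_t i = 0; i < n; i++) {
            uint32_t bits;
            std::memcpy(&bits, &in[i], sizeof bits);   // reinterpret without UB
            out[i] = bits ^ prev;                      // residual, still 32 bits
            prev = bits;
        }
    }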

    4.
    > A GPU has huge memory bandwidth (150GB/sec) but communicating
    > between GPU and CPU is only about 5GB/sec.

    That "only" is a bit suspicious :)
    I mean, what kind of data your cpu can even generate at 5GB/s rate? :)
    GPU->CPU may make some sense, but again, 5GB/s is more than a byte per cpu clock -
    imho it can only work if cpu can use the data directly in "compressed"
    form and with nearly no overhead - ie more or less limited to selecting
    the right number format.

    Also what kind of GPU can demonstrate 150GB/s memory i/o?
    I experimented with my 8800GT @ 1.715Ghz and a global memory access
    took ~100 clocks there, so even if it has a 256bit memory bus,
    its still ~550MB/s (1715/100*256/8) and 17561MB/s if all 32 cuda cores
    there can really access memory independently with no extra delay.
    And it claims 64GB/s memory bandwidth too.

    > Since the data being transmitted is opaque binary, finding any
    > correlations is useful. And seemingly wasteful brute force searches
    > are often easy on the GPU. Large caveats apply of course.

    You may be right about potential GPU usefulness for data structure analysis,
    but for now I'm pretty sceptical about that. I tried implementing a dumb
    md5 bruteforcer with cuda, and it was barely faster than my (overclocked) cpu
    (GPU:~130Mp/s vs CPU:~100Mp/s).
    And hash cracking is a perfect task for GPU, real ones would do worse afaik,
    especially at compression / entropy estimation which would use memory
    for statistics.

    And anyway, why is it "opaque binary"?
    You can provide some way for the user of your supposed library to
    define the data structure, and also optionally provide an utility
    which could analyze the data and generate a structure description
    for use in runtime.
    Either way, forcing the GPU to determine the (usually fixed) data
    structure in runtime feels like a waste, even if it has nothing
    better to do.

  10. #10
    Member
    Join Date
    May 2009
    Location
    CA USA
    Posts
    22
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Quote Originally Posted by Shelwien View Post
    1. I tried to point that your "scientific app" doesn't normally change
    the output format all the time, so you don't have to find structure
    in its data each time. Instead, you have to build a custom coder
    (or a preprocessor for common coder) and then just use it for your data.
    You're right that the analysis only needs to be done once if the app is unchanged.
    But I also want to make a library that can do the compression with no knowledge at all, since it'd be open source code that people would just call. (But actually my interest goes deeper, the GPU fast-compressor is just one tangible first pass.)

    2. To clarify things, seg_file is a file segmentation utility,
    it splits given data into blocks by trigram statistics, ie
    separates text from code in executables.
    And I think that its entropy model (a hashed bytewise o2)
    can be as well used for other types of analysis.
    I rather like this.. it's a simple analysis but it seems to work pretty well and its simplicity makes it robust.

    > A GPU has huge memory bandwidth (150GB/sec) but communicating
    > between GPU and CPU is only about 5GB/sec.
    That "only" is a bit suspicious
    I mean, what kind of data your cpu can even generate at 5GB/s rate?
    GPU->CPU may make some sense, but again, 5GB/s is more than a byte per cpu clock -
    imho it can only work if cpu can use the data directly in "compressed"
    form and with nearly no overhead - ie more or less limited to selecting
    the right number format.
    Actually it's usually not that your CPU is creating data at such a fast rate. It may just be doing nothing except hosting some data that is too large for the GPU to hold onboard. So the CPU is just responding to multiple "give me this chunk of data" requests. And the speed of that transfer is often a bottleneck.


    Also what kind of GPU can demonstrate 150GB/s memory i/o?
    I experimented with my 8800GT @ 1.715Ghz and a global memory access
    took ~100 clocks there, so even if it has a 256bit memory bus,
    its still ~550MB/s (1715/100*256/8) and 17561MB/s if all 32 cuda cores
    there can really access memory independently with no extra delay.
    And it claims 64GB/s memory bandwidth too.
    Here we're drifting offtopic, but that's OK. GPU programming is actually my "real job".. compression is a side research which I love just because it's so interesting.

    The latest GPUs have huge memory bandwidth.. about 190 GB/sec for the very newest GTX580. It has a 384-bit bus of GDDR5 at 4GHz effective speed = 192GB/sec. The 100-clock timing you give is a latency measurement (and it's more like 400 clocks with the latest boards), but throughput really is that high. And that's counting actual DRAM fetches from off-chip. The latest GPUs have caches, which are small compared to CPUs, but very fast.

    Even with that bandwidth, it's easy to saturate since compute is so cheap. So a lot of GPU programming is all about managing memory transfers and less about math.

    Re: your 8800GT analysis, GPUs are interesting because, unlike a CPU, the cores have many memory transactions in flight at once.. it's not unreasonable for hundreds of simultaneous memory transactions to be pending per independent GPU core. This is probably the core difference between CPU and GPU hardware.. the GPU is designed for high throughput at the expense of high latency, which it hides with a massive number of simultaneous operations. The CPU is concerned with minimizing latency and juggles only 2 or maybe 3 operations at once per core. Hyperthreading is a trick which doubles the number of transactions the CPU can hold pending at once.

    You may be right about potential GPU usefulness for data structure analysis,
    but for now I'm pretty sceptical about that. I tried implementing a dumb
    md5 bruteforcer with cuda, and it was barely faster than my (overclocked) cpu
    (GPU:~130Mp/s vs CPU:~100Mp/s).
    And hash cracking is a perfect task for GPU, real ones would do worse afaik,
    especially at compression / entropy estimation which would use memory
    for statistics.
    Hash cracking isn't bad for the GPU, but not perfect. Elcomsoft has done very well with their GPU hash crackers though. And I myself wrote one for SHA1 cracking which completely trounced CPU speeds, about an order of magnitude if I remember correctly.

    But it's true that nearly all compression algorithms don't map well at all to GPU computing.
    My crudest tool for speeding up transfers, which did work well (and is making me look deeper), detects only run-length (adjacent word) repeats and compacts the data down to squeeze out those repeats, and on the other end re-expands the runs back into words. This works really well for some data, obviously, but it's not deep at all.
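
    For reference, a sequential sketch of that word-level run-length idea; the (count, value) pair format here is just an assumption for illustration, and the real GPU version uses a parallel prefix-sum/compaction pass rather than this loop:
    Code:
    // Rough sketch: encode 32-bit words as (run length, value) pairs and expand
    // them back.  Only meant to show the transform, not the GPU implementation.
    // Incompressible data grows with this naive pair format; a real version
    // would handle literal runs separately.
    #include <cstdint>
    #include <cstddef>
    #include <vector>

    std::vector<uint32_t> rle_encode(const uint32_t* in, size_t n)
    {
        std::vector<uint32_t> out;
        for (size_t i = 0; i < n; ) {
            size_t j = i + 1;
            while (j < n && in[j] == in[i]) j++;       // extend the run
            out.push_back((uint32_t)(j - i));          // run length
            out.push_back(in[i]);                      // repeated value
            i = j;
        }
        return out;
    }

    std::vector<uint32_t> rle_decode(const std::vector<uint32_t>& in)
    {
        std::vector<uint32_t> out;
        for (size_t i = 0; i + 1 < in.size(); i += 2)
            for (uint32_t k = 0; k < in[i]; k++)
                out.push_back(in[i + 1]);              // expand the run
        return out;
    }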

  11. #11
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,474
    Thanks
    26
    Thanked 121 Times in 95 Posts
    Quote Originally Posted by Shelwien
    You may be right about potential GPU usefulness for data structure analysis,
    but for now I'm pretty sceptical about that. I tried implementing a dumb
    md5 bruteforcer with cuda, and it was barely faster than my (overclocked) cpu
    (GPU:~130Mp/s vs CPU:~100Mp/s).
    And hash cracking is a perfect task for GPU, real ones would do worse afaik,
    especially at compression / entropy estimation which would use memory
    for statistics.
    http://www.golubev.com/hashgpu.htm
    http://3.14.by/forum/viewtopic.php?f=8&t=1333
    Seems that your cracker is too dumb

  12. #12
    Expert Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 779 Times in 486 Posts
    PAQ has a RepeatModel that detects data structures that repeat at fixed intervals. It records the locations of the last 4 occurrences of each byte value and detects when 3 adjacent gaps have the same value. For example, ...A....A....A....A would detect a record length of 5. You can reject record lengths of 4 or less and use sparse models up to 4 instead. This way, when you have 3-D structure like a .bmp file of (height x width x 3), the record model will model the image width and the sparse models will model the 3 color components.
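
    A simplified reconstruction of that gap heuristic (not the actual paq source; the scan structure and the "last consistent gap wins" rule are illustrative):
    Code:
    // Rough sketch: remember the last four positions of each byte value; when the
    // three gaps between them are equal and greater than 4, report that gap as a
    // candidate record length (lengths of 4 or less are left to sparse models).
    #include <cstdint>
    #include <cstddef>

    size_t detect_record_length(const uint8_t* src, size_t size)
    {
        size_t pos[256][4] = {};              // last 4 positions of each byte value
        int    seen[256]   = {};              // occurrences so far (capped at 3)
        size_t detected = 0;
        for (size_t i = 0; i < size; i++) {
            size_t* p = pos[src[i]];
            p[3] = p[2]; p[2] = p[1]; p[1] = p[0]; p[0] = i;
            if (seen[src[i]] < 3) { seen[src[i]]++; continue; }   // need 4 hits = 3 gaps
            size_t g1 = p[0] - p[1], g2 = p[1] - p[2], g3 = p[2] - p[3];
            if (g1 == g2 && g2 == g3 && g1 > 4)
                detected = g1;                // e.g. "...A....A....A....A" -> 5
        }
        return detected;                      // 0 if no consistent gap was seen
    }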

    GEO uses an obsolete 32-bit floating point format where the exponent is a power of 16 instead of 2. The format is a non-standard predecessor to SEG-Y, which itself is a fairly ancient tape format for seismic data. I described it in http://mattmahoney.net/dc/dce.html#Section_2 section 2.1.3. There are several data traces that appear to be correlated. They have a much lower resolution than seismic sensors used today, which makes modern data harder to compress.

  13. #13
    Member
    Join Date
    Sep 2008
    Location
    France
    Posts
    863
    Thanks
    460
    Thanked 257 Times in 105 Posts
    Quote Originally Posted by GerryB View Post
    One of the real apps I'm experimenting with now is very fast compression of raw binary memory just to speed up GPU computes! A GPU has huge memory bandwidth (150GB/sec) but communicating between GPU and CPU is only about 5GB/sec. And sometimes you have to communicate GPU to GPU which uses TWO transfers. The idea is that if you can quickly compress data (in an unknown format!) you get a very useful boost in communications bandwidth.
    This is also an area I'm interested in, but the speeds mentioned here seem a bit too high to benefit from any compression scheme.
    I don't know much about GPU coding. As far as I understand, it does not work too well with branching, and most compressors are full of branching (excluding very specialized versions such as texture compression).

    Even on a modern CPU, you would be hard pressed to find a compressor faster than 300MBps per core. And I don't think multi-threading would provide linear benefits either.

    Therefore, here you have it: for compression to be beneficial in a real-time communication channel, it must be faster than the communication channel itself. And we are far from the targeted speeds.

  14. #14
    Member Alexander Rhatushnyak's Avatar
    Join Date
    Oct 2007
    Location
    Canada
    Posts
    237
    Thanks
    39
    Thanked 92 Times in 48 Posts
    Quote Originally Posted by GerryB View Post
    Ah, excellent link, that's the kind of "puzzle" data I was thinking might be detectable. That thread is also fun because it's manual analysis and shows some of the techniques that you could try to attack it with, and I love the different approaches you guys explored with it.
    There's also http://encode.su/threads/1100

  15. #15
    Member
    Join Date
    Feb 2010
    Location
    Nordic
    Posts
    200
    Thanks
    41
    Thanked 36 Times in 12 Posts
