Uploaded my ACB materials: http://ctxmodel.net/files/ACB.rar
Anybody figure this out? The paper by Buyanovsky describes a "funnel of analogies" that looks to me like a BWT transform. However, it is not BWT. It looks like the context-sorted pointers are only used to find matching contexts, and the input is compressed in its original order? The paper is very hard to read.
It's not BWT. If anything, it's LZ with advanced coding of matches.
First, the match offset is encoded as the offset of the best-matching string
in the sorted list of contexts. Then, the neighbours of that
best-match string are used to determine the minimal match length,
and there's tricky coding for the remaining match length too: it's stored
as the number of occurrences of the first mismatching symbol.
book1 result with acb 2.00c (best mode) is 222,690.
But it seemed really good back in 1997, and it was relatively
famous in Russia thanks to an article in a popular magazine.
I tried ACB on the LTCB but it failed on enwik9. It has a 64 MB file size limit, so I divided enwik8 into 2 equal-sized parts and enwik9 into 16 equal-sized parts. On enwik8 it compressed to 25,063,656 in 1359 sec. and decompressed in 1347 sec. using 16 MB memory and max compression (command u). On enwik9, after running several hours, I got this:
e9 is a subdirectory containing enwik9 in 16 parts named 01 through 16, each 62,500,000 bytes. The output when it crashed:

Code:
C:\tmp>timer acb u x9 e9

Timer 3.01  Copyright (c) 2002-2003 Igor Pavlov  2003-07-10

Associative Coder of Buyanovsky ACB ver.2.00c (Shareware)
Copyright(C) 1994-97 George Buyanovsky  04.25.1997

Compress ( MAX_mode ) : e9
Data size= 1000000208
Passed... 386402304 bytes

ILLEGAL INTERRUPT #00
eax = 00000140   esi = 00000024   flags = 3257   ds = 01C7
ebx = 00000000   edi = 00AF9503   eip = 00005B29   es = 01C7
ecx = 00000004   ebp = FE507E18   cs  = 01BF      fs = 0000
edx = 00000000   esp = FE507E08   ss  = 01B7      gs = 0000

Kernel Time  =     0.000 = 00:00:00.000 =   0%
User Time    =     0.000 = 00:00:00.000 =   0%
Process Time =     0.000 = 00:00:00.000 =   0%
Global Time  = 43518.070 = 12:05:18.070 = 100%
Code:
25,063,656  X8.ACB (enwik8 OK)
657,327,616 X9.00B
632,878,592 X9.01B
184,361,987 X9.02B
I was able to test ACB on enwik9 by producing 16 separate archives. It ranks 61 of 126. http://mattmahoney.net/dc/text.html
Found this in comp.compression:
> Dear compression gurus,
>
> Does anyone recognize this compressed file:
>
> http://gdcm.sourceforge.net/thingies/dummy.acb
>
> This is reported as 'ACB archive data' but I cannot open it from my
> debian/linux machine. I tried with p7zip, cabextract, unzip & gzip.
ACB was a compression format from the mid-1990s created by George
Buyanovsky and was called Associative Coding. It can only be
decompressed using the ACB utility, which is a DOS command-line
executable. You'll need to run a DOS emulator if all you have is
Linux. One place I found the decompressor is ACB200C.ZIP located
here: http://www.qumran.org/ftp/local/dos/app/arc/files.htm
This being a group mostly for the research and technical details of
compression methods, here are some obligatory links:
http://www.stringology.org/DataCompr.../index_en.html
http://cbloomrants.blogspot.com/2008/09/09-27-08-2.html
...which references:
http://siret.ms.mff.cuni.cz/skopal/pub/acb.pdf
http://www.cbloom.com/news/leoacb.html
http://www.cbloom.com/news/bygeorge.html
Some related info: http://compression-links.info/LinkLi...&Search=Search
Thanks for the post.
Tsk... Classic.
It's just a slow compressor, but I really take my hat off to it.
Really curious whether it would do well on data filtered
with the filters available today...
Testing begins.
Quoting Charles Bloom:
---If you see a context like "what the ", then it's highly likely the next sequence is "fu@k"---
Haha.
Here is my take on this algorithm.
(Although it is pretty slow, it does provide a reasonable level of compression.)
Some terminology that I use to describe it:
Context = the bytes preceding the data being compressed (read from right to left from the current position)
Content = the bytes following the context (read from left to right from the current position)
Context | Content
Assume a context window holding only pointers into a text buffer (the history), with new pointers added so as to maintain sorted order.
That is, the pointers are sorted according to the context at the current position to be processed. To find the next match, the coder must first
identify the associative list of contexts. To do this, it first finds the best-matching context in the sorted context window, then groups the contexts above and below it: for example, if the best context is found at index 400, the associative list might run from 300 to 500 (depending on your parameter).
Now sort this list by content, and within this associative list try to find the best content match. This is a bit tricky, because you don't actually find the exact string.
Assume we find that our string falls between entry 410 (matched length 20) and entry 411 (matched length 24) of the content-sorted associative list.
Now all we have to output is where the match falls in the list, i.e. 110 (410-300), and the decoder can decode the string from that.
The decoder does the same thing as the encoder: first it identifies the associative list of contexts, which is 300 to 500 (since the best context it finds is 400).
Then it reads 110 from the input, which means the match is at 410 (300+110). To decode the match it looks at entries 410 and 411 and keeps outputting bytes as long as they agree.
And that is how it can decode without needing an explicit offset and length.
Later versions of ACB seem to add a few more ideas, e.g. handling the case where the actual match length is longer than the common prefix implied by the contents:
then the output would be a (position, mismatched byte, extra length) triple. This can probably be improved further.
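Here is a minimal sketch of the decode-side trick described above. This is my own hypothetical illustration (not Buyanovsky's code); assocList stands for the content-sorted associative list from the example, and the index is the only value the encoder needs to send.

Code:
// Hypothetical sketch of the decoding step described above (not ACB source code).
// assocList holds the contents of the associative list, already sorted by content;
// idx is the transmitted index (110 in the example, i.e. entry 410 of the list).
#include <cstddef>
#include <string>
#include <vector>

std::string decodeMatch(const std::vector<std::string>& assocList, std::size_t idx) {
    const std::string& lo = assocList[idx];      // entry 410 in the example
    const std::string& hi = assocList[idx + 1];  // entry 411 in the example
    std::string out;
    // Copy bytes while both neighbours agree; the first disagreement ends the match,
    // so neither an explicit offset nor an explicit length is needed.
    for (std::size_t i = 0; i < lo.size() && i < hi.size() && lo[i] == hi[i]; ++i)
        out.push_back(lo[i]);
    return out;  // in the example: the 20 bytes shared by entries 410 and 411
}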
Thanks for reading!
This basic idea can be formalized in terms of suffix arrays. Suppose T[i] is the ith character of the text, SA[i] is the text position of the ith suffix, and ISA[i] is the inverse of SA, which gives the index into SA of the suffix beginning at text position i.
One approach is, for each character, find the position of the character that sorts just after it (or before, it doesn't matter). Let's say NextSuff[i] = SA[ISA[i] + 1]. Then, for each character after the first, you can store NextSuff as the delta, NextSuff[i] - NextSuff[i-1]. Most of the time, the delta = 1, so this function is compressible. You can get back the identity of the character, because NextSuff is like a linked list that links all the suffixes in sort order. So if you have the character frequencies, you can fill in the characters by iterating the list. The compression you get from this is not great, but there is plenty of room for tweaks. For one thing, you can look at the suffixes before and after the current one, and if you're inside of a common prefix, you could avoid encoding that character entirely.
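To make that concrete, here is a rough sketch of my own (hypothetical helper names, using the definitions above; SA and ISA are assumed to come from any standard suffix-array construction) that builds NextSuff and the delta stream an encoder would actually compress:

Code:
// Sketch of the NextSuff idea from the paragraph above (illustration, not ACB code).
// SA[i]  = text position of the i-th suffix in sorted order
// ISA[i] = rank (SA index) of the suffix starting at text position i
#include <cstdint>
#include <vector>

std::vector<int64_t> nextSuffDeltas(const std::vector<int32_t>& SA,
                                    const std::vector<int32_t>& ISA) {
    const int32_t n = static_cast<int32_t>(SA.size());
    std::vector<int32_t> nextSuff(n);
    for (int32_t i = 0; i < n; ++i) {
        const int32_t rank = ISA[i];
        // NextSuff[i] = SA[ISA[i] + 1]; the last suffix in sort order has no
        // successor, so a sentinel is stored for it here.
        nextSuff[i] = (rank + 1 < n) ? SA[rank + 1] : -1;
    }
    std::vector<int64_t> deltas;                 // what actually gets encoded
    for (int32_t i = 1; i < n; ++i)
        deltas.push_back(static_cast<int64_t>(nextSuff[i]) - nextSuff[i - 1]);
    return deltas;                               // mostly 1s on redundant text
}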
A slightly different approach is, rather than outputting the text position of the next sorted suffix, you could output its SA index, as a delta. The function that iterates forward in the SA is usually called psi in the compression research papers. The function that iterates backward (e.g., in the BWT) is LF. psi and LF are inverses of each other. PSI[i] = ISA[SA[i]+1]. PSI takes the current SA index and returns the index of the following text position. To make it compressible, PSI is encoded as the delta. NextSuff[i] is the suffix that sorts after the current one, so you could output ISA[NextSuff[i]+1]-ISA[i+1] as the sort distance to the next suffix (distance from the previous suffix works the same way). Since the decoder won't have seen some of the suffixes yet, you could revise the calculation to omit these, using SA to get the text positions in the intervening range. Reproducing the text from the compressor output would be analogous to reading forward in the SA using psi. You could use the same trick here of omitting the common prefix as in approach #1. Another possible improvement is to compute the SA distance between the two suffixes above and below the new insertion point (delta psi) and use this bound to store the new delta more succinctly (if there is a nonzero gap). This might save bits compared with gamma or delta coding. Arithmetic coding may be a good option using this restricted range.
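A corresponding sketch for the second approach (again my own illustration, not an existing implementation): psi as defined above, computed directly from SA and ISA. An encoder would then store its deltas, possibly bounded by the gap between the neighbouring suffixes as suggested.

Code:
// Sketch of PSI[i] = ISA[SA[i] + 1] from the paragraph above (hypothetical helper).
#include <cstdint>
#include <vector>

std::vector<int32_t> computePsi(const std::vector<int32_t>& SA,
                                const std::vector<int32_t>& ISA) {
    const int32_t n = static_cast<int32_t>(SA.size());
    std::vector<int32_t> psi(n);
    for (int32_t i = 0; i < n; ++i) {
        const int32_t nextPos = SA[i] + 1;
        // Wrap around at the end of the text, as is conventional for psi.
        psi[i] = ISA[nextPos < n ? nextPos : 0];
    }
    return psi;  // an encoder would store psi[i] - psi[i-1] (or a bounded delta)
}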
I played around a little bit with both approaches, but I haven't made a working encoder/decoder yet. I didn't know about ACB before now, so it's useful to know that this style of compressor has been implemented successfully and showed promising results. It sounds like Buyanovsky implemented a lot of the tweaks I had in mind. Using a suffix array as a starting point would almost certainly make it faster, and ought to be significantly easier to experiment with.
Was asked to compile ACB source by @Lucas.
http://nishi.dreamhosters.com/u/acb_v0.rar
ac.c is almost untouched source from http://ctxmodel.net/files/ACB.rar
ac.cpp is processed with a formatter and fixed to work as C++.
Keep in mind that it's not the "final" ACB 2.00c (dated 1997), but some early version (dated 1994).