Results 1 to 9 of 9

Thread: Compression idea: Base conversion

  1. #1
    Member Nightgunner5's Avatar
    Join Date
    Sep 2009
    Location
    Wisconsin, USA
    Posts
    4
    Thanks
    0
    Thanked 0 Times in 0 Posts

    Lightbulb Compression idea: Base conversion

    Since most files (especially plain text) don't use all 256 possible bytes, it should be possible to take all of the bytes that are used, remove unused byte values, so abd (97, 98, 100) would become 0, 1, 2, and then change the base from the number of unique bytes in the file to 256. That means abd (97, 98, 100) would compress to a single byte "21" along with a header that lists the byte translations. It wouldn't compress 4 character files very well, but a file with repeated text or just natural English would compress very well.

    I tried to program a compressor for this, but I found out I'd need a 2048-bit integer to be able to do this the way I think would be the easiest.

    Code:
    #include <stdio.h>
    #include <stdlib.h>
    
    int cmp( const void* a, const void* b ) {
    	return ( *(char*)a - *(char*)b );
    }
    
    void make_header( char* bytes, int count, int size, FILE* o ) {
    	int i;
    	char tmp;
    
    	for ( i = 0; i < count; i++ ) {
    		if ( bytes[i] != '\0' && bytes[i] != '-' ) {
    			if ( i + 1 == count )
    				fwrite( &bytes[i], 1, 1, o );
    			else if ( bytes[i] == bytes[i - 1] + 1 && bytes[i] != bytes[i + 1] - 1 )
    				fwrite( &bytes[i], 1, 1, o );
    			else if ( bytes[i] != bytes[i - 1] + 1 && bytes[i] == bytes[i + 1] - 1 ) {
    				fwrite( &bytes[i], 1, 1, o );
    				fwrite( "-", 1, 1, o );
    			}
    		} else {
    			if ( i + 1 == count ) {
    				fwrite( "\0", 1, 1, o );
    				fwrite( &bytes[i], 1, 1, o );
    			} else if ( bytes[i] == bytes[i - 1] + 1 && bytes[i] != bytes[i + 1] - 1 ) {
    				fwrite( "\0", 1, 1, o );
    				fwrite( &bytes[i], 1, 1, o );
    			} else if ( bytes[i] != bytes[i - 1] + 1 && bytes[i] == bytes[i + 1] - 1 ) {
    				fwrite( "\0", 1, 1, o );
    				fwrite( &bytes[i], 1, 1, o );
    				fwrite( "-", 1, 1, o );
    			}
    		}
    	}
    
    	fwrite( "\0\1", 1, 2, o );
    
    	i = 4;
    	while ( i-- > 0 ) {
    		tmp = ( size >> ( 8 * i ) ) & 0xFF;
    		fwrite( &tmp, 1, 1, o );
    	}
    }
    
    int main( int argc, char** argv ) {
    	FILE* f = fopen( "1.txt", "rb" );
    	if ( !f ) {
    		perror( "Failed to open input file" );
    		exit( 1 );
    	}
    
    	FILE* o = fopen( "1.bin", "wb" );
    	if ( !o ) {
    		perror( "Failed to open output file" );
    		exit( 1 );
    	}
    
    	char bytes[256] = {0},
    		 revbytes[256] = {0},
    		 curbyte[1] = {0};
    	int numbytes = 0,
    		i = 0;
    
    	while ( fread( curbyte, 1, 1, f ) ) {
    		for ( i = 0; i < numbytes; i++ )
    			if ( bytes[i] == curbyte[0] ) {
    				i = 257;
    				break;
    			}
    
    		if ( i != 257 ) {
    			bytes[numbytes] = curbyte[0];
    			numbytes++;
    		}
    	}
    
    	qsort( bytes, numbytes, sizeof( char ), cmp );
    
    	fseek( f, 0, SEEK_END );
    
    	make_header( bytes, numbytes, ftell( f ), o );
    
    	rewind( f );
    
    	for ( i = 0; i < numbytes; i++ )
    		revbytes[(int)bytes[i]] = i;
    
    	// We need a 2048-bit integer to continue past this point easily.
    
    	return 0;
    }

  2. #2
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,375
    Thanks
    214
    Thanked 1,023 Times in 544 Posts
    Any arithmetic coder with uniform probability distribution
    would give you a very similar result - not the precise
    representation of the data using a different numeral system,
    but practically the same file size.

    Also there's even an arithmetic coder that does the reverse
    thing - encodes any binary data using a specified base-b
    numeral system :).

    Its the right direction though - just collect statistics
    for strings instead of bytes, and take probabilities into
    account instead of only presence/absence of something,
    and you'd have a CM compressor ;)
    Last edited by Shelwien; 28th October 2009 at 03:27.

  3. #3
    Member Nightgunner5's Avatar
    Join Date
    Sep 2009
    Location
    Wisconsin, USA
    Posts
    4
    Thanks
    0
    Thanked 0 Times in 0 Posts
    I don't understand - If you add all of the letters, numbers, and symbols on a keyboard, you still have less than 100. Each byte has room for 256, so a normal text file could be compressed to about a little more than a third of its original size.

    How is about a third of the original size not good compression?

    My compression algorithm doesn't use any special logic, it just converts to a smaller number system.
    Last edited by Nightgunner5; 29th October 2009 at 20:37.

  4. #4
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,507
    Thanks
    742
    Thanked 665 Times in 359 Posts
    Quote Originally Posted by Nightgunner5 View Post
    I don't understand - If you add all of the letters, numbers, and symbols on a keyboard, you still have less than 100. Each byte has room for 256, so a normal text file could be compressed to about a little more than a third of its original size.
    and sequence of chars '0' and '1' should be compressible to 1/128 of original?

  5. #5
    Member
    Join Date
    Sep 2008
    Location
    France
    Posts
    863
    Thanks
    461
    Thanked 257 Times in 105 Posts
    Quote Originally Posted by Nightgunner5 View Post
    I don't understand - If you add all of the letters, numbers, and symbols on a keyboard, you still have less than 100. Each byte has room for 256, so a normal text file could be compressed to about a little more than a third of its original size.

    How is about a third of the original size not good compression?

    My compression algorithm doesn't use any special logic, it just converts to a smaller number system.

    It's not correct logic.
    If you have approximately 100 symbols, then, you'll need approximately 7 bits per char...

  6. #6
    Member Nightgunner5's Avatar
    Join Date
    Sep 2009
    Location
    Wisconsin, USA
    Posts
    4
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Quote Originally Posted by Bulat Ziganshin View Post
    and sequence of chars '0' and '1' should be compressible to 1/128 of original?
    No, it would be 1/8 of the size.

    Quote Originally Posted by Cyan View Post
    It's not correct logic.
    If you have approximately 100 symbols, then, you'll need approximately 7 bits per char...
    Don't think about it as a stream of bytes, think of it as a really big number. 100 symbols is base 100, and a 1000 digit base 100 number is about 700 digits of base 256. (if my math is right)

  7. #7
    Programmer schnaader's Avatar
    Join Date
    May 2008
    Location
    Hessen, Germany
    Posts
    566
    Thanks
    217
    Thanked 200 Times in 93 Posts
    Quote Originally Posted by Nightgunner5 View Post
    Don't think about it as a stream of bytes, think of it as a really big number. 100 symbols is base 100, and a 1000 digit base 100 number is about 700 digits of base 256. (if my math is right)
    That doesn't change much. You can compress it better than 7 bit per byte (for base 12 that way, but you will only get it down to log(100)/log(2)=6,64 bit/byte even with the best base conversion or arithmetic compression. By the way, you're saying the same because 1000 digits (=bytes) in base 100 get to 700 digits (=bytes) in base 256. That's not a third, but 70% of the original size (or to be exact the mentioned 66,4%).
    http://schnaader.info
    Damn kids. They're all alike.

  8. #8
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,375
    Thanks
    214
    Thanked 1,023 Times in 544 Posts
    Ok, here's something more clear:

    Code:
    Z:\ng.exe BOOK1
    Collecting stats. Done.
    File length = 768771, alphabet size = 82
    1. Arithmetic code estimate: 435161.9 bytes
    2. Base-82 code size estimate: 29.5+610937.7=610967.2
    "29.5" is arithmetic code length for the alphabet bitmask.

    Also, I'd repeat: your idea is not wrong anywhere, just that
    modern statistical compressors already use that (and much more).
    Attached Files Attached Files

  9. #9
    Member
    Join Date
    Sep 2008
    Location
    France
    Posts
    863
    Thanks
    461
    Thanked 257 Times in 105 Posts
    Quote Originally Posted by Nightgunner5 View Post
    If you add all of the letters, numbers, and symbols on a keyboard, you still have less than 100. Each byte has room for 256, so a normal text file could be compressed to about a little more than a third of its original size.
    This version is not correct.
    How many characters would you then store into a single number base 65536 such as an unsigned short ? 655 characters base 100 ?


    Quote Originally Posted by Nightgunner5 View Post
    Don't think about it as a stream of bytes, think of it as a really big number. 100 symbols is base 100, and a 1000 digit base 100 number is about 700 digits of base 256. (if my math is right)
    1000 digit base 100 number = 10^(2*1000) ~ 10^(3*666) ~2^(10*666) ~2^(8*833) ~256^833 ==> compression to 83% of initial size.

    Yes, this version sounds correct.
    Now the compression rate is not the same...


    My piece of advise : Focus a bit more on decoding, you'll get some idea
    Last edited by Cyan; 30th October 2009 at 12:24.

Similar Threads

  1. Idea: Combine Compression & Encryption
    By dirks in forum Data Compression
    Replies: 16
    Last Post: 22nd February 2010, 11:49
  2. Idea for raising compression efficiency on disk images
    By Mexxi in forum Data Compression
    Replies: 10
    Last Post: 18th February 2010, 05:56
  3. Idea to make new site about data compression
    By Piotr Tarsa in forum Data Compression
    Replies: 1
    Last Post: 14th August 2009, 21:22
  4. Idea - Don't flame me pls...
    By rubendodge in forum Data Compression
    Replies: 7
    Last Post: 24th April 2009, 20:44
  5. New Idea for Hybrid 7-Zip Archiver
    By DeathTheSheep in forum Data Compression
    Replies: 22
    Last Post: 30th December 2008, 22:57

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •