Thread: Data interpretation techniques

  1. #1
    Member CompressMaster's Avatar
    Join Date
    Jun 2018
    Location
    Lovinobana, Slovakia
    Posts
    184
    Thanks
    49
    Thanked 13 Times in 13 Posts

    Data interpretation techniques

    I'd like to ask if it's possible to store binary files in text mode without altering the file size. I mean something like this:

    1. Any binary file can be converted to hexadecimal, but that doubles the file size, since two hex characters represent one byte.
    2. There are many data interpretation techniques (base64, octal etc.), but all of them increase the input file size.

    So, is it possible to store computer files in text mode (and convert them back losslessly) without any increase in file size at all?

    Thanks.

  2. #2
    Programmer schnaader's Avatar
    Join Date
    May 2008
    Location
    Hessen, Germany
    Posts
    611
    Thanks
    246
    Thanked 240 Times in 119 Posts
    Just leave the file as it is and display it using a character set of 256 different characters that you can distinguish. Other than that, the answer is a simple "No." - text mode usually means ASCII, which includes control codes and characters that look the same.
    But, what's your use case? E.g. analyzing data works fine using a simple hex editor.

    Though a proposal that works for most files would be: compress using cmix and convert to Base85. This decreases the file size for all files that cmix can compress by 25% or more.
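
    A minimal sketch of the compress-then-Base85 idea, not schnaader's actual setup: cmix is a standalone program, so this uses Python's zlib as a stand-in compressor (an assumption), and the helper names pack_as_text and unpack_from_text are made up for illustration. Base85 turns every 4 bytes into 5 printable characters (1.25x), so the text ends up smaller than the original whenever the compressor removes more than about 20% of the input.

```python
import base64
import zlib


def pack_as_text(data: bytes) -> str:
    """Compress, then encode the result as printable Base85 text."""
    compressed = zlib.compress(data, 9)  # stand-in for cmix (assumption)
    return base64.b85encode(compressed).decode("ascii")


def unpack_from_text(text: str) -> bytes:
    """Reverse pack_as_text losslessly."""
    return zlib.decompress(base64.b85decode(text))


# Highly redundant sample data, chosen so that the compressor has an easy job.
sample = b"the quick brown fox jumps over the lazy dog\n" * 200
text = pack_as_text(sample)
print(len(sample), "bytes ->", len(text), "printable characters")
```

    On data that does not compress (already compressed or random files), the same round trip produces text that is larger than the input, which is why this only works "for most files".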
    http://schnaader.info
    Damn kids. They're all alike.

  3. Thanks:

    CompressMaster (1st September 2019)

  4. #3
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,830
    Thanks
    287
    Thanked 1,238 Times in 694 Posts
    There's no OS with integrated data conversion, at most there's integrated compression, which can reduce the overhead:
    Code:
    Z:\052>compact /c /f /exe:lzx book1 book1.bit
    
     Compressing files in Z:\052\
    
    book1                  768771 :    335872 = 2.3 to 1 [OK]
    
     Compressing files in Z:\052\
    
    book1.bit             6150168 :    532480 = 11.6 to 1 [OK]
    
    2 files within 2 directories were compressed.
    6,918,939 total bytes of data are stored in 868,352 bytes.
    The compression ratio is 8.0 to 1.
    
    Z:\052>compact *
    
     Listing Z:\052\
     New files added to this directory will not be compressed.
    
       768771 :    335872 = 2.3 to 1 l book1
      6150168 :    532480 = 11.6 to 1 l book1.bit
    Normally you'd either just use binary i/o which is available in all popular programming languages, for example:
    https://stackoverflow.com/questions/...-i-o-in-python

    Or some standalone converter utilities, like "cat input | bit2hex | process | hex2bin > output"
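
    A minimal sketch of both options in Python, assuming a throwaway file name (sample.bin is hypothetical): the file is written and read back in binary mode, and the hex form is produced only as a temporary view that converts back losslessly.

```python
import os
import tempfile

payload = bytes(range(256))  # every possible byte value exactly once

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.bin")  # hypothetical file name
    with open(path, "wb") as f:  # "b" = binary mode, no text translation
        f.write(payload)
    with open(path, "rb") as f:
        raw = f.read()

hex_text = raw.hex()                # text view: two characters per byte
restored = bytes.fromhex(hex_text)  # back to the original bytes
print(len(raw), "bytes ->", len(hex_text), "hex characters")
```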

  5. #4
    Member Gotty's Avatar
    Join Date
    Oct 2017
    Location
    Hungary
    Posts
    399
    Thanks
    278
    Thanked 283 Times in 149 Posts
    What is a text file in your interpretation? "text" has different meanings. Usually: it's a series of human readable characters one after the other with some simple formatting (like line breaks or tabs). A classical text file has a limited range of byte values by forcing them in the (valid) ascii range.

    Any encoding that forces 256 values into a smaller range will always result in a larger file (i.e. it can not fit). For example:

    byte -> bit (2 values), represented in "0" (ascii 0x30) and "1" (ascii 0x31), or O and o: the result is 8 times larger. I know that you know that.
    byte -> nibble (16 values, i.e. 4 bits), represented in "0"-"F" (i.e. hexadecimal): the result is 2 times larger. You know that as well.
    byte -> base64 (64 values, i.e. 6 bits): the result is ~133% of the original (about 137% once MIME-style line breaks are added).
    byte -> byte (256 values) is the only representation when the result has the same size as the original. But it is not a "text" file.
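
    The sizes in the list above follow from one formula: a symbol drawn from an alphabet of N values carries log2(N) bits, so one 8-bit byte needs at least 8 / log2(N) symbols. A small sketch (the function name expansion_factor is made up here, and it assumes each output symbol is stored in one byte):

```python
import math


def expansion_factor(alphabet_size: int) -> float:
    """Lower bound on output symbols per input byte for a given alphabet size."""
    return 8 / math.log2(alphabet_size)


alphabets = [
    ("binary digits", 2),
    ("octal", 8),
    ("hexadecimal", 16),
    ("base64", 64),
    ("Base85", 85),
    ("raw bytes", 256),
]
for name, size in alphabets:
    print(f"{name:14s} {expansion_factor(size):.3f}x")
```

    Only an alphabet of 256 symbols reaches 1.0, and real encoders add a little on top of these bounds for padding and line breaks.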

    A text file is a series of bytes, where these bytes have some pre-defined interpretation. For example:

    An ASCII text file uses only byte values below 128.
    Your national character set (and mine too) is ISO-8859-2 (or Latin2). See the wikipedia article for which characters are represented by which bytes. The byte values from 128 upward have different meanings in the different standards of this family. Although it has more than 128 "valid" values, it still does not use the whole byte range (0-255)! See the gray area in the wikipedia article.
    A UTF-8 or UTF-16 text file can represent more than 256 different characters by using more bytes for such (non-English) characters. Some of these byte sequences are not valid. So you still cannot have the whole byte range in any order.
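
    A quick way to check this with Python's built-in codecs. This is a sketch under one assumption: it uses latin-1 (ISO-8859-1), a close sibling of ISO-8859-2, because that codec accepts every byte value and so shows the "valid but not printable" problem most clearly.

```python
all_bytes = bytes(range(256))  # every byte value once, in order


def decodes_as(data: bytes, encoding: str) -> bool:
    """Return True if the data is valid text in the given encoding."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


print("ascii  :", decodes_as(all_bytes, "ascii"))    # bytes >= 128 are rejected
print("utf-8  :", decodes_as(all_bytes, "utf-8"))    # invalid byte sequences
print("latin-1:", decodes_as(all_bytes, "latin-1"))  # accepted, but see below

# Even when decoding succeeds, many of the characters are control codes.
latin1_text = all_bytes.decode("latin-1")
printable = sum(ch.isprintable() for ch in latin1_text)
print("printable latin-1 characters:", printable, "of 256")
```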

    As you can see a "text" file has different meanings. But none of them fits in your criteria.

  6. Thanks:

    CompressMaster (2nd September 2019)

  7. #5
    Member CompressMaster's Avatar
    Join Date
    Jun 2018
    Location
    Lovinobana, Slovakia
    Posts
    184
    Thanks
    49
    Thanked 13 Times in 13 Posts
    Thanks Gotty! That's very useful!

    Quote Originally Posted by Gotty View Post
    What is a text file in your interpretation?
    a series of human readable characters one after the other with some simple formatting (like line breaks or tabs).

    Quote Originally Posted by Gotty View Post
    Any encoding that forces 256 values into a smaller range will always result in a larger file (i.e. it can not fit).
    Oh.

    Quote Originally Posted by Gotty View Post
    An ASCII text file uses only byte values below 128.
    Your national character set (and mine too) is ISO-8859-2 (or Latin2). See the wikipedia article for which characters are represented by which bytes. The byte values from 128 upward have different meanings in the different standards of this family. Although it has more than 128 "valid" values, it still does not use the whole byte range (0-255)! See the gray area in the wikipedia article.
    ISO-8859-1: 8 bits. 256 code points
    I thought that was the best encoding, but some of its characters are non-printable, so the 256 byte values cannot all be shown as unique visible characters.
    It seems there is no encoding (apart from Unicode, which is useless here because it uses more bytes) that fits my specifications.

    Quote Originally Posted by Gotty View Post
    A UTF-8 or UTF-16 text file can represent more than 256 different characters by using more bytes for such (non-English) characters. Some of these byte sequences are not valid. So you still cannot have the whole byte range in any order.
    That's the problem. So I'll stick with hexadecimal. Or...?
    Last edited by CompressMaster; 2nd September 2019 at 21:09. Reason: forgot to add quote tags

  8. #6
    Member Gotty's Avatar
    Join Date
    Oct 2017
    Location
    Hungary
    Posts
    399
    Thanks
    278
    Thanked 283 Times in 149 Posts
    Quote Originally Posted by CompressMaster View Post
    That's the problem. So I'll stick with hexadecimal. Or...?
    Binary and hexadecimal are the two common representations to make bytes human-readable. That's the reason they exist. They are not for storing.
    When you have a byte (a number between 0-255), like 65, its hexadecimal on-screen (!) representation is 41 (2 chars) and its binary on-screen (!) representation is 01000001 (8 chars). This byte also represents the ASCII letter "A" (one character, on screen, again!). They all mean the same -> they carry the same information. You can convert between them, like you did in school with the different numeral systems (in Slovak: Číselná sústava). Binary is base 2, hexadecimal is base 16, and byte is base 256.
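
    The same byte in its three on-screen forms, as a short sketch using only Python built-ins:

```python
value = 65  # one byte, a number between 0 and 255

as_hex = format(value, "02X")     # two hexadecimal digits
as_binary = format(value, "08b")  # eight binary digits
as_char = chr(value)              # the ASCII character with this code

print(as_hex, as_binary, as_char)  # prints: 41 01000001 A

# All three convert back to the same number, so no information is lost.
print(int(as_hex, 16), int(as_binary, 2), ord(as_char))  # prints: 65 65 65
```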

    When you open a binary file in a "hex viewer" and you set it to display the content as text or utf-8 or hexadecimal, you will see the same content in its different interpretations. If the file contains words and sentences, you will naturally want to see it as text - and you can read and will understand the content. But when it is a binary file (like the results of coin flips stored in bits, 8 coin flips per byte), you will need to switch to binary and see something like 00100101 11011001 ... And you will know the results of the flips. Actually hexadecimal is just a shorter (again: on-screen) form of binary - programmers can easily see the 4+4 bits in those two hexadecimal digits. Like in the Matrix movie.
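
    A sketch of the coin flip example, under assumptions that are not in the post: heads is stored as 1, tails as 0, and the first flip goes into the highest bit. The function names pack_flips and show_binary are made up for illustration.

```python
def pack_flips(flips: str) -> bytes:
    """Pack a string of 'H'/'T' flips into bytes, 8 flips per byte."""
    if len(flips) % 8:
        raise ValueError("this sketch expects a multiple of 8 flips")
    bits = "".join("1" if flip == "H" else "0" for flip in flips)
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))


def show_binary(data: bytes) -> str:
    """On-screen binary view of the stored bytes."""
    return " ".join(format(byte, "08b") for byte in data)


flips = "TTHTTHTH" + "HHTHHTTH"  # 16 flips
packed = pack_flips(flips)       # stored in just 2 bytes

print(show_binary(packed))   # prints: 00100101 11011001
print(packed.hex().upper())  # prints: 25D9
```

    The 16 flips occupy 2 bytes on disk; the 17-character binary view and the 4-character hex view exist only on screen.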

    Hexadecimal and binary are only for humans so that we can make some sense of non-printable bytes. For example I cannot type the bell character here. You cannot see it. It's non-printable (but audible on some systems!). However, I can show you its hexadecimal representation: 07, and binary: 00000111, because they are always printable. See? I cannot show you the bell "byte" itself, only its hexadecimal and binary forms. Hexadecimal and binary are for us, humans, so we can read and interpret non-printable characters (and other codes like coin flips).

    When storing any data always go with bytes. You don't want to store data in hexadecimal or binary.
    If you are thinking about compression - the best is the byte representation. Bytes are for computers, hexadecimal and binary are for humans.

    See also the wikipedia article
    Quote: "Hexadecimal numerals are widely used by computer system designers and programmers, as they provide a more human-friendly representation of binary-coded values". That's a good summary of my post.
    Last edited by Gotty; 4th September 2019 at 01:11.

  9. Thanks:

    CompressMaster (4th September 2019)

  10. #7
    Member JamesWasil's Avatar
    Join Date
    Dec 2017
    Location
    Arizona
    Posts
    65
    Thanks
    70
    Thanked 13 Times in 12 Posts
    Quote Originally Posted by CompressMaster View Post
    I'd like to ask if it's possible to store binary files in text mode without altering the file size. I mean something like this:

    1. Any binary file can be converted to hexadecimal, but that doubles the file size, since two hex characters represent one byte.
    2. There are many data interpretation techniques (base64, octal etc.), but all of them increase the input file size.

    So, is it possible to store computer files in text mode (and convert them back losslessly) without any increase in file size at all?

    Thanks.
    I'm not sure if this is what you're looking for or not, but when I was 13 years old (around 1993) one of my first text compression projects was to send and store binary data as a text message to share with others via teletype, with Wildcat Bulletin Board Systems (BBS), and Prodigy (before it was an internet service in the late 1990's, it was a pre-email service, like Compuserve, that existed for DOS and other platforms).

    At the time, there were no flash drives, no CD burners for the public (if there were any they were thousands of dollars still), and hardly any ZIP drive technology.

    5 1/4 floppies were popular still and 3.5" floppies had only started to become big, which meant that it was easier to lose data.

    I had thought of printing dots as binary to paper to preserve data on disks, but that didn't help for digital transmissions, and not everyone had a scanner or even a hand scanner to be able to read it back.

    It had to be something more compact than base64/binhex, entirely human-readable and decodable, and close to the size of the original text (or smaller). It did not need to be quite as good as LZW or other LZ methods with a hybrid form of compression, because speed was still important. It had to handle data that was saved as text, sent as text, or printed to a piece of paper on a dot matrix or daisywheel printer.

    It had to be fast, and able to work in any dialect of a modern basic language (and be able to port to PROLOG, LISP, TURBO PASCAL, OR C easily).

    What I came up with was WasilHex 3.2.

    It converted the data to hexadecimal, then to a 3.5 bit code (a little like a rice-coded decimal but different), which was then grouped together as 2 digits, and used a 92 character printable table to handle up to 28 more characters than binhex while implementing a wildcard digit for compression (the number 9).

    It used a 3 byte header to reserve the most common occurrences for text symbols, and compressed 2:1 any time those were found which (in most of the binary data like com and exe files, text files, and wav files during those days) happened to occur a lot.

    The result was an entirely human-readable, human decodeable-by-hand (if necessary) compressed data format and output which could be written entirely with text and printable characters, and those could be compressed further by PKZIP, ARC, or anything else you decided to use on it if you needed to.

    Even without that, data with high redundancy usually came out 25% to 33% smaller than the original.

    I did repurpose it briefly between 2005-2011 to send binary data to friends with a cell phone and a small app that would organize it as packets, and then convert it back to binary data on a phone's storage device. (Around 2005, it was used to send exes, mp3 files, and zip files various ways as text messages and other data to fax machines and as email attachments that would not be recognized or flagged by providers).

    It was a way to send binary data over a mobile network that only permitted texts to be sent in short blocks due to limited bandwidth over cellular networks at the time.

    I didn't really want to throw a license on it, but I was asked to at the time and did a quasi-license which, if anyone really needed to use it, I wouldn't have tripped on or made an issue if they did, but was told I needed one. I wanted to convert books at the library to it to save space on pages and make a more human-readable universal language, but I was pretty sure people didn't want to do math every time they read or decipher it and I more or less abandoned it around 1998.

    It was made more to solve problems and make data portability and sustainability easier at the time (most or all of which is no longer necessary with today's technology and even technology 10 years ago).

    There might still be some uses for it today if anyone wants to play around with it. It won't always represent data smaller or the same size as the original, but it will compress some things and represent it as entirely printable text, which may be the closest thing to what you were after or looking to do.

    Normally you're only going to get 1:1 or an expansion if you're converting binary to text, and 2:1 if converting from BINARY (byte values 0 to 255) to HEX (symbols 00 to FF), meaning for every byte you convert, you're going to get 2 HEX symbols. Binhex usually fell between that and 1.5 symbols per byte, while WasilHex 3.2 will give you anywhere from .5 of a symbol to 1.25 of a symbol per byte on average, based on how often things repeat.

    I still have the original Qbasic source code for it from years ago (and converted it to Turbo Basic and Visual Basic 3.0 around 1996, then QB64 and Freebasic around 2010 to 2011), but I posted a public google blog about it around 2011 to share it if anyone needs it and finds it useful for something they're doing.

    If you need to get data to or from a serial connection that only reads 7 bit ascii or text only and want to save a little time and space doing it, it still works for that as well as any type of printed storage medium.

    Here are the details and layout: https://sites.google.com/site/qbasic...hex-3-2-format

    P.S: When it was converted to VB6 around 2010, I added a program that wrote basic code to decode whatever you compressed or converted with it. The only thing you had to do was add the output code to your program and you'd be able to convert it from wasilhex32 back to regular ascii or binary again. Edited the post to attach that here if you wanted it.
    Last edited by JamesWasil; 16th January 2020 at 13:12. Reason: Added program

  11. #8
    Member JamesWasil's Avatar
    Join Date
    Dec 2017
    Location
    Arizona
    Posts
    65
    Thanks
    70
    Thanked 13 Times in 12 Posts
    File for conversion:
    Attached Files
