Thread: Data interpretation techniques

  1. #1
    Member
    Join Date
    Jun 2018
    Location
    Slovakia
    Posts
    132
    Thanks
    38
    Thanked 7 Times in 7 Posts

    Data interpretation techniques

    I'd like to ask whether it's possible to store binary files in text mode without altering the file size. I mean something like this:

    1. Any binary file can be converted to hexadecimal, but that doubles the file size, since two hex characters represent one byte.
    2. There are many data interpretation techniques (base64, octal, etc.), but all of them increase the input file size.

    So, is it possible to store computer files in text mode (and convert them back losslessly) without any increase in file size?

    Thanks.

  2. #2
    Programmer schnaader's Avatar
    Join Date
    May 2008
    Location
    Hessen, Germany
    Posts
    555
    Thanks
    208
    Thanked 193 Times in 90 Posts
    Just leave the file as it is and display it using a character set of 256 different characters that you can distinguish. Other than that, the answer is a simple "No." Text mode usually means ASCII, which has control codes and characters that look the same.
    But, what's your use case? E.g. analyzing data works fine using a simple hex editor.

    Though a proposal that works for most files would be: compress using cmix and convert to Base85. Base85 expands data by 25% (5 characters per 4 bytes), so this decreases the file size for every file that cmix can compress by more than 20%.
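
A rough sketch of the compress-then-Base85 idea in Python. Note the assumptions: `zlib` stands in for cmix (which is a separate program and compresses far better), and `text_safe_pack` / `text_safe_unpack` are hypothetical helper names, not anything from the post.

```python
import base64
import zlib


def text_safe_pack(data: bytes) -> bytes:
    """Compress, then encode as Base85 so the output is ASCII only."""
    # zlib is only a stand-in for cmix here (assumption).
    return base64.b85encode(zlib.compress(data, 9))


def text_safe_unpack(text: bytes) -> bytes:
    """Reverse of text_safe_pack: decode Base85, then decompress."""
    return zlib.decompress(base64.b85decode(text))


sample = b"the quick brown fox jumps over the lazy dog\n" * 200
packed = text_safe_pack(sample)

# The round trip is lossless.
assert text_safe_unpack(packed) == sample

# Base85 alone turns every 4 bytes into 5 characters (25% overhead).
overhead_demo = base64.b85encode(bytes(400))
print(len(sample), len(packed), len(overhead_demo))
```

For data that does not compress (already compressed or random files), the packed result will be larger than the input, so this only helps for "most files", as the post says.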
    http://schnaader.info
    Damn kids. They're all alike.

  3. The Following User Says Thank You to schnaader For This Useful Post:

    CompressMaster (1st September 2019)

  4. #3
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,267
    Thanks
    200
    Thanked 984 Times in 510 Posts
    There's no OS with integrated data conversion, at most there's integrated compression, which can reduce the overhead:
    Code:
    Z:\052>compact /c /f /exe:lzx book1 book1.bit
    
     Compressing files in Z:\052\
    
    book1                  768771 :    335872 = 2.3 to 1 [OK]
    
     Compressing files in Z:\052\
    
    book1.bit             6150168 :    532480 = 11.6 to 1 [OK]
    
    2 files within 2 directories were compressed.
    6,918,939 total bytes of data are stored in 868,352 bytes.
    The compression ratio is 8.0 to 1.
    
    Z:\052>compact *
    
     Listing Z:\052\
     New files added to this directory will not be compressed.
    
       768771 :    335872 = 2.3 to 1 l book1
      6150168 :    532480 = 11.6 to 1 l book1.bit
    Normally you'd either just use binary i/o which is available in all popular programming languages, for example:
    https://stackoverflow.com/questions/...-i-o-in-python

    Or some standalone converter utilities, like "cat input | bit2hex | process | hex2bin > output"
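
A minimal sketch of binary I/O in Python, with a hex round trip playing the role of the converter utilities. The file name `sample.bin` is just an assumption for illustration.

```python
import os
import tempfile

payload = bytes(range(256))  # every possible byte value once

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.bin")  # hypothetical file name
    # "wb" / "rb" open the file in binary mode: no newline or charset translation.
    with open(path, "wb") as f:
        f.write(payload)
    with open(path, "rb") as f:
        restored = f.read()
    size_on_disk = os.path.getsize(path)

# Hex is only needed when a text-only tool sits in the middle of the pipeline.
hex_text = payload.hex()
back_from_hex = bytes.fromhex(hex_text)

print(size_on_disk, len(hex_text))
```

The file on disk stays at 256 bytes, while the hex text in the middle is twice as long.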

  5. #4
    Member Gotty's Avatar
    Join Date
    Oct 2017
    Location
    Hungary
    Posts
    381
    Thanks
    261
    Thanked 269 Times in 145 Posts
    What is a text file in your interpretation? "Text" has different meanings. Usually it's a series of human-readable characters one after the other with some simple formatting (like line breaks or tabs). A classical text file has a limited range of byte values, restricted to the (valid) ASCII range.

    Any encoding that forces 256 values into a smaller range will always result in a larger file (i.e. it can not fit). For example:

    byte -> bit (2 values), represented as "0" (ASCII 0x30) and "1" (ASCII 0x31), or O and o: the result is 8 times larger. I know that you know that.
    byte -> nibble (16 values, i.e. 4 bits), represented as "0"-"F" (i.e. hexadecimal): the result is 2 times larger. You know that as well.
    byte -> base64 (64 values, i.e. 6 bits): the result is ~133% of the original (about 137% once MIME-style line breaks are added).
    byte -> byte (256 values) is the only representation when the result has the same size as the original. But it is not a "text" file.
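
The size ratios in the list above can be checked with a few lines of Python. This is only a sketch using the standard `base64` module; the sample data is arbitrary.

```python
import base64

data = bytes(range(256)) * 3  # 768 arbitrary bytes

as_bits = "".join(format(b, "08b") for b in data)  # 8 characters per byte
as_hex = data.hex()                                # 2 characters per byte
as_b64 = base64.b64encode(data).decode("ascii")    # 4 characters per 3 bytes

for name, text in [("bits", as_bits), ("hex", as_hex), ("base64", as_b64)]:
    print(f"{name:7s} {len(text):5d} chars  {len(text) / len(data):.2f}x")
```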

    A text file is a series of bytes, where these bytes have some pre-defined interpretation. For example:

    An ASCII text file will have at most 128 distinct byte values.
    Your national character set (and mine too) is ISO-8859-2 (or Latin2). See the wikipedia article for which characters are represented by which bytes. The byte values 128 and above have different meanings in the different standards of this kind. Although it has more than 128 "valid" values, it still does not use the whole byte range (0-255)! See the gray area in the wikipedia article.
    A UTF-8 or UTF-16 text file can represent more than 256 different characters by using more bytes for such (non-English) characters. Some of these byte sequences are not valid. So you still cannot have the whole byte range in any order.

    As you can see, a "text" file has different meanings. But none of them fits your criteria.
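
A small Python check of these points, assuming Python's built-in codecs. Latin-1 is used here as the single-byte example because Python maps all 256 bytes for it, including the control codes; `decodes` is a hypothetical helper name.

```python
all_bytes = bytes(range(256))  # every byte value, in order


def decodes(data: bytes, encoding: str) -> bool:
    """Return True if data is valid text in the given encoding."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


ascii_ok = decodes(all_bytes, "ascii")   # bytes >= 0x80 are not ASCII
utf8_ok = decodes(all_bytes, "utf-8")    # a lone 0x80 is not valid UTF-8

# Latin-1 decodes every byte, but many results are control characters.
latin1_text = all_bytes.decode("latin-1")
printable = sum(ch.isprintable() for ch in latin1_text)

print(ascii_ok, utf8_ok, printable)
```

So arbitrary bytes are rejected outright as ASCII or UTF-8, and even a full 8-bit charset leaves fewer than 256 characters you can actually see.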

  6. The Following User Says Thank You to Gotty For This Useful Post:

    CompressMaster (2nd September 2019)

  7. #5
    Member
    Join Date
    Jun 2018
    Location
    Slovakia
    Posts
    132
    Thanks
    38
    Thanked 7 Times in 7 Posts
    Thanks Gotty! That's very useful!

    Quote Originally Posted by Gotty View Post
    What is a text file in your interpretation?
    a series of human readable characters one after the other with some simple formatting (like line breaks or tabs).

    Quote Originally Posted by Gotty View Post
    Any encoding that forces 256 values into a smaller range will always result in a larger file (i.e. it can not fit).
    Oh.

    Quote Originally Posted by Gotty View Post
    An ascii text file will have less than 128 values.
    Your national character set (and mine too) is ISO-8859-2 (or Latin2). See the wikipedia article of which characters are represented by which bytes. The byte values above 128 have different meaning by the different such standards. Although it has more "valid" values than 128, still not the whole byte range (0-255) is used! See the gray area in the wikipedia article.
    ISO-8859-1: 8 bits. 256 code points
    I thought that was the best encoding, but some of its characters are non-printable, so it cannot represent every byte with a unique visible character.
    It seems there is no encoding (apart from Unicode, which is useless here because it uses more bytes) that fits my specifications.

    Quote Originally Posted by Gotty View Post
    An utf8, utf16 text file can represent more than 256 different characters by using more bytes for such (non-english) characters. Some of these byte sequences are not valid. So you still cannot have the whole byte range in any order.
    That's the problem. So I'm stuck with hexadecimal. Or...?
    Last edited by CompressMaster; 2nd September 2019 at 21:09. Reason: forgot to add quote tags

  8. #6
    Member Gotty's Avatar
    Join Date
    Oct 2017
    Location
    Hungary
    Posts
    381
    Thanks
    261
    Thanked 269 Times in 145 Posts
    Quote Originally Posted by CompressMaster View Post
    That's the problem. So I'm stuck with hexadecimal. Or...?
    Binary and hexadecimal are the two common representations to make bytes human-readable. That's the reason they exist. They are not for storing.
    When you have a byte (a number between 0 and 255), like 65, its hexadecimal on-screen (!) representation is 41 (2 chars) and its binary on-screen (!) representation is 01000001 (8 chars). This byte also represents the ASCII letter "A" (one character, on screen, again!). They all mean the same thing: they carry the same information. You can convert between them, like you did in school with the different numeral systems (Číselná_sústava). Binary is base 2, hexadecimal is base 16, and a byte is base 256.
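
The same byte in its three on-screen forms, as a short Python sketch (the variable names are just for illustration):

```python
value = 65  # one byte, a number between 0 and 255

as_hex = format(value, "02X")  # two hexadecimal digits
as_bin = format(value, "08b")  # eight binary digits
as_char = chr(value)           # the ASCII character with this code

print(as_hex, as_bin, as_char)

# All three convert back to the same number: same information, different notation.
assert int(as_hex, 16) == int(as_bin, 2) == ord(as_char) == value
```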

    When you open a binary file in a "hex viewer" and you set it to display the content as text or UTF-8 or hexadecimal, you will see the same content in its different interpretations. If the file contains words and sentences, you will naturally want to see it as text - and you can read and will understand the content. But when it is a binary file (like the results of coin flips stored in bits, 8 coin flips per byte) - you will need to switch to binary and see something like 00100101 11011001 ... And you will know the results of the flips. Actually hexadecimal is just a shorter (again: on-screen) form of binary - programmers can easily see the 4+4 bits in those two hexadecimal digits. Like in the Matrix movie.
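
The coin-flip example can be sketched in Python like this. `pack_flips` is a hypothetical helper, and the choice of putting the first flip in the highest bit is an assumption.

```python
def pack_flips(flips):
    """Pack 8 coin flips (0 or 1 each) into one byte, first flip in the highest bit."""
    if len(flips) != 8:
        raise ValueError("exactly 8 flips are needed for one byte")
    byte = 0
    for bit in flips:
        byte = (byte << 1) | bit
    return byte


flips = [0, 0, 1, 0, 0, 1, 0, 1]  # the first group from the post: 00100101
packed = pack_flips(flips)

as_bin = format(packed, "08b")  # what a viewer shows in binary mode
as_hex = format(packed, "02X")  # the same byte as two hex digits (4 + 4 bits)

print(packed, as_bin, as_hex)
```

Stored as a byte, the eight flips take one byte; written out as "0"/"1" characters they would take eight.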

    Hexadecimal and binary are only for humans so that we can make some sense of non-printable bytes. For example, I cannot type the bell character here. You cannot see it. It's non-printable (but audible on some systems!). However, I can show you its hexadecimal representation, 07, and its binary representation, 00000111, because those are always printable. See? I cannot show you the bell "byte" itself, only its hexadecimal and binary forms. Hexadecimal and binary are for us, humans, so we can read and interpret non-printable characters (and other codes like coin flips).

    When storing any data, always go with bytes. You don't want to store data in hexadecimal or binary.
    If you are thinking about compression, the byte representation is the best. Bytes are for computers; hexadecimal and binary are for humans.

    See also the wikipedia article
    Quote: "Hexadecimal numerals are widely used by computer system designers and programmers, as they provide a more human-friendly representation of binary-coded values". That's a good summary of my post
    Last edited by Gotty; 4th September 2019 at 01:11.

  9. The Following User Says Thank You to Gotty For This Useful Post:

    CompressMaster (4th September 2019)
