
Thread: Programming questions

  1. #1
    Member
    Join Date
    Jul 2020
    Location
    Guatemala
    Posts
    26
    Thanks
    2
    Thanked 1 Time in 1 Post

    Programming questions

    Just a simple question, I hope.
    If you had to store these numbers in a file, how many bits would it take? I know Java has types that indicate the size... if I'm not mistaken, below and above 256, but I'm not a programmer. So this leads me to ask: if the numbers range from 0-1000000 and you could not sort them into small and large, wouldn't that mean you have to allocate all of them at the maximum size?

    Example

    2,1238,47659423,76,24,395,...
    Last edited by Hcodec; 22nd July 2020 at 07:34.

  2. #2
    Member Gotty's Avatar
    Join Date
    Oct 2017
    Location
    Switzerland
    Posts
    554
    Thanks
    356
    Thanked 356 Times in 193 Posts
    If you store these numbers and commas as text (as is), it will take 8 bits per character (digits and commas).

    If you would like to store them as fixed-size binary numbers, then yes, the largest possible number decides the number of bytes you need per number. If all your numbers fit in the 32-bit range (0..4294967295), then 4 bytes per number is the most comfortable size, because it's native to most programming languages (on 32-bit and 64-bit platforms). Writing and reading a 32-bit integer to a byte stream is widely supported. In C# it's BinaryWriter.Write and BinaryReader.ReadInt32; in Java it's DataOutputStream.writeInt and DataInputStream.readInt.
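    As a sketch in Python (hypothetical helper names, using big-endian byte order like Java's writeInt):

```python
import struct

def write_u32s(numbers):
    # Pack each number as a big-endian unsigned 32-bit integer (4 bytes),
    # the same byte order Java's DataOutputStream.writeInt uses.
    return b"".join(struct.pack(">I", n) for n in numbers)

def read_u32s(data):
    # Read the stream back four bytes at a time.
    return [struct.unpack(">I", data[i:i + 4])[0] for i in range(0, len(data), 4)]

blob = write_u32s([2, 1238, 47659423, 76, 24, 395])
assert len(blob) == 24                       # 6 numbers x 4 bytes
assert read_u32s(blob) == [2, 1238, 47659423, 76, 24, 395]
```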

    If you would like to save storage space, and the numbers are in the 24-bit range (0..16777215), then 3 bytes per number would be the way to go, but then you would need to disassemble and assemble the integers 3 bytes at a time.
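    A sketch of the 3-byte variant in Python (hypothetical helper names):

```python
def write_u24s(numbers):
    # Each number must fit in 24 bits (0..16777215); store 3 bytes per number.
    return b"".join(n.to_bytes(3, "big") for n in numbers)

def read_u24s(data):
    # Reassemble each integer from its 3-byte group.
    return [int.from_bytes(data[i:i + 3], "big") for i in range(0, len(data), 3)]

blob = write_u24s([2, 1238, 76, 16777215])
assert len(blob) == 12                       # 4 numbers x 3 bytes
assert read_u24s(blob) == [2, 1238, 76, 16777215]
```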

    If your numbers are in the 0-1000000 range and you would like to save even more space, you can use 20 bits per number, because 20 bits cover the range 0..1048575. But then your numbers would not always start and end on byte boundaries, so you need to write and read your file not byte by byte but bit by bit. If your last number does not end on a byte boundary, that last byte is padded with zero bits at its tail. (You can store only whole bytes in a file, not individual bits.)
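    A sketch of the 20-bit packing in Python (hypothetical helper names; strings of '0'/'1' keep the bit handling easy to follow, at the cost of speed):

```python
def pack_20bit(numbers):
    # Concatenate 20-bit fields into one bit string, then pad the tail
    # with zero bits so the result is a whole number of bytes.
    bits = "".join(format(n, "020b") for n in numbers)
    bits += "0" * (-len(bits) % 8)               # zero-pad to a byte boundary
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))

def unpack_20bit(data, count):
    # The reader must know how many numbers were stored, because the
    # padding bits are indistinguishable from data.
    bits = "".join(format(b, "08b") for b in data)
    return [int(bits[i * 20:(i + 1) * 20], 2) for i in range(count)]

blob = pack_20bit([2, 1238, 999999])
assert len(blob) == 8                            # 60 bits, padded to 64
assert unpack_20bit(blob, 3) == [2, 1238, 999999]
```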

    If you really want to save the most space, you will need a more sophisticated encoding-decoding mechanism: you can use a range coder (or any equivalent coder). If your numbers are in the 0-1000000 range and equally probable, you will need log2(1000001) = 19.93157001 bits per number.
    Notice that this calculation is consistent with the 32-bit (4-byte) and 24-bit (3-byte) cases mentioned above, since log2(4294967296) = 32 and log2(16777216) = 24.

    If your numbers are not equally probable, and you know their individual probabilities, then you can go below 19.93... bits per number. Otherwise 19.93... bits per number is the limit.
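    These figures are easy to verify with a quick check:

```python
import math

# Bits per number for a uniform 0..1000000 range, and the byte-aligned cases.
assert abs(math.log2(1000001) - 19.93157001) < 1e-6
assert math.log2(4294967296) == 32.0
assert math.log2(16777216) == 24.0
```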

  3. Thanks (2):

    Hcodec (22nd July 2020),JamesWasil (23rd July 2020)

  4. #3
    Member JamesWasil's Avatar
    Join Date
    Dec 2017
    Location
    Arizona
    Posts
    103
    Thanks
    96
    Thanked 18 Times in 17 Posts
    Quote Originally Posted by Hcodec View Post
    Just a simple question, I hope.
    If you had to store these numbers in a file, how many bits would it take? I know Java has types that indicate the size... if I'm not mistaken, below and above 256, but I'm not a programmer. So this leads me to ask: if the numbers range from 0-1000000 and you could not sort them into small and large, wouldn't that mean you have to allocate all of them at the maximum size?

    Example

    2,1238,47659423,76,24,395,...
    Hi Hcodec,

    There are a few different approaches you can use for this. The first (which works, but is not very efficient because it uses an extra bit) is to encode every digit as a nibble. Digits 0 to 9 become:
    0000 = 0
    0001 = 1
    0010 = 2
    0011 = 3
    0100 = 4
    0101 = 5
    0110 = 6
    0111 = 7
    1000 = 8
    1001 = 9

    Now of course you can see that this is wasteful, even though it is a fast and easy way to represent the digits "close enough" and make them easy to read or write from whole bytes, since they're half a byte each. A more efficient way would be this:

    000 = 0
    001 = 1
    010 = 2
    011 = 3
    100 = 4
    101 = 5
    1100 = 6
    1101 = 7
    1110 = 8
    1111 = 9

    This is more efficient because you only spend an extra bit on digits 6 to 9. If your digits are mostly 0 to 5, you'll actually save a little, but every time 6 to 9 appear, it'll be a little larger than it needs to be.
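    A sketch of that variable-length code in Python (hypothetical helper names); note the code is prefix-free, since no 3-bit code begins with 11:

```python
# Digits 0-5 get 3-bit codes 000..101; digits 6-9 get 4-bit codes 1100..1111.
CODES = {d: format(d, "03b") for d in range(6)}
CODES.update({d: format(0b1100 + (d - 6), "04b") for d in range(6, 10)})

def encode(digits):
    return "".join(CODES[d] for d in digits)

def decode(bits):
    # A leading '11' always introduces a 4-bit code, so decoding is unambiguous.
    out, i = [], 0
    while i < len(bits):
        if bits[i:i + 2] == "11":                # 4-bit code for 6-9
            out.append(int(bits[i:i + 4], 2) - 6)
            i += 4
        else:                                    # 3-bit code for 0-5
            out.append(int(bits[i:i + 3], 2))
            i += 3
    return out

assert decode(encode([2, 1, 9, 0, 5, 7])) == [2, 1, 9, 0, 5, 7]
assert len(encode([0, 1, 2])) == 9               # three 3-bit codes
assert len(encode([6, 7, 8])) == 12              # three 4-bit codes
```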

    The most efficient way is to store all the digits together if you know how many there are. Usually base 2^X will give you what you need and be efficient without using any extra bits to store digits IF you know EXACTLY how many digits you need to store.

    2^0 = 1
    2^1 = 2
    2^2 = 4
    2^3 = 8
    2^4 = 16
    2^5 = 32
    2^6 = 64
    2^7 = 128
    2^8 = 256
    2^9 = 512
    2^10 = 1024
    2^11 = 2048
    2^12 = 4096
    2^13 = 8192
    2^14 = 16384
    2^15 = 32768
    2^16 = 65536
    2^17 = 131072
    2^18 = 262144
    2^19 = 524288
    2^20 = 1048576
    ...

    Now IF you know what your digit groups will be, you can represent them very compactly by fitting them to one of the ranges above. If you have 3 digits, for example, you can fit them snugly into 10 bits, since 3-digit groups from 000 to 999 are safely contained by 0 to 1023 (1024 values), with 24 slots to spare.

    If your digits come 6 at a time, you can fit 000000 to 999999, which safely fits in 2^20 = 1048576. You'll have 48576 slots left over, but you'll be able to store all of your digits without using an extra bit over what you need.

    Now if you were to store those digits 2 at a time, you'd have to use 2^7 = 128, which is 7 bits, because your digits would go from 00 to 99, with 28 slots extra.

    If you were to do 4 digits at a time, you'd run out if you tried 2^13 or 13 bits, because it handles only 0000 to 8191, and the extra 1808 digit slots you need wouldn't be there unless you went up a level to 2^14 and stored it with 14 bits.

    Now if you had 2 digits and stored them the way described in the last paragraph, you would be using 3.5 bits per digit rather than the theoretical log2(10) ≈ 3.32. How do I know? Because 7 bits (what is needed to store those 2 digits) divided by 2 is 3.5 bits per digit.

    Is it better, then, to store digits in groups of 3 if you can? Yes. It's more efficient because you'll be able to store them at 3.33 bits per symbol rather than 3.5 bits per symbol. 1048576 = 2^20 = 20 bits, which handles 00-99, 00-99, 00-99 or 000000 to 999999. And 20 bits / 6 digits = 3.33 bits per symbol.

    Notice that this is the same as using 1024, or 2^10, as 10 bits to store 3 digits from 000 to 999: you get the same 3.33 bits per symbol that way, too. Using the 20-bit field gives you the opportunity to use the extra space to go by 2's rather than 3's for digit groupings.
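    The 3-digits-in-10-bits grouping can be sketched like this (hypothetical helper name; it assumes the digit count is a multiple of 3):

```python
def pack_triples(digits):
    # Group digits three at a time; each group 000-999 fits in 10 bits.
    assert len(digits) % 3 == 0
    bits = ""
    for i in range(0, len(digits), 3):
        value = digits[i] * 100 + digits[i + 1] * 10 + digits[i + 2]
        bits += format(value, "010b")
    return bits

bits = pack_triples([9, 9, 9, 0, 0, 1])
assert len(bits) == 20                    # 6 digits at 3.33 bits each
assert int(bits[:10], 2) == 999 and int(bits[10:], 2) == 1
```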

    When you study information theory, you'll find that there is a way to determine the entropy of symbols and digits. The entropy of a uniformly random base-10 digit is log2(10) ≈ 3.32 bits per symbol. But it's not always easy to actually STORE it at entropy. There are ways to get arbitrarily close to it, like arithmetic encoding, but I've found that if you're doing it by hand or in a hurry, the fastest way is usually to group digits and assume 3.33 bits per symbol if every digit can range over 0 to 9.

    That said, there are different times when the entropy of a digit will be different, and here's a way I've found that you can determine it fast.

    Divide the number of bits required to store a max field by how many digits it contains.

    For example, let's say you're using 2 bytes, 000 to 255 + 000 to 255, which when grouped together gives you a decimal range of 00000 to 65535 (65536 possibilities).

    Now, you might notice that your entropy is different here, but let's see why. First, your first digit will always be in the range 0 to 6, which means that 7, 8, and 9 never occur there. You might notice too that when the first digit is a 6, the digit after it is always 0 to 5 and never 6, 7, 8, or 9. This means we aren't using all the space of an unrestricted base-10 digit, and as such, we can store this in less than 3.33 bits per symbol.

    65535 has 5 digits, and 2^16 = 65536, so 16 bits are required to store this number.

    16 / 5 = 3.20 bits per symbol are all that we need, IF our number satisfies the conditions where the first digit is 0 to 6, the second digit is 0 to 9 only if the first digit is 0 to 5 (and 0 to 5 if the first digit is 6), and the other 3 digits can be anything from 0 to 9.

    Now if you had 4 digits and knew that the digits could otherwise be any value, but that the first digit would always be under 8, you could get away with this:

    0000 to 7999 is 4 digits, and 7999 < 2^13 = 8192, so 13 bits are required to store this number.

    13 / 4 = 3.25 bits per symbol. This is far smaller than 3.5 bits per symbol, and smaller than our unrestricted-digit figure of 3.33 bits per symbol; it even dips below the log2(10) ≈ 3.32 bits of a free digit, which is only possible because the first digit is restricted.

    People have tried to use bit flags and other methods to specify ahead of time which type of field is needed so that a group of digits fits more compactly, but oftentimes the extra flag or header bits needed to do that make the number larger than a static way of storing it as-is.

    Now if you wanted to store these digits in a file, let's assume that your program or purpose is going to need the full range of digits from 0 to 9 and not have time to check whether they fit a 16-bit field, a 13-bit field, a 20-bit field, or anything else to save partial pieces of a bit.

    My recommendation would be to store your digits 3 at a time if you can, and use a short header to record how many digits are left over at the end, since it stores by 3's.

    That would be 2^10 or 10 bits, which handles 000 to 999 and still leaves 24 slots available. This would be 3.33 bits per symbol, but one way to push below that is to use those extra 24 codes. When you convert 999 to binary, your 10-bit number is 1111100111.

    This means that to store digits from 000 to 999, the 10 bits you use to do this will go from 0000000000 to 1111100111. It will never exceed that, either.

    You're able to use the space from 1111100111 to 1111111111 to compress the most frequent digits to fewer bits.

    You could, for example, use the 9-bit codes 111110100 to 111111111 to compress the 12 most frequent 3-digit groups to 9 bits rather than 10. That gives 3 bits per symbol for those patterns, rather than 3.33 bits per symbol like the rest.
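    As a sketch of that idea (the frequent-group list here is an arbitrary illustration, not derived from real data): every ordinary 10-bit code is at most 1111100111, so its first 9 bits are at most 499, which leaves the 9-bit values 500..511 (111110100..111111111) free to use as short codes without ambiguity:

```python
# Hypothetical choice of the 12 "most frequent" 3-digit groups.
FREQUENT = [123, 456, 789, 0, 111, 222, 333, 444, 555, 666, 777, 888]
SHORT = {g: format(500 + i, "09b") for i, g in enumerate(FREQUENT)}

def encode(groups):
    # Frequent groups get a 9-bit code; everything else is its 10-bit value.
    return "".join(SHORT.get(g, format(g, "010b")) for g in groups)

def decode(bits):
    out, i = [], 0
    while i < len(bits):
        if int(bits[i:i + 9], 2) >= 500:          # 9-bit short-code region
            out.append(FREQUENT[int(bits[i:i + 9], 2) - 500])
            i += 9
        else:                                     # ordinary 10-bit group
            out.append(int(bits[i:i + 10], 2))
            i += 10
    return out

groups = [123, 999, 42, 456]
assert decode(encode(groups)) == groups
assert len(encode(groups)) == 9 + 10 + 10 + 9     # frequent groups save a bit
```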

    Or you could use the 24 spare codes for a type of binary LZ algorithm if you wanted, to compress the digits by fractional pieces that are still evenly represented as bits and packaged as bytes in a file, getting many of the digits under 3.32 bits per symbol or closer to it.

    You have a lot of different choices here, but yes: if you need a quick and easy way to store digits and it's alright to be bulky with an extra bit per symbol, you can use a nibble of 4 bits to store 0 to 9; if you want to store more efficiently and can group by 3 or 6 digits at a time, you can store them as 2^10 or 2^20; and if you're able to model the first or second digit, you can get away with a 13-bit, 16-bit, or other field as needed.

    Another thing: even if you use a nibble, you might still be able to compress frequently seen digits, because the unused patterns 1010, 1011, 1100, 1101, 1110, and 1111 can be put to work. They can represent an RLE (run-length encoding) of your digits, covering 2, 3, 4, 5, or 6 repetitions per digit; or serve as prefix nibbles for LZ compression against a dictionary of digit strings (possibly getting between 2.4 and 3.2 bits per symbol, depending on the data); or act as digit-based symbol ranking, where 1010 represents the most frequent 2-digit group, 1011 the next most frequent, and so on. That last option gives you 6 slots, so the most frequent 2-digit groups compress to 2 bits per symbol.

    You have choices. This may be more than what you needed, but I had time to write tonight. If any part of this needs clarification, let me know and I'll explain it further, but I'm pretty sure you should see everything you need here to be able to store digits as binary and pack them into bytes for file storage, regardless of what language you decide to use.

    James

    P.S: To store your example, if those were the only digits you needed to store to a file and the program could be made aware of the fields for max numbers per group, I would go with this:

    2,1238,47659423,76,24,395

    2^2, 2^11, 2^26, 2^7, 2^5, 2^9 =

    00 00000000000 00000000000000000000000000 0000000 00000 000000000
    to
    11 11111111111 11111111111111111111111111 1111111 11111 111111111

    60 bits. This entire number is able to fit under 64 bits, which is 8 bytes.
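    The packing above can be sketched in Python, assuming writer and reader agree on the field widths in advance:

```python
# Each number gets just enough bits; both sides must know the width list.
numbers = [2, 1238, 47659423, 76, 24, 395]
widths = [2, 11, 26, 7, 5, 9]                  # smallest 2^X range per value

bits = "".join(format(n, f"0{w}b") for n, w in zip(numbers, widths))
assert all(n < 2 ** w for n, w in zip(numbers, widths))
assert len(bits) == 60                         # fits under 64 bits (8 bytes)

# Decoding walks the same width list.
decoded, i = [], 0
for w in widths:
    decoded.append(int(bits[i:i + w], 2))
    i += w
assert decoded == numbers
```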

    If your program isn't able to know the digit fields ahead of time and they had to be kept separate, then you can either make the program read the digits as you've separated them by commas and make it aware of them, so that you can recreate the separation, or store it as 21238476594237624395, which would be 20 digits * 3.33 bits = 66.6 bits. You can get it a little smaller than this, but that's an estimated figure. If you try to use flags and digit headers, you're going to exceed this, upward of 70 bits, to store the same number. And of course, if you're packing the result into 8-bit bytes, you're going to have to round up to 72 bits, store it as 9 bytes, and ignore the padding after the 67th bit.

  5. Thanks:

    Hcodec (22nd July 2020)

  6. #4
    Member
    Join Date
    Jul 2020
    Location
    Guatemala
    Posts
    26
    Thanks
    2
    Thanked 1 Time in 1 Post
    Great information, thanks to both! I wouldn't think there would be too many cases in the real world where data comes in a truly random format where the minimum and maximum sizes are extremely different. For instance, phone data, passwords, hashes, etc. are random, but the size is quite constant. But it's good to know how it is handled when it is. A wealth of information and time spent on the answers... Thanks!
    Last edited by Hcodec; 22nd July 2020 at 18:54.

  7. Thanks:

    JamesWasil (22nd July 2020)

  8. #5
    Member JamesWasil's Avatar
    Join Date
    Dec 2017
    Location
    Arizona
    Posts
    103
    Thanks
    96
    Thanked 18 Times in 17 Posts
    Quote Originally Posted by Hcodec View Post
    Great information, thanks to both! I wouldn't think there would be too many cases in the real world where data comes in a truly random format where the minimum and maximum sizes are extremely different. For instance, phone data, passwords, hashes, etc. are random, but the size is quite constant. But it's good to know how it is handled when it is. A wealth of information and time spent on the answers... Thanks!

    Very welcome! For passwords, you'd likely want to use a 7 bit ASCII format if you're looking at both digits and symbols that are from 0 to 127.

    For a phone number in the United States, you could save each number with 34 bits if it's 000-000-0000 format.

    If there are some area codes, prefixes, or numbers that are never used, such as 555, or it's a directory where 800, 888, and 900 numbers are not permitted, you can exploit those restrictions to save even further. If you're able to get the prefix numbers to fit between 000 and 511 (even if you have to remap a few numbers in the program itself), you'd only need 9 bits for the prefix rather than 10, getting it down to 33 bits per phone number.
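    A small sketch of the 34-bit layout (hypothetical helper; 10 bits for the area code, 10 for the prefix, 14 for the line number):

```python
def pack_phone(area, prefix, line):
    # 000-000-0000 format: three fields of 10 + 10 + 14 = 34 bits.
    assert area < 1000 and prefix < 1000 and line < 10000
    return format(area, "010b") + format(prefix, "010b") + format(line, "014b")

bits = pack_phone(623, 478, 3686)
assert len(bits) == 34
assert int(bits[:10], 2) == 623
assert int(bits[10:20], 2) == 478
assert int(bits[20:], 2) == 3686
```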

    This is assuming that you're storing it as-is rather than using compression techniques to get the size down further. There are also lossy omissions that save space now but can be recreated losslessly later. For example: for every phone number in a database like 623-478-3686, the two '-' symbols are 2 extra bytes per phone number added to a text, CSV, or other database that don't need to be there; if the program knows what the fields are, it can re-add those 2 bytes every time it reads a number, saving that much on each and every entry. Alternatively, the dashes can be removed from user input before they are ever stored. Likewise, email addresses with common domain hosts such as gmail.com, yahoo.com, or gmx.net can be replaced by a single byte, where the first bit is a flag saying whether it's a recognized email domain and the other 7 bits handle up to 128 hosts. By doing this, you can compress anywhere from 8 to 15 bytes per email address down to 1 byte, with only the data before the @ sign needing to be represented.

    And yes you're right, that it's rare for data to be random and not have a structure or pattern to it. Whether it's a phone number, a mailing address, an account number, or any other thing, it usually has a pattern to it or has maximum fields, and when it does, you can set boundaries to know how much space it requires at a minimum or a maximum.

    There are ways to hide data in the program itself for databases and things that are structured. When that's possible, you can save massive amounts of data on symbols... not by compressing them outright, but by never having to store them, or by representing them in other ways that can be understood and remade losslessly.

    Even when data deduplication of user data isn't possible, little things like this, representing only what is needed and getting rid of the rest, reduce data by thousands of kilobytes, megabytes, or gigabytes when totaled. The sky is the limit. And if not the sky, well then maybe Shannon's formula will be the limit, but there's still plenty of space.

