
Thread: Bit guessing game

  1. #1
    Shelwien (Administrator)

    Bit guessing game

    http://ctxmodel.net/files/guess/bit_guess_v0.rar

    A friend has a university assignment where he
    has to write a program that guesses bits chosen by the user.

    Somehow he thought that a statistical model for that
    could be simpler than the ones used for file compression,
    so his results were rather poor, and he asked me for advice.

    So today I got some input data (~6 kbits, which is rather
    good for manually entered stuff, but painfully little for
    context modelling) and patched up a model for it.

    As expected, the data turned out to be nearly random,
    but I still managed to find some skew in it.

    Code:
     6032  wes.txt (original)
    
     5995  wes.txt (after CRLF->LF transform + LF in the end)
      819  wes.ash-s2-o4  
      882  wes.txt.paq8k2 (all paq headers are removed)
      836  wes.txt.paq8px 
    
     5950  wes1.txt (LFs removed)
      760  wes1.ash-s2-o3 
      820  wes1.txt.paq8k2
      783  wes1.txt.paq8px
    
      744  wes2.bin uncompressed (743.75 in fact)
      803  wes2.ash-s0-o1
      783  wes2.bin.paq8px 
      840  wes2.bin.paq8k2
      740  mix(mix(p1,p2),0.5)
    
      731  bitguess v0 (actual rc output)
      728.1 entropy; 3221 of 5950 guessed (54.13%)
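
    For reference, the kind of model involved is an adaptive context model
    over previous bits; here is a minimal sketch of an order-k predictor
    with linear counters (illustration only, not the actual bitguess code;
    ORDER and RATE are made-up values):
    Code:
    // Order-k bit predictor with linear counters (illustrative sketch).
    #include <cstdio>

    const int    ORDER = 4;     // bits of context (assumed)
    const double RATE  = 0.05;  // counter adaptation rate (assumed)

    int main() {
      double p[1 << ORDER];                      // P(next bit = 1) per context
      for (int i = 0; i < (1 << ORDER); i++) p[i] = 0.5;

      int ctx = 0, right = 0, total = 0, bit;
      while (scanf("%1d", &bit) == 1) {          // stream of '0'/'1' chars
        int guess = p[ctx] > 0.5;                // predict before seeing the bit
        right += (guess == bit);
        total++;
        p[ctx] += RATE * (bit - p[ctx]);         // linear counter update
        ctx = ((ctx << 1) | bit) & ((1 << ORDER) - 1);
      }
      printf("%d of %d guessed (%.2f%%)\n", right, total, 100.0*right/total);
      return 0;
    }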

  2. #2
    Bulat Ziganshin (Programmer)
    The correct experiment would be to build a model for the first half
    of the data and then check it on the second one; otherwise, you may
    have just moved entropy into model selection.

  3. #3
    schnaader (Programmer)
    Quote Originally Posted by Bulat Ziganshin
    The correct experiment would be to build a model for the first half of the data and then check it on the second one; otherwise, you may have just moved entropy into model selection.
    Is anything known about how the bits have been entered? This is good advice, but it could fail if the first half of the data was entered by person A and the second half by person B, or something similar. Anyway, if we don't know anything, splitting the data up in different ways by trial and error is our best choice. Perhaps, in this case, we should keep track of the original CR/LFs and prefer those for splitting.

    If the data has been manually entered but comes from some random source (dice roll, card deck...), then the whole thing will be useless and we won't get any better than a perhaps-lucky 60%.
    But if this is not the case, we might succeed with some "human" models like the following (a sketch follows the list):

    • Humans will avoid long runs of 0's or 1's
    • Humans might tend to favor 0's or 1's
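
    A minimal sketch of such a "human" model (illustration only; the run
    threshold and break probabilities are made-up values):
    Code:
    // Combines a global symbol bias with run avoidance (illustrative sketch).
    double predict_one(int run_len, int last_bit, int n0, int n1) {
      // Global bias: humans might favor one symbol overall.
      double p1 = (n1 + 1.0) / (n0 + n1 + 2.0);   // Laplace-smoothed P(1)
      // Run avoidance: the longer the current run, the likelier they break it.
      if (run_len >= 3) {
        double brk = 0.5 + 0.1 * (run_len - 2);   // made-up break probability
        if (brk > 0.9) brk = 0.9;
        p1 = last_bit ? (1 - brk) : brk;          // breaking the run flips the bit
      }
      return p1;                                  // estimated P(next bit = 1)
    }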


    Anyway, as it's only a few thousand bits of data, you'll most likely end up moving entropy into the model. The cleanest test would be to write an assembler program that really compresses the data (executable size smaller than 744 bytes).

  4. #4
    Shelwien (Administrator)
    > the correct experiment would be to build a model for the first
    > half of the data and then check it on the second one; otherwise,
    > you may have just moved entropy into model selection

    Sure, I just left the checking to the guy, who can ask some
    students to enter more data :)
    That's also why I didn't try any advanced methods
    (delayed counters etc.) there, and stopped at just 4 counters.
    Anyway, the data are fairly unpredictable.

    > Is anything known about how the bits have been entered?

    Manually. That's the whole idea.
    In fact, I also have a fairly neat C# app for this "game",
    but I won't post it without permission.

    > This is good advice, but it could fail if the first
    > half of the data was entered by person A and the second
    > half by person B, or something similar.

    There are a few "sessions" in the data which are certainly
    somewhat different, but the model is adaptive.

    > Anyway, if we don't know anything, splitting the data up
    > in different ways by trial and error is our best choice.

    There are no visible patterns, and overall it seems too random
    for any thorough analysis with just this much data.

    > Perhaps, in this case, we should keep track of the
    > original CR/LFs and prefer those for splitting.

    Sure, but again, with only 5950 bits it doesn't seem useful.
    Note that the paqs actually expanded the data, and expanded the
    version with LFs removed less than the original one.

    > If the data has been manually entered but comes from some
    > random source (dice roll, card deck...), then the whole
    > thing will be useless and we won't get any better than
    > a perhaps-lucky 60%.

    That's still better than my 54% ;)
    But actually, even getting 50% with this kind of input
    might not be as easy as it seems ;)

    > But if this is not the case, we might succeed with some "human" models, like:
    > * Humans will avoid long runs of 0's or 1's

    I thought about this too, but moderately long runs are there,
    and there's not enough data for longer runs to appear by chance.

    > * Humans might tend to favor 0's or 1's

    '0' 2887
    '1' 3063 (51.5% ones)

    Sure, but that bias alone is not of much use: its binary entropy
    is ~0.9994 bits/bit, i.e. under 4 bits saved over all 5950 bits.

  5. #5
    Member
    Can somebody attach bit_guess_v0.rar here, or send it by mail to me (wwy@mailbox.gxnu.edu.cn)?
    I can't access Shelwien's website.

  6. #6
    Shelwien (Administrator)
    Here it is.
    Attachment: bit_guess_v0.rar

  7. #7
    Shelwien (Administrator)
    Received the expected data update...
    and got the expected result with it: a 48% overall guess rate
    (both with the new data only and with old+new).

    Also, simple model retuning didn't really help, and I had to
    add a (simple) secondary model to improve the rate.
    It just sets the predicted bit to the previous bit when
    the estimated probability is near 0.5 - and just that
    got me ~100 extra guesses.
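
    The rule is simple enough to state in code; a minimal sketch of it
    (not the actual bit_guess_v1 source; THRESHOLD is a made-up name
    and value):
    Code:
    #include <cmath>

    const double THRESHOLD = 0.02;   // how close to 0.5 counts as "near" (assumed)

    // p1   = model's estimated probability of a 1 bit
    // prev = previously seen bit
    int predict(double p1, int prev) {
      if (std::fabs(p1 - 0.5) < THRESHOLD)
        return prev;                 // model is unsure -> repeat the previous bit
      return (p1 > 0.5) ? 1 : 0;     // otherwise trust the model
    }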

    http://www.ctxmodel.net/files/guess/bit_guess_v1.rar

    Code:
    1 1 0.75645   728.5; 3298 of 5950 guessed (55.43%)
    0 0 0.35058   898.5; 4009 of 7315 guessed (54.81%)
    
    4009-3298=711       correct guesses on the new data alone
    7315-5950=1365      new bits
    711*100/1365=52.09  guess rate (%) on the new data alone

  8. #8
    Shelwien (Administrator)
    Decided to evaluate some precise probability estimators (computed from
    bit counts) instead of the linear counters used in the bit_guess model.
    Code:
    914.25          // raw 7314 bits
    (1)      (2)
    914.73  1008.36 // (n1>0) ? n0/(n0+n1) : 0.5;
    914.81   992.28 // (n0+0.5)/(n0+n1+1)
    914.73   978.42 // (n0+1)/(n0+n1+2)
    914.59   968.63 // MP estimator 
    914.18   912.83 // betaP 
    914.18   912.83 // beta1(0.5)
    914.10   912.57 // beta1(0.00001)
    916.16   997.48 // BFA
    (1) global bit counts
    (2) local bit counts with window size 9
    So it appears possible to get a 1.5-2% skew just
    by using the best of these estimators, without any
    actual context model:
    Code:
    0 0 0.95083   742.7; 3071 of 5950 guessed (51.61%)
    1 0 1.01620   912.7; 3802 of 7315 guessed (51.98%)
    Will see how to make use of it.
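
    For reference, the first three rows of the table are standard
    count-based estimators; minimal sketches (function names are mine;
    the MP/beta/BFA variants are not reproduced here):
    Code:
    // n0, n1 = number of 0 and 1 bits seen so far (globally, or within
    // a small sliding window).  Each returns the estimated P(next bit = 0).
    double freq(int n0, int n1)    { return n1 > 0 ? double(n0)/(n0+n1) : 0.5; }
    double kt(int n0, int n1)      { return (n0 + 0.5) / (n0 + n1 + 1); } // Krichevsky-Trofimov
    double laplace(int n0, int n1) { return (n0 + 1.0) / (n0 + n1 + 2); } // Laplace's rule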

  9. #9
    Shelwien (Administrator)
    Seems like a good problem; it inspired some interesting experiments.

    http://ctxmodel.net/files/guess/bit_guess_v2.rar

    Code:
    v0 22-10-2009:
      1 1 0.99525   728.1; 3221 of 5950 guessed (54.13%)
    
    + Initial implementation
    + mix(mix(mix(p1,p2),mix(p3,p4)),0.5) based on mix_test
    + Parameter optimization
    
    v1 01-11-2009:
      1 1 0.75645   728.5; 3298 of 5950 guessed (55.43%)
      0 0 0.35058   898.5; 4009 of 7315 guessed (54.81%)
    
    + More input data included, model retuned
    + Dumb secondary model added (trying the prev bit if model estimation is near 0.5)
    
    v2 07-11-2009:
      0 0 0.77592   724.7; 3286 of 5950 guessed (55.23%)
      1 1 0.35727   891.6; 4023 of 7315 guessed (55.00%)
    or
      0 0 0.93299   729.2; 3376 of 5950 guessed (56.74%)
      1 1 0.71908   897.1; 4135 of 7315 guessed (56.53%)
    
    + Completely new model using FP math
     + new "beta counters" for contextual probability estimation 
     + new logistic mixer implementation
      + precise floating-point math, no tunable tables
      + BFA update
      + weight extrapolation
    + mix(v1P,mix(mix(mix(mix(mix(p1,0.5),mix(p2,0.5)),mix(p3,0.5)),p4),0.5))
      (v1 model mixed with the new one)
    + optimization by guess-rate instead of entropy
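
    For context, a textbook logistic mixer of the kind referenced above
    works like this (a generic paq-style sketch under my own naming, not
    Shelwien's implementation; the BFA update and weight extrapolation
    are not shown):
    Code:
    #include <cmath>
    #include <vector>

    // Probabilities are mixed in the "stretched" (logit) domain and the
    // weights are trained by online gradient descent on the coding loss.
    double stretch(double p) { return std::log(p / (1 - p)); }
    double squash(double x)  { return 1 / (1 + std::exp(-x)); }

    struct LogisticMixer {
      std::vector<double> w, st;   // weights and saved stretched inputs
      double lr;                   // learning rate (value is an assumption)
      LogisticMixer(int n, double lr = 0.02) : w(n, 0.0), st(n), lr(lr) {}

      double mix(const std::vector<double>& p) {  // p[i] = P(bit=1), model i
        double x = 0;
        for (size_t i = 0; i < w.size(); i++)
          { st[i] = stretch(p[i]); x += w[i] * st[i]; }
        return squash(x);
      }
      void update(double pmix, int bit) {         // after the real bit is known
        for (size_t i = 0; i < w.size(); i++)
          w[i] += lr * (bit - pmix) * st[i];      // gradient of log loss
      }
    };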

  10. #10
    Shelwien (Administrator)
    Found something interesting in the ctxmodel access log:
    http://wiki.asoiu.com/index.php/%D0%...D0%B5%D1%82%22
    (in Russian)

    The algorithm there is unrelated to what I did here, though.

  11. #11
    Matt Mahoney (Expert)
    So (from the Google translation) they predict bits using 16 context models, which are mixed with a tree of 2-input mixers.

  12. #12
    Shelwien (Administrator)
    Yeah, a tree of binary linear mixers, where the mixed probability is
    updated like a counter and that change is back-propagated into the
    mixer weight. It also says there are context models using the 1-7
    previous bits as context, an inverted order-0 model, and apparently
    some kind of secondary models using "bit density" as context.
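
    A guess at what such a mixer node might look like (my reading of the
    description above, not the wiki's actual code; "rate" is an assumed
    constant):
    Code:
    // Two-input linear mixer node: pm = w*pa + (1-w)*pb.  After the bit
    // is known, pm is nudged toward it like an adaptive counter, and the
    // change is back-propagated into the weight.
    struct LinearMixer2 {
      double w = 0.5;                      // mixing weight for input a

      double mix(double pa, double pb) const { return w*pa + (1-w)*pb; }

      void update(double pa, double pb, int bit, double rate = 0.05) {
        double pm  = mix(pa, pb);
        double pm2 = pm + rate * (bit - pm);     // counter-style update of pm
        double d   = pa - pb;                    // dpm/dw
        if (d > 1e-6 || d < -1e-6)
          w += (pm2 - pm) / d;                   // push the change into w
        if (w < 0) w = 0; else if (w > 1) w = 1; // keep the mix convex
      }
    };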

