
Thread: Text Detection

  1. #1
    Member
    Join Date
    Oct 2007
    Location
    Germany, Hamburg
    Posts
    408
    Thanks
    0
    Thanked 5 Times in 5 Posts

    Text Detection

    There are problems with the text detection in paq8p3 and paq8q. I tried to create a new detection concept but ran into some barriers.

    First idea
    Initially I had the idea to handle only ASCII text and to start detection when the bytes are mostly in the range 32 to 127. Then continue with simple statistics: at least every x-th character has to be a SPACE, and a percentage y (maybe 80%) of the bytes has to be between 32 and 127. If any character below 32 appears, that range is nowhere in use in plain text, so detection ends.
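    As a rough sketch, this first idea could look like the following (the thresholds x and y are hypothetical placeholders; this is only an illustration, not paq code):
    Code:
    #include <cstddef>

    // Strict ASCII-only check following the "first idea" above.
    bool ascii_text_check(const unsigned char* buf, size_t n,
                          size_t x = 64, double y = 0.80) {
        size_t in_range = 0, since_space = 0;
        for (size_t i = 0; i < n; i++) {
            unsigned char b = buf[i];
            if (b < 32) return false;                 // any control byte ends detection
            if (b < 128) in_range++;
            if (b == ' ') since_space = 0;
            else if (++since_space > x) return false; // no SPACE within x bytes
        }
        return n > 0 && double(in_range) / double(n) >= y; // y% must be in 32..127
    }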

    Too limited
    But looking at some Czech examples, there is Unicode in use too. I don't have much experience with Unicode yet, so I saved a sample and saw that it is stored in the following form:
    Code:
    Dih r[2b UTF]ch . Minizky[2b UTF]t
    Both words could mostly be saved as ASCII, but they contain some characters which can't be.

    The first of the Unicode bytes could be a marker for the language, so how is this properly done? How does a text viewer work? If a byte isn't a symbol between 32 and 127, does it test whether it's UTF and otherwise display the ASCII symbol?

  2. #2
    Programmer osmanturan's Avatar
    Join Date
    May 2008
    Location
    Mersin, Turkiye
    Posts
    651
    Thanks
    0
    Thanked 0 Times in 0 Posts
    I have some plans for BIT (not thought through precisely yet), so let me share.
    First of all, we can't limit text detection to printable Latin characters (a-z, A-Z, 0-9 etc.). Instead, we should model text together with spaces and the other whitespace (TAB, CR, LF, punctuation etc.). This should cope with local code pages that are saved in the ASCII range. But this solution is not enough on its own; we should also run Unicode, multibyte, etc. models simultaneously. For more precise modeling, the models should be word models instead of n-gram character models. During analysis, the total entropy should be computed from each of those models, and in the end the entropies should be compared for the final conclusion. Of course, these ideas can be improved.
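    For the entropy part, a minimal sketch (my illustration only, nothing from BIT): order-0 entropy of a block; the real analysis would compute entropies under the word and multibyte models and keep the interpretation with the lowest one.
    Code:
    #include <cmath>
    #include <cstddef>

    // Order-0 entropy in bits per byte; a stand-in for the per-model
    // entropies that would actually be compared.
    double order0_entropy(const unsigned char* buf, size_t n) {
        if (n == 0) return 0.0;
        size_t count[256] = {0};
        for (size_t i = 0; i < n; i++) count[buf[i]]++;
        double h = 0.0;
        for (int c = 0; c < 256; c++) {
            if (count[c] == 0) continue;
            double p = double(count[c]) / double(n);
            h -= p * std::log2(p);
        }
        return h;
    }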

    Edit: Unicode is a really generic term, so you should talk about lower levels. The most commonly used variant is UTF-8, which is the Unicode encoding most compatible with the ASCII range. UTF-8 is a variable-length coding which can store a single character in several bytes. Please look at the UTF-8 article on Wikipedia for details about the coding.
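    The lead byte already tells the length of a sequence, so a classifier is tiny (a sketch derived from the coding rules in that article):
    Code:
    // Returns the sequence length (1-4) implied by a UTF-8 lead byte,
    // or 0 for a continuation or invalid byte.
    int utf8_seq_len(unsigned char b) {
        if (b < 0x80) return 1;            // 0xxxxxxx: plain ASCII
        if ((b & 0xE0) == 0xC0) return 2;  // 110xxxxx: 2-byte sequence
        if ((b & 0xF0) == 0xE0) return 3;  // 1110xxxx: 3-byte sequence
        if ((b & 0xF8) == 0xF0) return 4;  // 11110xxx: 4-byte sequence
        return 0;                          // 10xxxxxx: continuation byte
    }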

    Note: In Unicode there is no codepage information in the coded characters, because all commonly used characters (Latin, Chinese, Korean, Arabic etc.) share the single range 0x0000-0xFFFF.
    Last edited by osmanturan; 26th May 2009 at 19:29.
    BIT Archiver homepage: www.osmanturan.com

  3. #3
    Member
    Join Date
    Oct 2007
    Location
    Germany, Hamburg
    Posts
    408
    Thanks
    0
    Thanked 5 Times in 5 Posts
    In the paq8q thread there is a test of the idea, including UTF-8.

    It's almost what I described above. It first checks whether a byte is a normal ASCII character (32-127), and otherwise whether it is the start of a UTF-8 sequence. If so, the UTF-8 sequence is checked through to its end. If an unexpected byte shows up, it tries to salvage valid ASCII values out of it. For bytes over 127 that don't start a UTF-8 sequence, a counter is incremented which allows 3 such characters in a row. Any byte below 32 that is no line end or similar code always ends the detection.

    This can be further improved by ruling out disallowed UTF-8 start bytes and by requiring a SPACE, line end, TAB etc. within every x bytes. But that's dangerous for XML, HTML and other source code.
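    Roughly, my reading of that scheme in code (only a sketch, not the actual paq8q detector; the 3-in-a-row limit and the whitespace set are as described above):
    Code:
    #include <cstddef>

    // Sequence length for a multibyte lead byte (0 if not a valid lead byte).
    static int seq_len(unsigned char b) {
        if ((b & 0xE0) == 0xC0) return 2;
        if ((b & 0xF0) == 0xE0) return 3;
        if ((b & 0xF8) == 0xF0) return 4;
        return 0;
    }

    bool looks_like_text(const unsigned char* buf, size_t n) {
        int bad_run = 0; // high bytes that start no valid UTF-8 sequence
        for (size_t i = 0; i < n; ) {
            unsigned char b = buf[i];
            if ((b >= 32 && b < 128) || b == '\n' || b == '\r' || b == '\t') {
                bad_run = 0; i++; continue;   // normal ASCII, line end or TAB
            }
            if (b < 32) return false;         // other control code: detection ends
            size_t len = seq_len(b);
            if (len) {
                size_t j = 1;                 // check the continuation bytes
                while (j < len && i + j < n && (buf[i + j] & 0xC0) == 0x80) j++;
                if (j == len) { bad_run = 0; i += len; continue; } // valid UTF-8
                i += j;                       // broken sequence: treat as bad
            } else {
                i++;                          // byte over 127, no UTF-8 start
            }
            if (++bad_run > 3) return false;  // more than 3 bad bytes in a row
        }
        return true;
    }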

  4. #4
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,348
    Thanks
    212
    Thanked 1,012 Times in 537 Posts
    1. I think that the actual task is not precise text detection, but detection of something that paq's text model would compress. Thus the best way would be to try applying the said text model and discard its output if compression is bad. That way, the decoding might still be faster (asymmetric).

    2. I'd also detect text by space distances and alphabet, but not in a strict way like you describe. So I'd collect statistics on symbol occurrences and on distances between spaces, and then decide by the probability of the printable range and by reasonable word lengths (see the sketch at the end of this post).

    3. As to utf8, it's fairly strict and can easily be detected. But it's not very reasonable to detect utf8 as text if paq won't properly compress it (word model etc). One possibility might be recoding utf8 to some similar scheme using only the latin alphabet.

    4. For compression of real unicode (utf16), a good model has to be at least 16-bit aligned, and imho there's no such thing in paq?
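    A minimal sketch of point 2 (all thresholds here are made-up placeholders for the actual probability decision):
    Code:
    #include <cstddef>

    bool statistical_text_check(const unsigned char* buf, size_t n) {
        if (n == 0) return false;
        size_t printable = 0, spaces = 0, gap = 0, long_gaps = 0;
        for (size_t i = 0; i < n; i++) {
            unsigned char b = buf[i];
            if ((b >= 32 && b < 127) || b == '\n' || b == '\r' || b == '\t')
                printable++;                                   // symbol occurrences
            if (b == ' ' || b == '\n') { spaces++; gap = 0; }  // space distances
            else if (++gap > 32) { long_gaps++; gap = 0; }     // implausibly long "word"
        }
        double printable_ratio = double(printable) / double(n);
        double avg_word_len = spaces ? double(n) / double(spaces) : double(n);
        // Placeholder thresholds standing in for the probability decision:
        return printable_ratio > 0.90 && avg_word_len < 16.0 && long_gaps < n / 256 + 1;
    }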

  5. #5
    Member
    Join Date
    Oct 2007
    Location
    Germany, Hamburg
    Posts
    408
    Thanks
    0
    Thanked 5 Times in 5 Posts
    Yes, these are good points. I'm only going the same way as kaitz did; I don't know how the models in paq work. Separating ASCII and UTF, or splitting UTF into 2-, 3-, 4-byte sequences, would be really easy.

    What I wrote above was only a simple form. If there are almost only normal characters, there surely shouldn't be a break in text detection. But if there are many special characters and no space for many bytes, there should be.

  6. #6
    Member
    Join Date
    Jun 2008
    Location
    USA
    Posts
    111
    Thanks
    0
    Thanked 0 Times in 0 Posts
    For the record, I'm almost a noob at languages, Unicode, etc. But here's my two cents ...

    If you want to compress UTF-8, first make sure it's what you think it is (e.g. no BOM; ignore the fact that MS Notepad does it wrong). Modern NT-based Windows is built upon UTF-16, and yet *nix prefers UTF-8. The most common (e.g. European) languages need at most two bytes per char in UTF-8, then come three bytes for some lesser-used scripts (Hindi? Chinese?), and then four bytes for the really obscure stuff. If you want to convert it to 8-bit, you can either use old DOS codepage tables (e.g. from iconv or Emacs 23), or, if an ASCII representation is better, you can probably just use plain ASCII mnemonics like those found in Mined (e.g. "c>" for U+0109 or "u:" for "latin small letter u with diaeresis").

  7. #7
    Programmer osmanturan's Avatar
    Join Date
    May 2008
    Location
    Mersin, Turkiye
    Posts
    651
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Quote Originally Posted by Shelwien View Post
    4. For compression of real unicode (utf16) a good model has
    to be at least aligned by 16bit, and imho there's no such thing in paq?
    Actually, UTF-16 is a variable-length coding too, like UTF-8. But it's 16-bit aligned, and characters mostly lie in the "Basic Multilingual Plane" range (0x0000-0xFFFF). So there wouldn't be a problem most of the time.
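    For completeness: the variable-length part is confined to the surrogate range, so a 16-bit aligned model only has to special-case these code units (sketch):
    Code:
    #include <cstdint>

    // Code units in 0xD800-0xDFFF are surrogate halves; everything else is
    // a single-unit BMP character (the common case).
    bool is_high_surrogate(uint16_t u) { return (u & 0xFC00) == 0xD800; }
    bool is_low_surrogate(uint16_t u)  { return (u & 0xFC00) == 0xDC00; }
    // A character takes two code units only when a high surrogate is
    // immediately followed by a low one.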
    BIT Archiver homepage: www.osmanturan.com

  8. #8
    Member
    Join Date
    Jun 2008
    Location
    USA
    Posts
    111
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Quote Originally Posted by Simon Berger View Post
    In the paq8q thread there is a test of the idea, including UTF-8.
    I did a wimpy test, and it doesn't seem to help. (At least, -m2 and -m5 only made it worse.) Besides, the whole idea of Unicode is multiple languages side by side, which may be difficult to detect specifically. If my (quite small) test files were better, I'd say more, but as is, I think you need some better "real world" tests than these (not attached, too stupid). But feel free to request them anyway.

    But that's dangerous for XML, HTML and other source code.
    XML and HTML 4.0 are UTF-8 by default, right? And HTML 3.2 or such was Latin-1. (Corrections welcome.) So it can't be too hard to detect, since we already half know what to guess.
    Last edited by Rugxulo; 27th May 2009 at 09:52.

  9. #9
    Member
    Join Date
    Oct 2007
    Location
    Germany, Hamburg
    Posts
    408
    Thanks
    0
    Thanked 5 Times in 5 Posts
    There are two points:

    1) Is the detection working?

    The detection is only a further step beyond paq8p3 and tries to do the same, only better.

    2) Is the detection making sense?

    That's something completely different and not what I paid attention to. Surely it's important for the detection, but I didn't think of it.

    I'm interested in your test. Do you mean the detection didn't work, or that the compressed result wasn't good?

  10. #10
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,505
    Thanks
    741
    Thanked 665 Times in 359 Posts
    FreeArc code:
    Attached Files
    • File Type: cpp a.cpp (4.0 KB, 350 views)

  11. #11
    Member
    Join Date
    Jun 2008
    Location
    USA
    Posts
    111
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Quote Originally Posted by Simon Berger View Post
    There are two points

    1) Is the detection working?
    Yes (although not with -m6, as you previously mentioned).

    2) Is the detection making sense?
    That's something completely different and not what I paid attention to. Surely it's important for the detection, but I didn't think of it.
    Dunno, but it's complicated.

    I'm interested in your test.
    It was basically just the GPLv2 (unofficial Esperanto translation), slightly tweaked to have actual UTF-8 chars instead of &#284; etc. Also one wimpy (useless) text file I whipped up just for laughs (which is too stupid to share).

    Do you mean the detection didn't work, or that the compressed result wasn't good?
    The compressed result wasn't any better than without detection (default -m6). It seems even LPAQ8 compresses better so far. But don't let that discourage you!

  12. #12
    Member
    Join Date
    Oct 2007
    Location
    Germany, Hamburg
    Posts
    408
    Thanks
    0
    Thanked 5 Times in 5 Posts
    Ah, OK! You have to know that the current detection is only there to speed compression up by a huge amount without losing much compression ratio. Maybe later there will be some more specific text models.
    LPAQ8 recently got many improvements in text compression, so it could be that it beats PAQ there.

    I copied all the content of the page and put it into a text file. Compression works great:

    Code:
    paq8q_sse2b_intel.exe -5 -m6 out utf.txt
    Creating archive out.paq8q with 1 file(s)...
    ascii.txt 24036 -> (24036 bytes) 6403
    24036 -> 6432
    Time 2.42 sec, used 228695918 bytes of memory
    
    paq8q_sse2b_intel.exe -5 -m5 out utf.txt
    Creating archive out.paq8q with 1 file(s)...
    ascii.txt 24036 -> (3 bytes) TEXT (24033 bytes) 6418
    24036 -> 6447
    Time 1.34 sec, used 226496238 bytes of memory
    Last edited by Simon Berger; 27th May 2009 at 22:22.

  13. #13
    Member
    Join Date
    Jun 2008
    Location
    USA
    Posts
    111
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Quote Originally Posted by Simon Berger View Post
    AH ok! You have to know that the current detection is only to speed compression up by a huge amount without loss of much compression ratio.
    Ah okay, no, I didn't know, but it seems you've done very well then!

  14. #14
    Member
    Join Date
    Oct 2007
    Location
    Germany, Hamburg
    Posts
    408
    Thanks
    0
    Thanked 5 Times in 5 Posts
    I started another try at preprocessing UTF-8, following some points from Shelwien's post.
    I have a Chinese UTF-8 text which needs 3 UTF-8 bytes per symbol. 1 to 3 bytes of UTF-8 can easily be converted to one 16-bit value.
    I thought the word model was going to work very well here. Bad thought: it's a little worse.
    One reason I can think of is that the 10/1110... prefixes at the start of each UTF-8 byte can be compressed better on their own. On the other hand, the words are now aligned and the data is cut down by a third. Another reason could be that there aren't many repeated symbols, as far as I can see.
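    The conversion step as a sketch (illustration only, not my actual code; it handles exactly the 1-3 byte case mentioned and does not validate continuation bytes):
    Code:
    #include <cstddef>
    #include <cstdint>

    // Folds the 1-3 byte UTF-8 sequence at *i into one 16-bit value and
    // advances *i past it; returns 0xFFFD for input it can't handle.
    uint16_t utf8_to_u16(const unsigned char* buf, size_t n, size_t* i) {
        unsigned char b = buf[*i];
        if (b < 0x80) { (*i)++; return b; }          // 1 byte: plain ASCII
        if ((b & 0xE0) == 0xC0 && *i + 1 < n) {      // 2 bytes: 110xxxxx 10xxxxxx
            uint16_t cp = (uint16_t)((b & 0x1F) << 6 | (buf[*i + 1] & 0x3F));
            *i += 2; return cp;
        }
        if ((b & 0xF0) == 0xE0 && *i + 2 < n) {      // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
            uint16_t cp = (uint16_t)((b & 0x0F) << 12
                        | (buf[*i + 1] & 0x3F) << 6
                        | (buf[*i + 2] & 0x3F));
            *i += 3; return cp;
        }
        (*i)++; return 0xFFFD;                       // 4-byte lead or invalid byte
    }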

  15. #15
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,348
    Thanks
    212
    Thanked 1,012 Times in 537 Posts
    I can suggest some more sample data:
    http://shelwien.googlepages.com/japlist.rar

    Well, it's a dictionary for password cracking originally, but it contains the same words in 3 different encodings, so I think it might be of some use.

  16. #16
    Member
    Join Date
    May 2008
    Location
    Estonia
    Posts
    403
    Thanks
    154
    Thanked 232 Times in 125 Posts
    UTF-8 and Unicode FAQ
    Testfiles etc.
    KZo

