Thread: XWRT (XML-WRT) - an efficient XML/HTML/text compressor

  1. #1
    Programmer
    Join Date
    May 2008
    Location
    PL
    Posts
    309
    Thanks
    68
    Thanked 173 Times in 64 Posts

    XWRT (XML-WRT) - an efficient XML/HTML/text compressor

    Do you remember XWRT, which is used in many PAQ versions as a text filter? After 8 years I still haven't found a better XML compressor.

    I cleaned up the sources so they compile successfully for 64-bit Windows and Linux under gcc 4+.
    Sources with Win64 binaries are available at https://github.com/inikep/XWRT/releases

    If you don't know XWRT, here is a short description:

    XWRT (XML-WRT) is an efficient XML/HTML compressor (it actually works well with all textual files).
    It transforms XML into a more compressible form and uses zlib (default), LZMA, PPMd, or lpaq6 as
    the back-end compressor. The idea is based on the well-known XML compressor XMill.
    In addition, XML-WRT creates a semi-dynamic dictionary and replaces frequently
    used words with shorter codes. Additional techniques improve the
    compression ratio:
    - the word alphabet can include start tags (like '<tag>'), URLs, and e-mail addresses
    - a special model for encoding numbers
    - the input XML file is split into containers
    - there are special containers for dates, times, pages and fractional numbers
    - end tags ('</tag>') are replaced with a single char
    - end tags + EOL symbols can also be replaced with a single char
    - a spaceless words model
    - very effective methods for preserving white space
    - quotes modeling ('="' and '">' replaced with a single char)
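To make the end-tag item above concrete, here is a minimal sketch of the idea and not XWRT's actual code. In well-nested XML the name of an end tag is implied by the most recently opened element, so the tag can be swapped for one marker byte and restored later from a stack. The names `kEndTagMarker`, `encodeEndTags` and `decodeEndTags` are hypothetical. The sketch assumes the marker byte never occurs in the input and that attribute values contain no '>'.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical marker byte standing in for "a single char"; the sketch
// assumes this byte never occurs in the input document.
constexpr char kEndTagMarker = '\x01';

// Element name of the tag spanning s[lt..gt] ('<' at lt, '>' at gt).
static std::string startTagName(const std::string& s, std::size_t lt, std::size_t gt) {
    assert(lt < gt && gt < s.size());
    std::size_t end = lt + 1;
    while (end < gt && s[end] != ' ' && s[end] != '\t' &&
           s[end] != '\n' && s[end] != '\r' && s[end] != '/') ++end;
    return s.substr(lt + 1, end - lt - 1);
}

// Shared by encoder and decoder so both keep identical stacks:
// push the name of every start tag that is not self-closing.
static void trackStartTag(const std::string& s, std::size_t lt, std::size_t gt,
                          std::vector<std::string>& open) {
    const char first = s[lt + 1];
    const bool isStart = first != '/' && first != '?' && first != '!';
    const bool selfClosing = s[gt - 1] == '/';
    if (isStart && !selfClosing) open.push_back(startTagName(s, lt, gt));
}

// Replaces each end tag that matches the innermost open element with the marker.
std::string encodeEndTags(const std::string& xml) {
    std::string out;
    std::vector<std::string> open;
    std::size_t i = 0;
    while (i < xml.size()) {
        if (xml[i] != '<') { out += xml[i++]; continue; }
        const std::size_t gt = xml.find('>', i);
        if (gt == std::string::npos) { out.append(xml, i, std::string::npos); break; }
        if (xml[i + 1] == '/' && !open.empty() &&
            xml.compare(i + 2, gt - i - 2, open.back()) == 0) {
            out += kEndTagMarker;            // end tag implied by the stack
            open.pop_back();
        } else {
            out.append(xml, i, gt - i + 1);  // copy any other tag verbatim
            trackStartTag(xml, i, gt, open);
        }
        i = gt + 1;
    }
    return out;
}

// Inverse transform: expands each marker back into the implied end tag.
std::string decodeEndTags(const std::string& enc) {
    std::string out;
    std::vector<std::string> open;
    std::size_t i = 0;
    while (i < enc.size()) {
        const char c = enc[i];
        if (c == kEndTagMarker && !open.empty()) {
            out += "</" + open.back() + ">";
            open.pop_back();
            ++i;
        } else if (c == '<') {
            const std::size_t gt = enc.find('>', i);
            if (gt == std::string::npos) { out.append(enc, i, std::string::npos); break; }
            out.append(enc, i, gt - i + 1);
            trackStartTag(enc, i, gt, open);
            i = gt + 1;
        } else {
            out += c;
            ++i;
        }
    }
    return out;
}
```

Calling `decodeEndTags(encodeEndTags(xml))` should give back the input under the assumptions stated above. End tags that do not match the innermost open element are simply left untouched by this sketch.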
    Last edited by inikep; 6th November 2015 at 16:02.

  2. Thanks (4):

    avitar (6th November 2015),Bulat Ziganshin (6th November 2015),Cyan (6th November 2015),Matt Mahoney (6th November 2015)

  3. #2
    Member
    Join Date
    Sep 2011
    Location
    uk
    Posts
    238
    Thanks
    188
    Thanked 17 Times in 12 Posts
    Thanks, can you please add a win32 binary to the .zip? TIA, John

  4. #3
    Member m^2's Avatar
    Join Date
    Sep 2008
    Location
    Ślůnsk, PL
    Posts
    1,610
    Thanks
    30
    Thanked 65 Times in 47 Posts
    Do you intend to update the list of encoders included?
    Also, at a quick glance I don't see the license...
    Last edited by m^2; 6th November 2015 at 16:55.

  5. #4
    Member
    Join Date
    Sep 2011
    Location
    uk
    Posts
    238
    Thanks
    188
    Thanked 17 Times in 12 Posts
    Thanks, looks good!

    1) Tried it against arc, which I normally use for such files, on a 2.7 MB file containing mostly XML plus some other text. Win 7 64-bit, AMD 4300, 4 cores.

    command     size    time
    arc -mx     68k     3s
    arc         121k    0.3s
    xwrt        103k    0.3s
    xwrt -l5    50k     0.5s   (-l6 and -l4 similar)
    Nothing appears to be any better! 50k from 2.7M is very good!

    2) So with -l5 compression 20% better than arc -mx and faster! Is this what'd be expected?

    3) I'd like to try it on win32 (XP) too, if a binary can be provided in the .zip.

    4) Did I miss something in the parameters? I'd like a command such as 'xwrt -o ans.dat' to generate a compressed file named ans.xwrt (or similar) directly, as most compressors do. At present it generates ans.dat.xwrt!

    john
    Last edited by avitar; 6th November 2015 at 17:23. Reason: tidy up

  6. #5
    Programmer
    Join Date
    May 2008
    Location
    PL
    Posts
    309
    Thanks
    68
    Thanked 173 Times in 64 Posts
    I added 32-bit xwrt32.exe to https://github.com/inikep/XWRT/relea.../XWRT34bin.zip

    Quote Originally Posted by avitar
    So with -l5 better than arc -mx and faster! Is this what'd be expected?
    XWRT is a filter for XML files that makes them more compressible.
    XWRT should improve the results of every compressor that has no special model for XML files. I think only various versions of PAQ include an XML model (based on XWRT).
    You can try XWRT with options -0 -1 -2 -3 and compress the output file with FreeArc.

    Quote Originally Posted by m^2
    Do you intend to update the list of encoders included?
    Also, at quick glance I don't see the license.
    It is possible. I added 3-clause BSD LICENSE to "src/"
    Last edited by inikep; 6th November 2015 at 19:38.

  7. Thanks (2):

    avitar (6th November 2015),m^2 (6th November 2015)

  8. #6
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,569
    Thanks
    777
    Thanked 687 Times in 372 Posts
    avitar, FreeArc contains its own, more universal, dictionary preprocessor. As long as you compress English texts, and especially XML, xml-wrt should be better. For the best result, try xml-wrt + bsc.

  9. #7
    Member
    Join Date
    Sep 2011
    Location
    uk
    Posts
    238
    Thanks
    188
    Thanked 17 Times in 12 Posts
    Bulat, I googled xml-wrt and found a web site, but the download there is broken. j

  10. #8
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,569
    Thanks
    777
    Thanked 687 Times in 372 Posts
    xml-wrt aka xwrt is the subject of this topic!

  11. #9
    Member
    Join Date
    Sep 2011
    Location
    uk
    Posts
    238
    Thanks
    188
    Thanked 17 Times in 12 Posts
    Dohhhhhh, so what did you mean by 'try xml-wrt+bsc'? What should I try exactly? j

  12. #10
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,569
    Thanks
    777
    Thanked 687 Times in 372 Posts

  13. #11
    Member
    Join Date
    Nov 2015
    Location
    Ślůnsk, PL
    Posts
    81
    Thanks
    9
    Thanked 13 Times in 11 Posts
    Code:
    katmacadapc% make
    g++ -Wno-unknown-pragmas -Wno-sign-compare -Wno-conversion  -fomit-frame-pointer -fstrict-aliasing -fforce-addr -ffast-math -O3 -D_POSIX_ENVIRONMENT_ -DZ_HAVE_UNISTD_H -D__x86_64__ -I. LZMA/Common/C_FileIO.cpp -c -o LZMA/Common/C_FileIO.o
    g++ -Wno-unknown-pragmas -Wno-sign-compare -Wno-conversion  -fomit-frame-pointer -fstrict-aliasing -fforce-addr -ffast-math -O3 -D_POSIX_ENVIRONMENT_ -DZ_HAVE_UNISTD_H -D__x86_64__ -I. LZMA/LZMAlib.cpp -c -o LZMA/LZMAlib.o
    LZMA/LZMAlib.cpp:18:39: fatal error: ./7zip/Common/FileStreams.h: No such file or directory
    compilation terminated.
    make: *** [LZMA/LZMAlib.o] Error 1
    That's because in your repo there's a directory '7ZIP' instead of '7zip'. However, the makefile uses the uppercase variant too.

  14. #12
    Member
    Join Date
    Nov 2015
    Location
    Ślůnsk, PL
    Posts
    81
    Thanks
    9
    Thanked 13 Times in 11 Posts
    Another set of bugs:
    Code:
    ../afl-clang-fast++ -DFORTIFY_SOURCE=2 -fstack-protector-all -fsanitize=undefined,address -fsanitize-trap=array-bounds,bool,enum,float-cast-overflow,float-divide-by-zero,function,integer-divide-by-zero,nonnull-attribute,null,object-size,returns-nonnull-attribute,shift-base,shift-exponent,signed-integer-overflow,vla-bound -fno-sanitize=alignment -Wno-unknown-pragmas -Wno-sign-compare -Wno-conversion  -fomit-frame-pointer -fstrict-aliasing -fforce-addr -ffast-math -O3 -D_POSIX_ENVIRONMENT_ -DZ_HAVE_UNISTD_H -D__x86_64__ -I. src/Encoder.cpp -c -o src/Encoder.o
    afl-clang-fast 1.94b by <lszekeres@google.com>
    src/Encoder.cpp:2891:2: warning: 'delete' applied to a pointer that was allocated with 'new[]'; did you mean 'delete[]'? [-Wmismatched-new-delete]
            delete(inttable);
            ^
                  []
    src/Encoder.cpp:2859:16: note: allocated with 'new[]' here
            int* inttable=new int[size];
    Code:
    ../afl-clang-fast++ -DFORTIFY_SOURCE=2 -fstack-protector-all -fsanitize=undefined,address -fsanitize-trap=array-bounds,bool,enum,float-cast-overflow,float-divide-by-zero,function,integer-divide-by-zero,nonnull-attribute,null,object-size,returns-nonnull-attribute,shift-base,shift-exponent,signed-integer-overflow,vla-bound -fno-sanitize=alignment -Wno-unknown-pragmas -Wno-sign-compare -Wno-conversion  -fomit-frame-pointer -fstrict-aliasing -fforce-addr -ffast-math -O3 -D_POSIX_ENVIRONMENT_ -DZ_HAVE_UNISTD_H -D__x86_64__ -I. src/MemBuffer.cpp -c -o src/MemBuffer.o
    afl-clang-fast 1.94b by <lszekeres@google.com>
    src/MemBuffer.cpp:254:7: error: assigning to 'size_t *' (aka 'unsigned long *') from incompatible type 'unsigned int *'
                    Size=new unsigned size_t[count];
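Both diagnostics above point to small fixes. The following is only a sketch of what they might look like, not the actual patch in the XWRT repository. The function names are hypothetical, and it assumes the surrounding code just needs a plain heap array. Memory obtained with `new[]` has to be released with `delete[]`, and `size_t` is already an unsigned type, so the extra `unsigned` keyword can simply be dropped.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the Encoder.cpp case: an array allocated
// with new[] must be released with delete[], not delete.
long sumSquaresRaw(std::size_t size) {
    int* inttable = new int[size];
    long total = 0;
    for (std::size_t i = 0; i < size; ++i) {
        inttable[i] = static_cast<int>(i * i);
        total += inttable[i];
    }
    delete[] inttable;  // array form matches the new[] above
    return total;
}

// Hypothetical stand-in for the MemBuffer.cpp case: size_t is already
// unsigned, so "new size_t[count]" yields the size_t* that is expected.
std::size_t* makeSizeArray(std::size_t count) {
    std::size_t* sizes = new std::size_t[count];
    for (std::size_t i = 0; i < count; ++i) sizes[i] = 0;
    return sizes;  // caller releases it with delete[]
}

// Alternative that avoids manual new[]/delete[] pairing altogether.
std::vector<std::size_t> makeSizeVector(std::size_t count) {
    return std::vector<std::size_t>(count, 0);
}
```

Where the code base allows it, the `std::vector` variant sidesteps this whole class of mismatch, since no explicit `delete[]` is needed.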
    Last edited by m^3; 17th November 2015 at 16:24.

  15. #13
    Programmer
    Join Date
    May 2008
    Location
    PL
    Posts
    309
    Thanks
    68
    Thanked 173 Times in 64 Posts
    Thanks for reporting. The bugs are fixed at: https://github.com/inikep/XWRT/tree/master

    Please use GitHub to report bugs: https://github.com/inikep/XWRT/issues

  16. #14
    Member m^2's Avatar
    Join Date
    Sep 2008
    Location
    Ślůnsk, PL
    Posts
    1,610
    Thanks
    30
    Thanked 65 Times in 47 Posts
    I'm not a fan of github and if possible, I'd rather continue to use this site. Or switch to email.
    Last edited by m^2; 17th November 2015 at 19:07.

  17. #15
    Programmer
    Join Date
    May 2008
    Location
    PL
    Posts
    309
    Thanks
    68
    Thanked 173 Times in 64 Posts
    I started using GitHub a month ago and I like it. But no problem, you can use encode.su or my e-mail.
