Results 1 to 26 of 26

Thread: Just-In-Time Compilation Improvement For ZPAQ

  1. #1
    Programmer osmanturan's Avatar
    Join Date
    May 2008
    Location
    Mersin, Turkiye
    Posts
    651
    Thanks
    0
    Thanked 0 Times in 0 Posts

    Just-In-Time Compilation Improvement For ZPAQ

    Hi everyone,
    Because of keeping myself busy with my own job, I couldn't do anything for a long time in compression works
    After spending 1-2 days on computer, I have successfully written a x86 JIT for ZPAQL. It seems Matt is a completely x86 guy Because, implementation was fairly straight forward (especially having A, B, C, D registers). But, it's still incomplete. Current implementation covers 89 of 215 instructions (~41%). Most of arithmetic commands and branching commands are missing. But, it can run both fast.cfg and mid.cfg without any errors (well, so far ). Speed improvement isn't that much. Because, most of time spent on argument passing to native procedure. This effect especially can be seen on small configuration files. Long configuration files' improvement much better. This effect can be eliminated by detecting unused arguments. BTW, all generated assembly code printed by start of the program.

    Some people may offer LLVM or Google NaCl. But, they are just too bloated. I've spend over a week to figure out LLVM details. I have to admit that it's good, but it's bloated at the same time. AFAIR Google NaCl is also based on LLVM. So, we don't need to discuss it further. Even this small implementation has ~45 KiB source code (only JIT part). I couldn't imagine contribution to executable size by integrating LLVM into ZPAQ. But, it could be cool if someone make it.

    As to code security and sandboxing, I think, it's not really necessary. Because, by Matt's specification any program cannot reach any unrelated memory locations. All of memory accesses are circular basis on specific variables (no overflow at all).

    Anyway here is the key informations about the implementation:

    Advantages:
    - There is no branch at all for executing any instruction
    - Uses advanced addressing mode to reduce complexity as much as possible.
    - No local variables (well, seems I need a single local variable for F register while implementing branching).

    Disadvantages / Bottlenecks:
    - All arguments passed to native procedure each time
    - Assumes HALT command at the end of the program

    Planned:
    - Rewrite memory related XCHG opcode with equivalent MOV instructions to eliminate XCHG's implicit LOCK.
    - Implement a better visualizing for generated opcodes.
    - Cover remaining instructions.
    - Release the whole source code

    Test Results for ENWIK8 on my laptop (Intel i7 740QM 1.73 GHz 4x2 threads, 8GiB RAM):
    No code has to be inserted here.

    I've attached the zpaq (from Matt's site), UnzpaqOriginal (original with compiled VS 2010) and unzpaq (JIT version compiled with VS 2010 with same settings as UnzpaqOriginal's). If you want to test it, you can use attached batch files. Currently I didn't release the source. I'll release the whole source when I cover the all commands.

    Matt, could you point out some hints about this implementation? Because, I might missed some important thing. If I could realize earlier, finishing the implementation much easier than the fixing the finished implementation
    Attached Files Attached Files
    Last edited by osmanturan; 12th April 2011 at 12:40. Reason: Spelling
    BIT Archiver homepage: www.osmanturan.com

  2. #2
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 779 Times in 486 Posts
    Very interesting. Thanks for doing this. I tested enwik8 on a 2.0 GHz T3200 laptop:

    fast compress: 76.1 sec
    interpreted decompress: 129.3
    JIT decompress: 110.2

    mid compress: 231.2
    interpreted: 442.8
    JIT decompress: 414.1

    Also for comparison I tested fast.cfg with ZPAQ compiled with g++. I found earlier that g++ produces faster code than VS.

    fast compress: 69.9
    decompress: 77.9
    (optimization with compiled ZPAQL is built in for fast, mid, and max).

    Also, a lot of speedup is due to optimizing predict() and update(). These don't have as much code to interpret but they are executed every bit instead of every byte. They are probably harder to write too but you could generate a lot of the code automatically using

    zpaq kocfast archive file

    which generates zpaqopt.cpp, then convert that to assembler:

    g++ -O3 -fomit-frame-pointer -march=pentiumpro -S -masm=intel -DNDEBUG zpaqopt.cpp

    which creates zpaqopt.s. It is interesting to see different optimization techniques, for example, the HASH instruction uses:

    imul ecx, ecx, 773
    add ecx, 395776

    instead of add 512; imul 773. Not sure if it makes any difference. Also,

    lea ecx, [ebp-1]

    instead of mov ecx, ebp; dec ecx. Not sure how much faster this is. I do remember long ago doing tests and finding that add 1 and sub 1 are faster than inc and dec. This might also make it easier to write x86-64 because inc and dec are changed to 2 byte instructions so their opcodes can be used as REX prefixes.

    The generated code does start and finish by pushing and popping registers, but instead of push and pop instructions, it uses add esp and a bunch of movs.

    I guess what I need to do is make up some test cases that use all the different instructions and components.

  3. #3
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,478
    Thanks
    26
    Thanked 122 Times in 96 Posts
    Pentium is a superscalar processor with two pipelines, so compiler tries to group instructions in pairs that could be executed by processor simultaneously. Core i7 for example has at least four pipelines, most of them are different, eg. one pipeline handles any instructions, two handle simple instruction, etc. I've forgot many details, but you can refer to: http://agner.org/optimize/

  4. #4
    Programmer osmanturan's Avatar
    Join Date
    May 2008
    Location
    Mersin, Turkiye
    Posts
    651
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Yeah. I'm aware of some optimizations. Especially out-of-order execution. I have to admit that I didn't spend too much time on optimization (rather spent on "just-works"), but I've at least considered instructions cycles. Well, more specifically I've used Agner's instruction tables about latency and reciprocal values for Sandy Bridge. The generated code is more likely a procedure with couple of arguments. And it's executed as follow:

    Code:
    nativeCode->Run(&a, &b, &c, &d, &r[0], &h[0], h.size()-1, &m[0], m.size()-1);
    ...
    void Run(U32* a, U32* b, U32* c, U32* d, U32* r, U32* h, U32 hN, U8* m, U32 mN)
    {
        FunctionAddress fcn = reinterpret_cast<FunctionAddress>(this->_code);
        fcn(a, b, c, d, r, h, hN, m, mN);
    }
    As you can see, all states are passed as address through arguments. I know this is not good. Because, it introduces indirect addressing (as a result two instructions like mov eax, [ebp+X], mov eax, [eax] generated instead of one). For speed reasons, I've copied A, B, C, D into EAX, EBX, ECX, EDX respectively in the procedure and used them as native registers. So, that's the reason why you saw lot's of mov's introduced at the beginning and end of the program. I'll rewrite those with direct addressing (2x less instructions).

    Further optimizations (including predict() and update()) is possible with current framework. Because, I'm creating reusable opcode encoding procedures. Like that:
    Code:
    template <bool w>
    void EmitIncReg(RegX86::Enum reg)
    {
        VerboseCode("inc %s\n", RegX86::ToString<w>(reg));
    
        if (w)
            EmitByte(0x40 | reg); // alternate coding (short version)
        else
            EmitByte(0xFE), EmitByte(0xC0 | reg);
    
        FinishInstruction();
    }
    ...
    int pc = 0;
    while (pc < length)
    {
        switch (source[pc++])
        {
            case 0: error("Undefined instruction");
    	case 1: // A++ | ++a
    	    EmitIncReg<1>(RegX86::EAX);
    	    break;
            ...
        }
    ...
    }
    Currently, I've covered ~51% commands and I'm working on branching (I've found a simple way to rewrite actual addresses on second pass). Also, as a minor speed optimization, I've changed epilog with "leave" instruction. My short term goal is to capable of executing max.cfg (it includes some branches).

    Matt, I guess you might created a "stress cfg" file which includes all commands while developing ZPAQL. If that's true, I really need that file
    BIT Archiver homepage: www.osmanturan.com

  5. #5
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 779 Times in 486 Posts
    Quote Originally Posted by osmanturan View Post
    Matt, I guess you might created a "stress cfg" file which includes all commands while developing ZPAQL. If that's true, I really need that file
    Well, it's on my to-do list. What I actually did for testing was to look at the code very carefully

    Also, I might suggest passing a pointer to ZPAQL to your native code function instead of passing all the registers and arrays. Your code will depend on the ZPAQL layout but I don't plan to change it. This would also simplify a x86-64 port where the first 6 args are passed in registers.

  6. #6
    Programmer osmanturan's Avatar
    Join Date
    May 2008
    Location
    Mersin, Turkiye
    Posts
    651
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Well, in that case, I need some volunteers to test native code (after finishing especially branching). As to argument passing, I want to use direct addressing (i.e. mov eax, [0x0040f00]). I mean, native code will include real pointer bases (not references). So, this could be more faster I think. As to x64 opcodes, I think it's not a big deal after I've finished my job. BTW, if I'm not wrong, you have mentioned that "pc" can be changed on each run of the program. Could you give me more details about it?
    BIT Archiver homepage: www.osmanturan.com

  7. #7
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 779 Times in 486 Posts
    Yes, no reason why you can't use direct addressing for everything. Also for indexing H and M, you can AND with a constant figured out by your code generator.

    PC (the ZPAQL program counter) is reset to 0 for each call so execution starts at the beginning of the program. All of the other registers and memory locations are initialized to 0 and preserved between calls, except A which is the input byte.

  8. #8
    Programmer osmanturan's Avatar
    Join Date
    May 2008
    Location
    Mersin, Turkiye
    Posts
    651
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Quote Originally Posted by Matt Mahoney View Post
    Translation could fail (as it does now) if the code jumps into the middle of a 2 byte instruction, which is legal in ZPAQL and happens in max_enwik9.cfg. pzpaq will fall back to interpreting the code in this case. zpaq just gives an error and you have to manually omit the "o" prefix.
    Seems I misunderstood that comment.

    As to direct addressing, JIT will assume you won't call h.resize() or m.resize() during process, will you? Anyway, seems everything is more clear now. Thanks.
    BIT Archiver homepage: www.osmanturan.com

  9. #9
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 779 Times in 486 Posts
    The H and M arrays cannot be resized once modeling starts. (There are no ZPAQL instructions to do it). However sizes might be different in different blocks because blocks are independent. At block boundaries (but not segment boundaries), everything should be reset to 0.

    Attached is a simple test case that exercises all the legal ZPAQL instructions. If decompression is correct then you should get a 4 byte file ok.txt with contents "OK\r\n" and the checksum should verify. All of these should work:

    zpaq x test1.zpaq ok.txt
    zpaq ox test1.zpaq ok.txt
    pzpaq -dc test1.zpaq > ok.txt
    unzpaq x test1.zpaq ok.txt

    If any instruction is incorrectly executed then it will execute the ERROR instruction, but if your code treats this as a NOP then execution will fall through and the test may still incorrectly pass. I have included test1.cfg. You can also test as follows:

    zpaq prtest1 nul: ok.txt
    zpaq oprtest1 nul: ok.txt

    either of which should create the 4 byte file ok.txt. You can create the archive like this:

    zpaq ctest1 test1.zpaq ok.txt

    or

    zpaq octest1 test1.zpaq ok.txt

    All of the ZPAQL code is tested in the postprocessor. There is no actual modeling. The postprocessor ignores the decoded output and executes only when it sees EOF. Then it does a bunch of tests and if it gets through without executing an ERROR instruction then it outputs the 4 byte file.

    This test doesn't fully test the specification. It doesn't test predict(), update(), or the HCOMP section. There are also ways that instructions could be incorrectly implemented and the test would still pass. I plan to write some more tests.

    Edit: Also this test does not check for jumping into the middle of a 2 byte instruction. I assume it's OK to fail here and interpret the code instead.
    Attached Files Attached Files
    Last edited by Matt Mahoney; 14th April 2011 at 00:26.

  10. #10
    Programmer osmanturan's Avatar
    Join Date
    May 2008
    Location
    Mersin, Turkiye
    Posts
    651
    Thanks
    0
    Thanked 0 Times in 0 Posts
    I've rewritten memory accesses as direct access. Interestingly it's not working (yet). Using Data Segment prefix opcode (i.e. mov eax, dword ptr ds:[ebp+XXX]) sometimes causes access violations. And moreover decompression fails on other cases. Certainly I'm doing something wrong. Any ideas? Here is the output of fast.cfg:
    Code:
       0: push ebp
       1: mov ebp, esp
       3: sub esp, 4h
       9: push ebx
      10: push esi
      11: push edi
      13: mov eax, ds:[ebp+17fc10h]
      20: mov ebx, ds:[ebp+17fc14h]
      27: mov ecx, ds:[ebp+17fc18h]
      34: mov edx, ds:[ebp+17fc1ch]
      40: mov edi, ebx
      42: and edi, 3h
      49: mov ds:[edi+6d1d40h], al
      55: xor eax, eax
      57: xor edx, edx
      59: mov edi, ebx
      61: and edi, 3h
      68: movzx esi, ds:[edi+6d1d40h]
      75: add eax, esi
      77: add eax, 200h
      83: imul eax, eax, 305h
      89: dec ebx
      90: mov edi, ebx
      92: and edi, 3h
      99: movzx esi, ds:[edi+6d1d40h]
     106: add eax, esi
     108: add eax, 200h
     114: imul eax, eax, 305h
     120: mov edi, edx
     122: and edi, 1h
     129: mov ds:[ebp+4h*edi+6d1cc0h], eax
     136: inc edx
     137: dec ebx
     138: mov edi, ebx
     140: and edi, 3h
     147: movzx esi, ds:[edi+6d1d40h]
     154: add eax, esi
     156: add eax, 200h
     162: imul eax, eax, 305h
     168: dec ebx
     169: mov edi, ebx
     171: and edi, 3h
     178: movzx esi, ds:[edi+6d1d40h]
     185: add eax, esi
     187: add eax, 200h
     193: imul eax, eax, 305h
     199: mov edi, edx
     201: and edi, 1h
     208: mov ds:[ebp+4h*edi+6d1cc0h], eax
     216: mov ds:[ebp+17fc10h], eax
     223: mov ds:[ebp+17fc14h], ebx
     230: mov ds:[ebp+17fc18h], ecx
     237: mov ds:[ebp+17fc1ch], edx
     243: pop edi
     244: pop esi
     245: pop ebx
     246: leave
     247: ret
    Command/Opcode Address Mapping Table
    0 -> 40
    1 -> 55
    2 -> 57
    3 -> 59
    4 -> 89
    5 -> 90
    6 -> 120
    7 -> 136
    8 -> 137
    9 -> 138
    10 -> 168
    11 -> 169
    12 -> 199
    13 -> 215
    
    Total 14 commands are compiled into 58 native instructions (248 bytes code)
    I've carefully looked at the generated opcodes in debug mode to whether I'm encoding opcodes as they should be. Seems encoding is ok. So, only "the way" of direct access seems problematic.
    BIT Archiver homepage: www.osmanturan.com

  11. #11
    Programmer osmanturan's Avatar
    Join Date
    May 2008
    Location
    Mersin, Turkiye
    Posts
    651
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Edit: "Command/Opcode Address Mapping Table" is wrong currently. It doesn't take DS prefix opcode into account. So, just ignore it.
    Edit2: Sorry for double posting. I've misused buttons
    BIT Archiver homepage: www.osmanturan.com

  12. #12
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 779 Times in 486 Posts
    I wouldn't expect you to have to mess with segment registers. I think everything is mapped to the same address space.

    Anyway I created another test case. This one generates about 8000 lines of random ZPAQL postprocessor code which should exercise all of the instructions. It doesn't completely test all the jump instructions. I'll need to hand write some code to do that. I tried larger tests but I found when the code is converted to C++ and compiled with g++ that g++ runs out of memory (I have 3 GB). The compiler takes about a minute as it is. It is much faster interpreted.

    You use the program and zpaq to generate a test file, then zpaq to compress it and any decompresser to decompress it.

    Code:
    /* testgen.cpp - Generates a test ZPAQ config file.
    (C) 2011, Dell Inc. Written by Matt Mahoney.
    Licensed under GPL v3. http://www.gnu.org/licenses/gpl.html
    
    To use:
    
    g++ testgen.cpp -o testgen       (compile testgen)
    testgen > test.cfg               (create config file test.cfg)
    zpaq prtest nul: testdata        (creates a 3754 byte test file)
    zpaq ctest testdata              (compresses to testdata.zpaq)
    zpaq x testdata.zpaq out1        (out1 should match testdata)
    zpaq ox testdata.zpaq out2       (out2 should match testdata)
    pzpaq -dkvc testdata.zpaq > out3 (out3 should match testdata)
    unzpaq x testdata.zpaq out4      (out4 should match testdata)
    
    test.cfg creates a ZPAQ configuration file describing an indirect
    order 0 model and a postprocessor that ignores the decoded data
    except for EOF, and then generates output using 16603 mostly random
    bytes of ZPAQL code. It can only be used to compress testdata as
    created above by using zpaq to run the postprocessor in standalone
    mode. Once created, it can be compressed with zpaq and then
    decompressed with any zpaq compatible decompressor.
    
    The generated code is "mostly" random with the following exceptions.
    - No illegal instructions.
    - No ERROR instructions.
    - No backward jumps.
    - No jumps into the middle of a 2 or 3 byte instruction.
    - One out of 8 instructions is conditionally skipped using one of
      the 4 constructs JT, JF, JT 3 LJ, JF 3 LJ
    - All instructions that modify the A register are followed by OUT.
    - HALT only once at the end.
    - The beginning of the program checks for EOF before doing the rest.
    - JMP is not tested.
    */
    
    #include <stdio.h>
    #include <stdlib.h>
    
    // Opcodes from zpaq.cpp
    static const char* op[256]={
    "error","a++",  "a--",  "a!",   "a=0",  "",     "",     "a=r",
    "b<>a", "b++",  "b--",  "b!",   "b=0",  "",     "",     "b=r",
    "c<>a", "c++",  "c--",  "c!",   "c=0",  "",     "",     "c=r",
    "d<>a", "d++",  "d--",  "d!",   "d=0",  "",     "",     "d=r",
    "*b<>a","*b++", "*b--", "*b!",  "*b=0", "",     "",     "jt",
    "*c<>a","*c++", "*c--", "*c!",  "*c=0", "",     "",     "jf",
    "*d<>a","*d++", "*d--", "*d!",  "*d=0", "",     "",     "r=a",
    "halt", "out",  "",     "hash", "hashd","",     "",     "jmp",
    "a=a",  "a=b",  "a=c",  "a=d",  "a=*b", "a=*c", "a=*d", "a=",
    "b=a",  "b=b",  "b=c",  "b=d",  "b=*b", "b=*c", "b=*d", "b=",
    "c=a",  "c=b",  "c=c",  "c=d",  "c=*b", "c=*c", "c=*d", "c=",
    "d=a",  "d=b",  "d=c",  "d=d",  "d=*b", "d=*c", "d=*d", "d=",
    "*b=a", "*b=b", "*b=c", "*b=d", "*b=*b","*b=*c","*b=*d","*b=",
    "*c=a", "*c=b", "*c=c", "*c=d", "*c=*b","*c=*c","*c=*d","*c=",
    "*d=a", "*d=b", "*d=c", "*d=d", "*d=*b","*d=*c","*d=*d","*d=",
    "",     "",     "",     "",     "",     "",     "",     "",
    "a+=a", "a+=b", "a+=c", "a+=d", "a+=*b","a+=*c","a+=*d","a+=",
    "a-=a", "a-=b", "a-=c", "a-=d", "a-=*b","a-=*c","a-=*d","a-=",
    "a*=a", "a*=b", "a*=c", "a*=d", "a*=*b","a*=*c","a*=*d","a*=",
    "a/=a", "a/=b", "a/=c", "a/=d", "a/=*b","a/=*c","a/=*d","a/=",
    "a%=a", "a%=b", "a%=c", "a%=d", "a%=*b","a%=*c","a%=*d","a%=",
    "a&=a", "a&=b", "a&=c", "a&=d", "a&=*b","a&=*c","a&=*d","a&=",
    "a&~a", "a&~b", "a&~c", "a&~d", "a&~*b","a&~*c","a&~*d","a&~",
    "a|=a", "a|=b", "a|=c", "a|=d", "a|=*b","a|=*c","a|=*d","a|=",
    "a^=a", "a^=b", "a^=c", "a^=d", "a^=*b","a^=*c","a^=*d","a^=",
    "a<<=a","a<<=b","a<<=c","a<<=d","a<<=*b","a<<=*c","a<<=*d","a<<=",
    "a>>=a","a>>=b","a>>=c","a>>=d","a>>=*b","a>>=*c","a>>=*d","a>>=",
    "a==a", "a==b", "a==c", "a==d", "a==*b","a==*c","a==*d","a==",
    "a<a",  "a<b",  "a<c",  "a<d",  "a<*b", "a<*c", "a<*d", "a<",
    "a>a",  "a>b",  "a>c",  "a>d",  "a>*b", "a>*c", "a>*d", "a>",
    "",     "",     "",     "",     "",     "",     "",     "",
    "",     "",     "",     "",     "",     "",     "",     "lj"};
    
    static const char* cond[4]={"if", "ifnot", "ifl", "ifnotl"};
    
    // Random byte based on RC4 with no key
    int rnd() {
      static int t=-1;
      static unsigned char i=0, j=0, s[256];
      if (t<0) do s[i]=i; while (++i);
      j+=s[++i];
      t=s[i];
      s[i]=s[j];
      s[j]=t;
      return s[(s[i]+t)&255];
    }
    
    // Create test.cfg
    int main() {
      printf(
        "(test.cfg)\n"
        "comp 8 8 9 10 1\n"
        "  0 icm 5\n"
        "hcomp\n"
        "  halt\n"
        "pcomp copy ;\n"
        " a> 255 ifnot halt endif\n");
      for (int i=0; i<10000; ++i) {
        int r=rnd();
        int cr=rnd()&31;
        if (op[r][0] && r!=0 && op[r][0]!='j' && r!=56 && r!=255) {
          if (cr<4) printf(" %s", cond[cr]);
          printf(" %s", op[r]);
          if (r%8==7) printf(" %d", rnd());
          if (r<=4 || r==59 || r>=64 && r<=71 || r>=128 && r<=215)
            printf(" out");
          if (cr<4) printf(" endif");
          printf("\n");
        }
      }
      printf(" halt\n"
        "end\n");
      return 0;
    }
    Under Linux, change nul: to /dev/null and change copy to cp in the .cfg file.

  13. #13
    Member
    Join Date
    May 2008
    Location
    Germany
    Posts
    410
    Thanks
    37
    Thanked 60 Times in 37 Posts

    Dell ?

    (C) 2011, Dell Inc. Written by Matt Mahoney.
    Licensed under GPL v3. http://www.gnu.org/licenses/gpl.html
    ---
    @Matt Mahoney:

    - are you working for DELL now ?
    - you are working for the same firm as inikep ?

    if so, then congratulation ! (where is "Ocarina Networks, Inc." ?)

    best regards

  14. #14
    Programmer osmanturan's Avatar
    Join Date
    May 2008
    Location
    Mersin, Turkiye
    Posts
    651
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Well, I have to use DS segment. Because, program data (static allocated variables) and heap lies there. There is no problem at working with ZPAQL machine registers (a, b, c, d) which is static allocated (i.e. class X { int a, int b...}). I can read and write perfectly. But, dynamically allocated space (i.e. x=malloc(...)) seems problematic. It sometimes raises access violation when I try to read or write it. So, I can't address "h" and "m" currently.

    BTW, thanks for your effort to make a "real" stress test
    BIT Archiver homepage: www.osmanturan.com

  15. #15
    Member m^2's Avatar
    Join Date
    Sep 2008
    Location
    Ślůnsk, PL
    Posts
    1,611
    Thanks
    30
    Thanked 65 Times in 47 Posts
    Quote Originally Posted by joerg View Post
    (C) 2011, Dell Inc. Written by Matt Mahoney.
    Licensed under GPL v3. http://www.gnu.org/licenses/gpl.html
    ---
    @Matt Mahoney:

    - are you working for DELL now ?
    - you are working for the same firm as inikep ?

    if so, then congratulation ! (where is "Ocarina Networks, Inc." ?)

    best regards
    Dell bought Ocarina.

  16. #16
    Member
    Join Date
    May 2008
    Location
    Germany
    Posts
    410
    Thanks
    37
    Thanked 60 Times in 37 Posts
    @m^2: thank you for the explanation

    more info about this:
    http://content.dell.com/us/en/corp/d...quisition.aspx

  17. #17
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 779 Times in 486 Posts
    Yes, so I haven't really changed jobs since I still work for the same people, really. They are supportive of my work with ZPAQ and keeping it open source.

    I need to look at the memory access problem. I earlier did experiments where I could write code to the stack, heap, and static memory and execute it in 32 bit Vista, but this seems to be a different problem. I wonder if putting the code in the heap would solve it. Where are you putting it now?

  18. #18
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,594
    Thanks
    251
    Thanked 1,135 Times in 624 Posts
    http://stackoverflow.com/questions/4...912662#4912662

    Basically, sometimes writing to memory and executing it works, but its not universal.
    Sometimes (like on x64) its necessary to set the executable attribute on corresponding memory pages via VirtualProtect
    (or just allocate some executable pages).

    Also note that instruction codes for x86 and x64 are similar, but not really the same,
    x64 version would have to generate different code.

  19. #19
    Programmer osmanturan's Avatar
    Join Date
    May 2008
    Location
    Mersin, Turkiye
    Posts
    651
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Matt, I've seems solved my problem. It's about actually wrong interpreting of mod r/m byte values. I'm changing mod r/m encoding. Thanks for Shelwien who pointed out it.

    As to placing code into heap is not a way. It basically throws exception. It's kind of Vista/Windows7 protection. So, even initial release uses VirtualAlloc with executable permission. Anyway, I may post a new release tonight if I won't go to bed earlier
    BIT Archiver homepage: www.osmanturan.com

  20. #20
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 779 Times in 486 Posts
    I ran some more tests with a modification of Shelwien's program from http://codepad.org/sQoF6kR8
    In the version below the program generates x86 code that multiplies 3 numbers from the stack, heap, and static memory, then puts 3 copies of the result into each of the 3 types of memory. I tried putting the executable code into each of the 3 types (commented out lines).

    Code:
    #include <stdio.h>
    #include <stdlib.h>
    
    typedef unsigned char byte;
    
    typedef void (*pfunc)(void);
    
    union funcptr {
      pfunc x;
      byte* y;
    };
    
    int main( void ) {
    
    // Uncomment one of the 3 lines below
    //  byte buf[1<<16]={0};                 // g++ -O crash, g++ bcc dm OK, VC++ DEP error
    //  static byte buf[1<<16];              // g++, bcc, dm OK, dm -o crash, VC++ DEP
    //  byte* buf = (byte*)calloc(1, 1<<16); // g++, bcc, dm OK, g++ -O bad, VC++ DEP
    
      if (!buf) return 1;  // calloc failed?
      byte* p = buf;
    
      int arg1=0, res1=0;
      static int arg2=0, res2=0;
      int &arg3=(int&)p[1000], &res3=(int&)p[1004];
    
      *p++ = 0x60; // pusha
    
      *p++ = 0xA1; // mov eax, [arg1]
      (int*&)p[0] = &arg1; p+=sizeof(int);
    
      // 0FAF05 imul eax,[arg2]
      *p++ = 0x0F;
      *p++ = 0xAF;
      *p++ = 0x05;
      (int*&)p[0] = &arg2; p+=sizeof(int);
    
      // 0FAF05 imul eax,[arg3]
      *p++ = 0x0F;
      *p++ = 0xAF;
      *p++ = 0x05;
      (int*&)p[0] = &arg3; p+=sizeof(int);
    
      *p++ = 0xA3; // mov [res1],eax
      (int*&)p[0] = &res1; p+=sizeof(int);
    
      *p++ = 0xA3; // mov [res1],eax
      (int*&)p[0] = &res2; p+=sizeof(int);
    
      *p++ = 0xA3; // mov [res1],eax
      (int*&)p[0] = &res3; p+=sizeof(int);
    
      *p++ = 0x61; // popa
      *p++ = 0xC3; // ret
    
      funcptr func;
      func.y = buf;
    
      arg1 = 7; arg2 = 11; arg3 = 13;
      res1 = 0; res2 = 0; res3 = 0;
    
      func.x(); // call generated code
    
      printf("args = %d %d %d res = %d %d %d\n",
        arg1, arg2, arg3, res1, res2, res3);
    
      return 0;
    }
    I tested with 4 compilers in 32 bit Vista. Visual C++ crashes and gives a DEP (data execution prevention) error in all 3 cases, with or without optimization, and whether or not arg1, arg2, arg3, res1, res2, res3 are declared volatile or not.

    Borland 5.5.1 works in all cases.

    Digital Mars 8.42n works in all 3 cases except with optimization (-o) in static memory. I tried making the variables volatile but it made no difference.

    g++ 4.5.0 works when the code is in static memory. It gives an incorrect result (0 600747290 0 instead of 1001 1001 1001) on the heap when compiled with -O, but this problem goes away when the variables are declared volatile or the code is not optimized. It crashes on the stack when compiled with -O but works unoptimized. Making the variables volatile doesn't make a difference.

    Just to be sure, I tried putting all of the variables in one place (stack, heap, or static) but this seems to make no difference in the results.

  21. #21
    Member m^2's Avatar
    Join Date
    Sep 2008
    Location
    Ślůnsk, PL
    Posts
    1,611
    Thanks
    30
    Thanked 65 Times in 47 Posts
    For Visual Studio you can add /NXCOMPAT:No to the linker command line.

  22. #22
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 779 Times in 486 Posts
    Thanks. I just figured that out and was about to post but you beat me to it. In fact, the code works on the stack, heap, or static, with optimization, and without volatile variables as far as I can tell. I tested like this:

    cl /O2 program.cpp /link /nxcompat:no

    Edit: this is off by default in g++ but you can turn it on with -Xlinker --nxcompat
    Last edited by Matt Mahoney; 15th April 2011 at 00:37.

  23. #23
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 779 Times in 486 Posts
    I figured out why g++ -O was crashing when the code was executed on the stack. The writing into buf was completely optimized out! It was trying to execute a string of 0 bytes. I figured this out by looking at the generated assembler code with -S.

    To fix that, I made p volatile. This fixed the problem with g++ -O but it still crashed with g++ -O2 because the address assignments to buf were still optimized out. To fix that, I replaced the (int*&) cast with memcpy(). This fixed the problem. (Casting to (volatile int*&) did not work). So now everything works with g++ -O2, VC++ /O2 /nxcompat:no, and Borland -O -6. It still fails with Mars -o in static memory but I don't know why.

    Code:
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    
    typedef unsigned char byte;
    typedef void (*pfunc)(void);
    
    union funcptr {
      pfunc x;
      volatile byte* y;
    };
    
    int main( void ) {
    
    // uncomment one of the next 3 lines
    //  byte buf[1<<10]={0};
    //  static byte buf[1<<10];  // dmc -o crashes?
    //  byte* buf = (byte*)calloc(1, 1<<10);
    
      if (!buf) return 1;  // calloc failed?
      volatile byte* p = buf;
    
      volatile int arg1=0, res1=0;
      volatile static int arg2=0, res2=0;
      volatile int &arg3=(int&)p[1000], &res3=(int&)p[1004];
      void *arg1p=(void*)&arg1, *arg2p=(void*)&arg2, *arg3p=(void*)&arg3;
      void *res1p=(void*)&res1, *res2p=(void*)&res2, *res3p=(void*)&res3;
    
      *p++ = 0x60; // pusha
    
      *p++ = 0xA1; // mov eax, [arg1]
      memcpy((void*)p, &arg1p, 4);
      p+=sizeof(int);
    
      // 0FAF05 imul eax,[arg2]
      *p++ = 0x0F;
      *p++ = 0xAF;
      *p++ = 0x05;
      memcpy((void*)p, &arg2p, 4);
      p+=sizeof(int);
    
      // 0FAF05 imul eax,[arg3]
      *p++ = 0x0F;
      *p++ = 0xAF;
      *p++ = 0x05;
      memcpy((void*)p, &arg3p, 4);
      p+=sizeof(int);
    
      *p++ = 0xA3; // mov [res1],eax
      memcpy((void*)p, &res1p, 4);
      p+=sizeof(int);
    
      *p++ = 0xA3; // mov [res1],eax
      memcpy((void*)p, &res2p, 4);
      p+=sizeof(int);
    
      *p++ = 0xA3; // mov [res1],eax
      memcpy((void*)p, &res3p, 4);
      p+=sizeof(int);
    
      *p++ = 0x61; // popa
      *p++ = 0xC3; // ret
    
      funcptr func;
      func.y = buf;
    
      arg1 = 7; arg2 = 11; arg3 = 13;
      res1 = 0; res2 = 0; res3 = 0;
    
      func.x(); // call generated code
    
      printf("args = %d %d %d res = %d %d %d\n",
        arg1, arg2, arg3, res1, res2, res3);
    
      return 0;
    }

  24. #24
    Programmer osmanturan's Avatar
    Join Date
    May 2008
    Location
    Mersin, Turkiye
    Posts
    651
    Thanks
    0
    Thanked 0 Times in 0 Posts
    I've successfully rewritten effective address encoding routine. Now, it covers almost all modes except direct register addressing (i.e. mov eax, esi where esi is address. one can use mov eax, [esi] instead). All special cases are handled now (including direct addressing). So, I've changed all opcode encoding accordingly. Also, all arguments' addresses are encoded as immediate values into native code. So, no stack access. As a result shorter programs produced. But, it's interesting that speed didn't changed at all (if someone has enough patient to test it with previous version precisely, I would be really happy).

    Seems for a real speed gain we need at least rewrite predict() and update() merge into native code. But, in the end both code requires assembly level optimization which is really hard (register allocation strategies, out-of-order aware instruction reordering etc.). Anyway here is a test while having some background applications:

    No code has to be inserted here.

    BTW, I've covered more commands (~51%) but they are still incomplete (including branches). So, it still only runs fast.cfg or mid.cfg.

    P.S.: Just ignore "Command/Opcode Address Mapping Table". It's related to next release which will possibly involve branches.
    Attached Files Attached Files
    BIT Archiver homepage: www.osmanturan.com

  25. #25
    Programmer osmanturan's Avatar
    Join Date
    May 2008
    Location
    Mersin, Turkiye
    Posts
    651
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Matt, thanks for your effort but you really don't need to continue workarounds. I've solved all of my addressing problems. Mod R/M is really poorly documented. Also, special cases are really special cases Now, I can even encode mov eax, [0x12345678 + 4*edi] which is actually not documented everywhere. It's rare
    BIT Archiver homepage: www.osmanturan.com

  26. #26
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 779 Times in 486 Posts
    Here are some timing tests for enwik7 (first 10 MB of enwik in fast and mid modes (c1 and c2 in zpaq). Times are for compression (c) or decompression (x). The first 7 tests are for zpaq where I modified the optimized predict(), update(), and run() from libzpaqo.cpp by replacing them with the default interpreted functions predict0(), update0(), and run0() one at a time. The tests show that predict() and update() have a somewhat larger effect on performance than run().

    The last 2 tests are with your zpaqJIT2. These are analogous to my last 2 tests where predict() and update() are interpreted and I compare the effects of optimizing only run(). It looks like the effects are similar, i.e. your JIT code is as fast as compiled code.

    Code:
    fast mid
     7.2 23.5 zpaq c (compress, all ZPAQL compiled)
     8.1 24.1 zpaq x (decompress, all ZPAQL compiled)
     9.3 31.0 zpaq x (predict() interpreted)
     9.3 29.7 zpaq x (update() interpreted)
     9.8 27.6 zpaq x (run() interpreted)
    10.5 38.1 zpaq x (predict() and update() interpreted)
    12.7 41.5 zpaq x (all interpreted)
    
    11.6 41.5 unzpaq (predict() and update() interpreted)
    13.3 45.0 unzpaqOriginal (all interpreted)
    Test details: tested on a 2.0 GHz T3200 under 32 bit Vista. Times are wall times with CPU at 98-99% on one of 2 cores. I compiled zpaq with g++ 4.5.0:

    g++ -O2 -march=pentiumpro -s -fomit-frame-pointer -DNDEBUG zpaq.cpp libzpaq.cpp libzpaqo.cpp

    Finally I repeated my last 2 tests compiling with VC++ (cl /O2 /DNDEBUG). It is a bit slower than g++.

    Code:
    11.8 40.4 zpaq x (predict() and update() interpreted)
    14.4 43.7 zpaq x (all interpreted)

Similar Threads

  1. C++ compile-time constant detection
    By Shelwien in forum The Off-Topic Lounge
    Replies: 5
    Last Post: 5th August 2010, 09:11
  2. Better compression performance across time?
    By Trixter in forum Data Compression
    Replies: 16
    Last Post: 17th June 2008, 00:35
  3. Forum improvement
    By Lasse Reinhold in forum The Off-Topic Lounge
    Replies: 1
    Last Post: 13th May 2008, 17:48
  4. nice CM improvement
    By toffer in forum Forum Archive
    Replies: 4
    Last Post: 10th January 2008, 10:21
  5. TC 5.0dev11 is here - Time to gain compression!
    By encode in forum Forum Archive
    Replies: 38
    Last Post: 1st August 2006, 10:24

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •