Results 1 to 12 of 12

Thread: Binary Size-Optimization Algorithm Idea

  1. #1
    Member
    Join Date
    Jan 2012
    Location
    Sopianae
    Posts
    32
    Thanks
    9
    Thanked 0 Times in 0 Posts

    Lightbulb Binary Size-Optimization Algorithm Idea

    Hi!
    I just want to show my new idea about a code optimizing algorithm for you.
    Now it is just in a test-form, and I still have some problem to make a really usable automatic application from this algorithm. But it is not a difficult algorithm - I think - and I'm still amateur in programming, so I hope somebody can help me or make a program for the idea.
    However, I wrote a C# code of it, but it has a bug, and because of that I could make that run with pauses.
    Also, this code is only usable on command-line programs, which do data manipulations on files, and also it is very-very slow.
    And sorry for my English and wrong phrases.

    First about the main idea:
    Imagine a simple Win32 native (or it can be .NET application as well (now, but not for the program, I will explain it later)) program. It contains loads of "strange" and "baffling" block of bytes, (for example we can see this with a Hex-Editor), which can be "only" clear for the machine (e.g. the reason why we say "machine-code"). But in most cases (in my manual tests the 100% of the tested programs) there are some block of bytes, which aren't important (or aren't important in every case) during the running of the program, so they can be changed without the program crashes. In most cases, they can't be exactly deleted from the hexadecimal code of the program, because it has a fixed size for the running and several jump-point (of whatever), do after a direct byte-deleting it crashes terribly again. But as I said, they can be changed, and for example, all of them can be changed to zero bytes (00), so after it the program can be compressed with a little better ratio.
    For example:
    In almost every native Win32 exe, in the start of the code there is a code block, which tells 'This program can't be run in DOS mode' blablabla, when we would try to run the program in pure DOS. But every normal user know, that trying to run the program in DOS mode is a stupid thing, so this block of code is unnecessary, so we can fill the bytes with '00' zero bytes, so that the program can be compressed to smaller and it can still run on Windows.
    In my tests, I made a small native program in C++ which could produce a sine wave sound on a frequency, which I wrote into the console window. I compilled it with all of the size optimizations with Visual C++ 2008 and it was about (or more than) 7 KB. Than I manually searched all of the bytes, which could be cleared to zero-byte and wasn't necessary for the right program running (It was a very hard and boring work still on 7000 character with the permanent tests). When I was done, I packed my version with UPX, and also the original one. The original one - which wasn't modifyted manually - with UPX compression was about 5,5-6 KB, and my modifyted version was almost 4 KB. However, the MS Visual Studio's compiller and code optimizer isn't the best, but I think it shows that in bigger native codes it would be useful.
    But there are some problem and limitation in it's practice:
    First, doing it manually would be extremly, very-very huge and long work. Doing it on that 7 KB native code in that test without any mistake was about 3 hour for me - and the potential applications would take MegaBytes! It's 1000× hard work to optimizing manually with this technology.
    In the other hand, we could implement a programcode of this algorithm, but there is a problem during the testing : in the bigger codes (I mean which have about more than 10 command, so I mean all of the potential programs) there is several blocks, which can be cleared to zero-bytes and the program can start, however it will have crazy bugs in it e.g. in it's appearance. For example, I have a command line audio equalizer, which I modifyted on it's native byte code, and now it produces a strange BitMachine effect on the audios. However, I made a "Good" bug for me in this case, general people would use the applications in it's original form.
    So to make a usable code of the idea/algorithm, we must find out a point programmatically during testing the application, which shows for the code correctly that the modifited application's tested full part (which would be used by the user) works 100% correctly and safe.

    This is the reason, why my code would works only with file-/data.manipulating command line applications.
    First, here is the C# code :

    Code:
    using System;
    using System.IO;
    using System.Diagnostics;
    class a
    {
        static string m = "modify.exe", n, o, p;
        static Process e = new Process();
        static void t()
        {
            e = Process.Start(new ProcessStartInfo("taskkill", "/IM " + m + " /T /F"));
            e.WaitForExit();
        }
    
        static void w(string z) { Console.Write(z); }
        static void Main()
        {
            w("Sample of original program-work : "); o = Console.ReadLine();
            w("Sample of test-work : "); n = Console.ReadLine();
            w("Command-Line parameters (write it carefully!) : "); p = Console.ReadLine();
            w("Maximum time to wait for the executable (in milliseconds): "); var q = int.Parse(Console.ReadLine());
            w("Original EXE (extension is given) : ");
    
            byte[] a = File.ReadAllBytes(Console.ReadLine()+".exe"), j = File.ReadAllBytes(o);
            byte d;
            long b = a.LongLength, c = 64, g = j.LongLength, h;
            var e = new Process();
    
            for (; c < b; c++) if (a[c] != 0) {
                    var f = true;
                    d = a[c]; a[c] = 0;
    
                    try { File.WriteAllBytes(m, a); }
                    catch
                    {
                        t();
                        e = Process.Start(new ProcessStartInfo("taskkill", "/IM WerFault.exe /T /F"));
                        e.WaitForExit();
                        File.WriteAllBytes(m, a);
                    }
    
                    var l = new ProcessStartInfo(m, p);
                    l.WindowStyle = ProcessWindowStyle.Hidden;
                    try
                    {
                        e = Process.Start(l);
                        e.WaitForExit(q);
                    }
                    catch { f = false; }
    
                    if (f && File.Exists(n) && new FileInfo(n).Length == g)
                    {
                        BinaryReader i = new BinaryReader(File.OpenRead(n)), k = new BinaryReader(File.OpenRead(o));
                        for (h = 0; h < g; h++) if (i.ReadByte() != k.ReadByte()) { f = false; break; }
                        i.Close(); k.Close();
                    }
                    else f = false;
    
                    if (!f) a[c] = d;
                    try { File.Delete(n); }
                    catch
                    {
                        t();
                        File.Delete(n);
                    }
    
                    Console.SetCursorPosition(0, 6); Console.Write(c + " / " + b + "  [" + f + "]");
            }}}
    It works as the followings:
    Imagine, that we would like to optimize one of the PAQ8 versions' data-compressing native application with this program. We would use the application at maximum level always later, so the parameter (excluding the names) would always '-9'.
    The optimizer program have to test the differences always, after every cleared byte, and it must use a reference point/data, which was produced by a correct running. So we take a small file (imagine about 10 KB), called 'testfile.dat', and we compress it with the PAQ8 application, so the outcome will be 'testfile.paq8'. Everybody know, that in this case we used the 'paq8 -9 testfile.dat' command in the command line window. Then we keep the original ('testfile.dat') and the compressed ('testfile.paq8') test file, because we will need them later.
    Then we run the optimizer code. First it ask the sample datafile of the program-work, which in this case is 'testfile.paq8', the compressed testfile. It will be used for comparing that if the modifited binary code works right. Secondly, we have to give a filename for the permanent testings. Pay attention, that later (in the next point) we have to use the same name with this outfile-name. This is the filename, which will be produced by the modifited version permanently and it will be compared with the static test file ('testfile.paq'). For example, give 'testfile2.paq8'.
    Next, we have to give the command line parameters for the application, which used to run/use it. Pay attention, which I've written above, that if a filename have to be given for the application's product in it's parameters, it must be the same with the sample of test-work, in the second input. So in this case we have to input something like this : "-9 testfile2.paq8 testfile.dat"
    And at last, we have to give the application's name, which we would like to "optimize". Then the program will shows how many bytes have been checked to replacing with a zero-byte, and if the replacing/clearing produced a "positive" test.
    The main problem now - excluded the speed - is that in most cases if the testing produce a program-crash, the modifited application/EXE must be closed correctly before the next step.
    As you can see, I've also tried the Windows' taskkill command for it, but sometimes it throw an exception in this case too, that the program still used, and the code stops. Maybe - I think - sometimes it takes a little more time to unload the EXE from the memory.
    But WHAT SHOULD I DO?

    Any ideas how to How to proceed?, What next? Is those right which I did?

  2. #2
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,257
    Thanks
    307
    Thanked 797 Times in 489 Posts
    Just because you can zero out a byte of code and it still passes one test, doesn't mean the program isn't broken.

  3. #3
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,572
    Thanks
    783
    Thanked 687 Times in 372 Posts
    btw, i've used such technique to develop crash-proof SREP/Tornado decompression - it gets a compressed file, tries to modify every byte in it and runs the decompressor on the modified file. crashes were manually recorded

  4. #4
    Member
    Join Date
    May 2008
    Location
    HK
    Posts
    160
    Thanks
    4
    Thanked 25 Times in 15 Posts
    Quote Originally Posted by Bulat Ziganshin View Post
    btw, i've used such technique to develop crash-proof SREP/Tornado decompression - it gets a compressed file, tries to modify every byte in it and runs the decompressor on the modified file. crashes were manually recorded
    It seems that you're finding zzuf

  5. #5
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,572
    Thanks
    783
    Thanked 687 Times in 372 Posts
    sort of it, but i've tried all 256 byte modifications over the first 256 bytes or so. and it just modified input file on disk instead of doing black I/O magic

  6. #6
    Member
    Join Date
    Jan 2012
    Location
    Sopianae
    Posts
    32
    Thanks
    9
    Thanked 0 Times in 0 Posts
    Yes, maybe one of the options or parts of the program will be broken, but this is the reason why I said it works on specialized command line programs. And if I want "optimise" one command line program e.g. for UPX for me in this way, I want to keep just that options of the program which I would use always in that case.
    As I said in the post, for example I would use the compressor always at level 9 (I know it needs a loads of memory at level 9, but I have a HiTech PC, and I would use the compressor on small files...). If I passed all test and cleared all "unnecessary" bytes, maybe the compressor would not work on the other levels and wouldn't recognise the other parameters again ("-0", "-1", "-2", ... "-8"), or for example it wouldn't work with directories again, etc., but as I said I would like to use the "optimalised" version at the highest level and from command line, so those options are unnecessary.
    And if I checked the output always with the same file and this output is minimum some hundred bytes big, and all of the comparisons was right between the sample-file and the output file during the clearing, it is really foolproof, that it will works with other inputs.

  7. #7
    Member m^2's Avatar
    Join Date
    Sep 2008
    Location
    Ślůnsk, PL
    Posts
    1,610
    Thanks
    30
    Thanked 65 Times in 47 Posts
    It should work.
    Just be sure not to compress anything else but that file with the same options, on the same computer (incl. unchanged system settings, installed programs, HDD contents etc.). Otherwise, the technique will fail.
    ADDED:
    The list is likely incomplete. It's hard to predict all failure cases.
    ADDED: And I can give you a good example.

    LZ4. The high level overview of the codec is extremely simple:
    Code:
    if file is < 64K
        compress it with a small-file mode
    else
        compress it with a big-file mode
    Simple, isn't it?
    But let's say that you have a 10 KB test file. Your optimiser will zero out the entire big file mode because on your example it's never used. And you'll end up with a program that crashes on anything bigger than 64K.
    Last edited by m^2; 17th March 2012 at 11:48.

  8. #8
    Member
    Join Date
    Jan 2012
    Location
    Sopianae
    Posts
    32
    Thanks
    9
    Thanked 0 Times in 0 Posts
    Ok, I understand. So if the compresser is using an other method for small files and an other for big files, I have to check all the two case.
    In this case, I think it can be solved, if the "optimizer" program would use a batch command file (*.bat, you know) for running - which would be written by the user -, and for the file-checkings the optimizer would use a complete file list, that which output have to be compared which sample file - however, this filelist have to be written by the user too.
    It is true, that it needs more attention, competence and manuality from the user, but I sad, this program is still an idea and test, so it is not for average users now.
    I am right?

  9. #9
    Member
    Join Date
    Jan 2012
    Location
    Sopianae
    Posts
    32
    Thanks
    9
    Thanked 0 Times in 0 Posts
    But which I wrote in the first post, that I still don't know, how can I close a program (or crashed program especially) forced to be able to rewrite it again - especially in C#!
    Any idea?!

  10. #10
    Member m^2's Avatar
    Join Date
    Sep 2008
    Location
    Ślůnsk, PL
    Posts
    1,610
    Thanks
    30
    Thanked 65 Times in 47 Posts
    Quote Originally Posted by paqfan View Post
    Ok, I understand. So if the compresser is using an other method for small files and an other for big files, I have to check all the two case.
    In this case, I think it can be solved, if the "optimizer" program would use a batch command file (*.bat, you know) for running - which would be written by the user -, and for the file-checkings the optimizer would use a complete file list, that which output have to be compared which sample file - however, this filelist have to be written by the user too.
    It is true, that it needs more attention, competence and manuality from the user, but I sad, this program is still an idea and test, so it is not for average users now.
    I am right?
    The point is that you need to understand the meaning of each byte before you zero it out. A blind 'let's change it and see if things break' won't work because unless you understand what does the byte do, you can't predict what can you break by changing it.

  11. #11
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,982
    Thanks
    298
    Thanked 1,309 Times in 745 Posts
    1. I also have a "mapper" tool like that, which patches the .exe, runs the test, and keeps the patch if test passes.
    (its a perl script though). I tried to use it for extraction of relevant code for reverse-engineering though, but
    I guess it can be in theory useful for HP entries and such.

    2. Note that it matters what kind of bytes you're trying to patch the code with.
    For example, 00 00 ... 00 runs turn into some ADD instructions with memory targets,
    so results can be very weird.
    The best idea is usually to patch with 0xCC bytes (because it turns into a breakpoint instruction),
    but compression-wise 0x90 (NOP) would be likely better, because there may be some useless
    instructions in the middle of the code which could be removed, but still have to be replaced with
    something executable.
    Anyway, its actually very hard to do such a coverage analysis perfectly - for example, there may
    be an unnecessary PUSH instruction, but it only can be removed along with the matching POP instruction,
    or otherwise there'd be problems with stack.

    3. As to exception handling, I made a script which runs the target program in debug mode and handles
    its exceptions, I can post a link if necessary.
    It also should be possible to create a suspended process and inject a SEH frame into it before running.
    But unfortunately it seems that such an approach doesn't cover 100% of possible exceptions, because
    the OS loader also can crash sometimes, if we'd mess with .exe structures.

  12. #12
    Member
    Join Date
    Feb 2012
    Location
    China
    Posts
    3
    Thanks
    0
    Thanked 0 Times in 0 Posts
    If you modify the original data, but can not restore them, your program won't be a loseless compressor anymore, maybe an optimizer?
    Even you make clear what every byte in the file means, had better not to zero them.

Similar Threads

  1. C++ optimization guide
    By Bulat Ziganshin in forum The Off-Topic Lounge
    Replies: 4
    Last Post: 25th May 2012, 12:23
  2. which tool has the samllest size/orignal size * time
    By l1t in forum Data Compression
    Replies: 7
    Last Post: 27th August 2010, 06:10
  3. bzip2 dictionary size
    By Wladmir in forum Data Compression
    Replies: 3
    Last Post: 7th April 2010, 16:09
  4. New 7zip SFXs optimization !!!
    By Yuri Grille. in forum Data Compression
    Replies: 6
    Last Post: 4th May 2009, 23:42
  5. Parameter optimization
    By toffer in forum Data Compression
    Replies: 42
    Last Post: 30th June 2008, 00:53

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •