
Thread: On Fast I/O speed

  1. #1
    Member
    Join Date
    Sep 2008
    Location
    France
    Posts
    869
    Thanks
    470
    Thanked 261 Times in 108 Posts

    On Fast I/O speed

    Hello

I'm currently completing a compression algorithm I've been working on for a long time; it still belongs to the "faster than HDD" league, but with a more interesting compression ratio.

For such a speedy algorithm, I/O handling accounts for an important part of the overall time spent by the program.

I'm trying to collect some information on how to do this properly, in order to replace my pretty crude implementation currently in effect within "Zhuff", for example. Quite unexpectedly, I'm finding little of value so far. This is surprising; I was expecting this matter to be somewhat 'solved' by now.

The way I guess some time could be saved is by parallelizing HDD reads/writes with the compression itself. A simple idea, but more complex to execute.

Does parallelizing mean multi-threading?
If it does, then my worries start with the dependence on OS-level calls.

An ideal solution would use system calls generic enough not to require an entire rewrite when moving from Windows to another OS.
fopen/fclose belong to this category.
CreateThread, HANDLE, and WaitForSingleObject, for example, do not.
Note that "asynchronous" Windows reads/writes belong to the OS-specific category too.

Another completely different road would be to "expect" the OS to be clever enough to do the background job by itself.
For example, when sequentially loading small parts of the source file, maybe the OS is smart enough to start loading the next chunk before it is actually required. I guess this can work, but only up to a size limit.
Now, could that work for writing too?

I'm in "search mode" right now, and would love to read any information in this field.
Alternatively, some working file i/o source code could help (although I'm pretty bad at reading other people's code).

    Regards

  2. #2
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,594
    Thanks
    251
    Thanked 1,135 Times in 624 Posts
    > CreateThread, HANDLE, WaitForSingleObject, for example, does not look that way.

You're supposed to use some available multi-platform libraries, like pthreads
or OpenMP.
But these are bloated (at least on Windows), and imho it's better to
just add multi-platform wrappers for the few necessary nonportable functions.

    For an example, see http://www.quicklz.com/qpress-039-source.zip

    > Another completely different road could be to "expect" the OS to be clever enough to do the background job by itself.

Actually Windows does exactly that (probably Linux too, but I didn't check) -
that's why it's so hard to get any improvement with simple ideas.
But with a single thread, you have to call the OS for i/o and wait
until it processes your request, which is wasted time anyway, even
if the only thing the OS does in your thread is copy a data buffer.

And as to optimal input and output buffer sizes with a single thread,
there was this:
    http://encode.dreamhosters.com/showp...8&postcount=17

  3. #3
    Member m^2's Avatar
    Join Date
    Sep 2008
    Location
    Ślůnsk, PL
    Posts
    1,611
    Thanks
    30
    Thanked 65 Times in 47 Posts
    Quote Originally Posted by Shelwien
    For an example, see http://www.quicklz.com/qpress-039-source.zip
    Last time I checked, I/O was holding qpress way back compared to QuickLZ itself, so I don't think it's a good example.
    Unless it changed, of course.

  4. #4
    Member
    Thanks Shelwien;
    I was actually already going through the threads you mention.


    qpress seems an interesting source code, with a working pthread implementation. I have a hard time grasping how it works, though, and it seems tied to QuickLZ. There are certainly some good ideas within, but I may need quite some time to understand them.

    Your own article is pretty straightforward. And by the way, bcopy is currently running on my system to collect some figures (I sometimes get an error message stating "process tried to write into non-existent pipe").
    I'm not sure I properly understand the graph, but your wording is very clear: target around 1MB chunks for reads, while write chunks should be as small as possible (well, down to 4K...).
    Edit: after a few tests with bcopy, it seems writing reaches a plateau at about 64KB.

    With that in mind, it is possible to write something and get a first idea.
    Last edited by Cyan; 7th March 2010 at 23:36.

  5. #5
    Member

    And what about a ring memory buffer?

    As a sidenote, since I'm working on both at the same time:
    I'm trying to build a ring-buffer implementation, the idea being that it allows keeping data from previously loaded chunks in memory for long-range matches.

    A quick & dirty implementation would require going through the list of pointers and discarding all invalid pointers each time a new chunk is loaded.
    This sounds wasteful (my current implementation uses 512K pointers, so going through all of them would be a pretty long job).

    So I'm trying to find a better implementation, if that's possible.
    One idea would be to use a relative index.
    This one has a flaw too, regarding files larger than 4GB, but that can be solved later on.
    This seems an interesting idea if a memory area can be accessed using:

    character = start_of_ring_buffer[ index & buffersize_mask];
    or
    BYTE* character = start_of_ring_buffer + (index & buffersize_mask) ;
    This would require an addition for each access, which is probably not free performance-wise.
    But discarding too-old pointers would become possible on the fly, by storing the index instead of the pointer and checking its "age".
    Whether "on the fly" is more efficient than discarding pointers directly in the list when loading a new chunk, that's a good question...

    Another option would be to use the index value as a pointer directly, avoiding the addition. But this is only possible if:
    start_of_ring_buffer = 0;
    which I'm not sure is even possible...

    I've also considered the properties of aligned memory areas, such as:
    __declspec(align(buffersize)) BYTE* start_of_ring_buffer;
    which ensures that (index & buffersize_mask) is the same as (&character & buffersize_mask).
    But I'm no longer certain this brings any benefit.
    Moreover, I don't know how to do that with malloc() or an equivalent...

    Well, any advice / suggestion / opinion on this?

  6. #6
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,525
    Thanks
    748
    Thanked 672 Times in 364 Posts
    The CCM author once wrote (on the old forum) that the optimum is 32-64KB, and I got the same result. Try to find that old thread.

    Moreover, i don't know how to do that with malloc() or equivalent...
    p = (BYTE*)(((uintptr_t)malloc(size+3) + 3) & ~(uintptr_t)3);  /* 4-byte align */
    Last edited by Bulat Ziganshin; 8th March 2010 at 00:28.

  7. #7
    Member m^2's Avatar
    Quote Originally Posted by Cyan
    Moreover, i don't know how to do that with malloc() or equivalent...
    There is an aligned malloc for Windows:
    http://msdn.microsoft.com/en-us/libr...8VS.80%29.aspx

  8. #8
    Administrator Shelwien's Avatar
    > One idea, would be to use a relative index.
    > This one has a flaw too, regarding files larger than 4GB,
    > but this can be solved later on.

    It's only a problem if some pointer remains untouched
    while processing 4GB of data... which is rare, and also
    doesn't really matter if you verify the matches anyway.

    But as an alternative, you can store 64-bit offsets...
    or a buffer index + offset... it's not really as slow as it sounds.

    > Moreover, i don't know how to do that with malloc() or equivalent...

    A page (4K) alignment makes sense for most large buffers,
    and especially if they're used for i/o.
    But unfortunately gcc has a problem with static alignments larger than 16.

    And there are some aligned mallocs in VS, but you can easily
    implement one yourself, like
    Code:
    void* my_alloc( size_t size, size_t align ) {
      char* p = (char*)malloc( size + align-1 );
      p += align-1;
      p -= (uintptr_t)p % align;
      return p;  /* note: the raw pointer is lost, so this block can't be free()d */
    }

  9. #9
    Member
    Thanks, yes, exactly. I had been considering it,
    but initially I thought I should look for a fully aligned buffer,
    which meant:

    char* start_of_ring_buffer = (char*)malloc( 2 * buffersize );
    start_of_ring_buffer += buffersize - 1;
    start_of_ring_buffer -= (uintptr_t)start_of_ring_buffer % buffersize;
    which obviously is much too wasteful (buffersize can be large, a few hundred megabytes...).

    However, if it is only meant for I/O speed, there is no need for a fully aligned buffer, and this can be done with much smaller overhead.

    To m^2: yes, I've been using VirtualAlloc up to now, which provides aligned allocation, and I guess _aligned_malloc probably calls into it in the background. Now, if the same can be done without OS specifics, all the better.
    Last edited by Cyan; 8th March 2010 at 02:14.

  10. #10
    Member
    Join Date
    Feb 2010
    Location
    Nordic
    Posts
    200
    Thanks
    41
    Thanked 36 Times in 12 Posts

    Fast IO is completion-signaled IO

    (Just background info about fast file IO.)

    Typically, you want to give the OS a long list of things to write to the file(s), and then let it tell you when it's done.

    By giving it a long list of things, it has the opportunity to reorder the reads/writes optimally, and with insight into all the other I/O the system is being asked to perform by other tasks at the same time.

    This is quite the opposite of asynchronous ("non-blocking") socket I/O, which is typically readiness-based. And it's a lot less well documented.

    Windows has perhaps the best API for this, called "completion ports".

    On Linux, it's a bit uglier, but it's there: it's called Asynchronous I/O (AIO) (libaio).

  11. #11
    Member
    Nice hint, I will look into it.
    Thanks willvarfar!

    Edit: there is a nice website, int64.org, from an experienced Windows programmer, which gives some very interesting information on completion ports. Looks like a pretty good bet (although the implementation is OS-specific...)
    Last edited by Cyan; 8th March 2010 at 19:01.

  12. #12
    Administrator Shelwien's Avatar
    > The old ways of multiplexing I/O still work pretty well for scenarios with a low number of concurrent connections, but when writing a server daemon to handle a thousand or even tens of thousands of concurrent clients at once, we need to use something different.
    I still think that creating 1-2 extra threads and doing fread/fwrite there
    is a better idea for a compressor which only reads one file and writes another.

  13. #13
    Member
    It's a natural progression.

    It all starts with fread() and fwrite(); these are blocking calls. The operating system does its best to anticipate reads and buffer things up, but writes always go to disk before your program continues.

    (You can mount some filesystems with write-back buffering on some operating systems, but only the crazy do that.)

    Now it's hard to imagine a compression program that thrashes the disk like, say, a webserver or a database does. But people are usually surprised by how *slow* disk is. An interface like SATA is rated at several Gbps, but the disk at the end of the SATA cable, if it's mixing reads and writes, might manage 4 MBps. Yes, you read that right: bytes, not bits. If you're only writing it can go a bit faster, and if you're only reading it can go much faster, but still in MB territory.

    There are 'O_NONBLOCK' and similar flags for file descriptors, but they're for sockets and the like rather than regular files. If you think it through, they're inappropriate for regular files.

    You can also configure write buffer sizes (setbuf) on a per-file basis on some platforms; all this does is gather small writes in your process and turn them into occasional big writes; it doesn't give the operating system visibility of a write until the buffer fills. A good way to see whether this is a factor is to use fputc instead of fwrite, and see if your program is faster.

    So people go and put the reading and writing in a separate thread, with a queue to and from it...

    Let's take that solution apart. It means you have queues and locks and such, and two or more threads, and the reading and writing still uses fwrite. It just gives you more code to get right, especially the locking around the message queue. And fundamentally you still have a blocking call: it will only write as fast as it can round-trip each write to the disk.

    The operating system is far cleverer at writing things to disk, for two reasons: 1) it's written by people who really know what they're doing with files; 2) it has system-wide knowledge: it knows what you are doing, and also what everyone else, including itself with swap files and memory-mapped files, is reading and writing. It can reorder reads and writes so that they are in the optimal order for the disk heads and such. I/O scheduling is an art form, and the various operating systems are putting a lot of work into the subject as we speak.

    However, the OS can only reorder your reads and writes relative to other apps, since fread/fwrite only give it one read or write at a time. readv and writev hardly help in this regard; they are still fundamentally one logical read or write.

    So in the last 10 years, 'completion-based signalling' for I/O has been introduced. You give the OS as much to read and write as possible, in an asynchronous way, and it tells you as bits get read and written. This allows all the reordering goodness within the scope of your app, not just relative to other apps, and also lets your application do other things (compressing?) while the I/O is outstanding, and you don't need buckets of threads to do it.

    So from a technical standpoint, completion-based I/O is the way to go if you want to move beyond fread and fwrite. Putting your fread and fwrite into dedicated I/O threads is not a lot of gain for quite a lot of pain.

    But, really, for a compression program that is a sequential operation on a file, anything like this might be overkill. You really need profiling that shows your program stalls doing I/O before you invest in this.

    So my advice: 1) profile; 2) use fputc instead of fwrite, profile again, and report back the results; 3) if 'spinning off' your I/O is justified, go with completion-based I/O without bothering with blocking I/O in worker threads.
    Last edited by willvarfar; 9th March 2010 at 10:23.

  14. #14
    Administrator Shelwien's Avatar
    I'm not really sure what kind of algo you assume there,
    but compressors in general look like this:
    Code:
      do {
        (buf,len) = read_input( bufsize );
        (buf2,len2) = process_data( buf,len );
        write_output( buf2, len2 );
      } while( len==bufsize );
    Of course, quite a few people use getc/putc all over the program
    instead, but that's not worth talking about; it can be optimized
    to buffered i/o and in-memory processing without any platform-specific methods.

    But then there's the problem of blocking.
    That is, read_input() and write_output() waste some time even
    though the OS normally only has to copy some memory buffers there
    (with the common-sense buffer sizes discussed before).

    So it's possible to improve processing speed by handling the i/o
    in other thread(s), in parallel with the (de)compression.
    That way, we can read more data into a new buffer while processing
    the old buffer in another thread, etc.
    And implementing that basically requires only one platform-specific
    function: CreateThread or similar. Well, also Sleep or SuspendThread,
    but those won't improve processing speed on a multicore system, though
    they would reduce the load.
    Thus it's easy to write a multiplatform wrapper function for that
    and make a portable compressor with async i/o.

    Now, considering "completion ports": it's basically a Windows-specific
    API, complex enough to allow an efficient implementation of such
    processing routines. And its only noticeable benefit is that it removes
    the need to do something about i/o threads when they have no
    work, which amounts to 1-2 calls per data block.
    In fact, it's quite believable that when handling 1000 streams at once
    that would be a noticeable improvement (though likely not a large one anyway).
    But in the specific case of Cyan's compressor, the i/o threads would likely
    be fully loaded and the processing thread would have to wait instead.
    So "completion ports" won't do anything except make the program
    hard to port.

    Cyan: also note that it makes sense to do the fopens asynchronously
    as well - especially the one for reading.
    It's a bit hard to invent a job for the main thread to do
    before it has access to the input data, but still, at least things like
    initializing the statistical model and precalculating lookup tables can be
    done there. Preallocation of large buffers can probably be placed
    there too.
    Last edited by Shelwien; 9th March 2010 at 11:05.

  15. #15
    Member
    Thanks for input.

    In order to provide a bit more information, here is, in pseudo-code, what I have in mind:

    async_read_first_chunk();
    while (!eof)
    {
    async_read_next_chunk();
    check_current_chunk_is_read(); // <== blocking

    compress_current_chunk();

    async_write_current_chunk();
    check_previous_chunk_is_written(); // <== blocking

    previous_chunk++; next_chunk++; current_chunk++;
    }
    Note there are still some blocking waits. The read one is necessary to ensure that data is properly loaded before being processed, and the write one is just to avoid queuing too many "write" operations.

    This is basically the equivalent of "double buffering" applied to file i/o.

    As with graphics, this could be extended to "triple buffering" or even more, but I highly doubt it would really improve performance.
    The basic point is: ensure the CPU is doing compression while I/O is being performed on the "previous segment" write and the "next segment" read.

    I already implemented this once in the past, for LZP2, using CreateThread and WaitForSingleObject for the blocking wait:
    hWriteThread = CreateThread(NULL,0,WriteAsync,&segment[framenb],0,&rstart);
    It worked properly on Windows XP, although with minimal gains, but crashed on Windows 7. I therefore stopped using it, due to instability. Then again, I may simply not have programmed it properly...
    If completion ports can do it efficiently, then why not...

  16. #16
    Programmer Bulat Ziganshin's Avatar
    Double buffering may not be enough, since compression speed isn't constant: some blocks may compress faster, some slower.

    Anyway, 32-64 KB buffers work great on Windows. It's just hard to believe that the solution is so easy.
    Last edited by Bulat Ziganshin; 9th March 2010 at 22:01.

  17. #17
    Programmer
    Join Date
    May 2008
    Location
    Denmark
    Posts
    94
    Thanks
    0
    Thanked 2 Times in 2 Posts

    qpress

    Quote Originally Posted by m^2
    Last time I checked, IO was holding qpress way back, comparing to quick, so I don't think it's a good example.
    Unless it changed, of course.
    qpress was single-threaded until version 0.20 or 0.30 (I can't check the exact version number right now). So you might want to benchmark it again.

    It's using pthreads, where each thread function does the following in a loop:

    1) read a chunk
    2) compress the chunk
    3) ensure other threads have written all prior chunks
    4) write the chunk

    With 2 or more threads, I/O should not be able to hold back performance: it overlaps pretty well, on all operating systems.

    Another way would be to have worker threads and one main thread which hands out work to the workers.

    qpress is slightly bound to QuickLZ because it expects the qlz_size_compressed() and qlz_size_decompressed() functions to exist, and their information is stored in a header of the compressed data, so you only need to read 9 bytes from disk to call them.

    All the levels.c stuff is there so the compression level can be selected at runtime. It should be easy to remove.
    Last edited by Lasse Reinhold; 14th March 2010 at 17:06.
