
Thread: Reading to a buffer

  1. #1
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,475
    Thanks
    26
    Thanked 121 Times in 95 Posts

    Reading to a buffer

    I have a long-standing doubt: if I'm reading into a buffer of size N and there are at least N bytes unread in the input stream, will fread/ReadFile/whatever always fill the entire buffer? I'm asking because if that's true, it would simplify things considerably in some situations, e.g. when I'm reading data records that are larger than one byte.

    A similar question could be asked about writing. Are there any situations where a write could fail once and succeed on the next try? Except for running out of disk space, of course.

  2. #2
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,424
    Thanks
    223
    Thanked 1,053 Times in 565 Posts
    1. If you're reading from a local file which is not being appended to by another thread/process,
    then you'd get N bytes when they're available in the file.
    Unfortunately there are also practical cases where it's not true - pipe i/o, for example.
    E.g. reading from stdin in a program started as "dir | program" can return without filling the buffer.
    At least, it seems that ReadFile doesn't return with 0 bytes read before EOF, even for a pipe,
    so it's not necessary to check for EOF using some other API.
    2. Writing is buffered, so even for pipes it usually works as expected.

  3. #3
    Member m^2's Avatar
    Join Date
    Sep 2008
    Location
    Ślůnsk, PL
    Posts
    1,611
    Thanks
    30
    Thanked 65 Times in 47 Posts
    I'd say: go by the specification. In the long run it's usually cheaper and makes your code more reliable. By depending on implementation-specific behaviour your code may not work correctly in some obscure cases or may stop working any time MS changes their code. And Shelwien, does your comment apply to things like WinCE? After all, it's an entirely different beast.

  4. #4
    Expert Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 779 Times in 486 Posts
    In Linux, read() or fread() from a pipe or socket will wait for input rather than return 0 if open in blocking mode. In non-blocking mode, read() will return -1 and fread() will return 0.

    http://linux.die.net/man/2/read
    http://linux.die.net/man/3/fread
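    A minimal sketch of a read() loop that copes with both cases (POSIX only; read_fully is just an illustrative helper name, not taken from any of the programs discussed here):
    Code:
    #include <errno.h>
    #include <stddef.h>
    #include <unistd.h>

    /* try to fill exactly `len` bytes from fd; a result < len means EOF or a real error */
    size_t read_fully( int fd, void* buf, size_t len ) {
      size_t done = 0;
      while( done<len ) {
        ssize_t r = read( fd, (char*)buf+done, len-done );
        if( r>0 ) { done += (size_t)r; continue; }
        if( r==0 ) break;                  /* EOF */
        if( errno==EINTR ) continue;       /* interrupted by a signal, retry */
        if( errno==EAGAIN ) continue;      /* non-blocking fd, nothing there yet (a real program would poll() here) */
        break;                             /* real error */
      }
      return done;
    }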

  5. #5
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,475
    Thanks
    26
    Thanked 121 Times in 95 Posts
    Thanks for the comments. I want to make my streaming compressors work on standard pipes (stdin and stdout). As it doesn't make sense for a compressor to read from a non-blocking pipe, I'll probably check for that and report an appropriate error. If someone wanted to compress from a non-blocking pipe, they could make something à la a nonblocking2blocking program which would do the workaround.
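    On the POSIX side that check could be as simple as this (just a sketch using fcntl; the exact error handling is of course up to the program):
    Code:
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    /* somewhere at the start of main(): refuse a non-blocking stdin
       (or, alternatively, clear the O_NONBLOCK flag and continue) */
    int flags = fcntl( STDIN_FILENO, F_GETFL, 0 );
    if( flags!=-1 && (flags & O_NONBLOCK) ) {
      fprintf( stderr, "error: stdin is a non-blocking pipe\n" );
      exit( 1 );
      /* or: fcntl( STDIN_FILENO, F_SETFL, flags & ~O_NONBLOCK ); */
    }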

    I'm also thinking about a special external control mode, in addition to the compression and decompression modes. Control mode would read a byte stream from stdin containing commands like opening input and output files, transmitting packets of data into and out of the de-/compression engine, hibernating to a file (or to a series of packets on stdout), resetting a model, resetting (flushing) the range coder, setting compression options, asking for some statistics, etc. This way I could easily create a Java app which would communicate with the native version of the compressor through standard input/output pipes.
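    None of this exists yet, so purely as an illustration: the command stream could be as simple as an opcode byte plus a length-prefixed payload. A hypothetical sketch of the dispatch loop (the opcodes and handlers are made up):
    Code:
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* hypothetical wire format: 1 opcode byte, 4-byte little-endian payload length, payload */
    enum { CMD_OPEN_INPUT=1, CMD_DATA_PACKET=2, CMD_RESET_MODEL=3, CMD_QUIT=4 };

    int control_loop( void ) {
      for(;;) {
        int op = getchar();
        if( op==EOF || op==CMD_QUIT ) return 0;
        unsigned char lenbuf[4];
        if( fread(lenbuf,1,4,stdin)!=4 ) return 1;
        uint32_t len = lenbuf[0] | lenbuf[1]<<8 | (uint32_t)lenbuf[2]<<16 | (uint32_t)lenbuf[3]<<24;
        unsigned char* payload = (unsigned char*)malloc( len ? len : 1 );
        if( !payload || fread(payload,1,len,stdin)!=len ) { free(payload); return 1; }
        switch( op ) {
          case CMD_OPEN_INPUT:  /* open_input(payload,len);  - made-up handler */ break;
          case CMD_DATA_PACKET: /* feed_engine(payload,len); - made-up handler */ break;
          case CMD_RESET_MODEL: /* reset_model();            - made-up handler */ break;
        }
        free( payload );
      }
    }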

  6. #6
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,424
    Thanks
    223
    Thanked 1,053 Times in 565 Posts
    1. When stdin/stdout are redirected to files, they're not really streams, as all the file operations
    (i.e. seek etc.) apply to them. And as a workaround for streams, you can make a wrapper:
    Code:
    #include <windows.h>

    typedef unsigned int  uint;
    typedef unsigned char byte;

    // read exactly `len` bytes if possible; ReadFile may return less than
    // requested (eg. for pipes), so keep reading until the buffer is full
    uint file_sread( HANDLE file, void* _buf, uint len ) {
      byte* buf = (byte*)_buf;
      uint r;
      uint flag=1;
      uint l = 0;
      do {
        r = 0;
        flag = ReadFile( file, buf+l, len-l, (LPDWORD)&r, 0 );  // flag==0 on error
        l += r;
      } while( (r>0) && (l<len) && flag );
      return l;  // l<len means EOF or error
    }
    (The function is taken from the shar source; the include/typedefs are added here so it compiles standalone.)
    I think it's not really a problem to use such a wrapper everywhere, as the comparisons in it are much cheaper than the actual reading anyway.
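    For completeness, the matching write-side wrapper would presumably look like this (not from shar, just the obvious mirror of the read loop above):
    Code:
    // write exactly `len` bytes if possible; WriteFile may also write less than
    // requested (eg. for pipes), so loop the same way as for reading
    uint file_swrite( HANDLE file, void* _buf, uint len ) {
      byte* buf = (byte*)_buf;
      uint r;
      uint flag=1;
      uint l = 0;
      do {
        r = 0;
        flag = WriteFile( file, buf+l, len-l, (LPDWORD)&r, 0 );  // flag==0 on error
        l += r;
      } while( (r>0) && (l<len) && flag );
      return l;  // l<len means an error (eg. disk full)
    }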

    2. For interprocess communication like you described, I prefer to use TCP.
    It can be connected to at any time, so the same engine can be used for all operations,
    while with stdio you'd have to restart it in an app where compression is not in the main module.

  7. #7
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,475
    Thanks
    26
    Thanked 121 Times in 95 Posts
    1. I'll probably use such a loop when reading files (a portable sketch is at the end of this post). I will of course need something more complicated for reading variable-length commands. And of course I can't use ReadFile for portability reasons - I'm developing on Ubuntu anyway.

    2. Writing a small TCP server should be easy and that's what I'm planning to do in the long run, but it seems somewhat unnecessary for now as my compressors are still heavily experimental. Such a small server could then manage a pool of engines in case there are many cores/CPUs, provided that compression throughput scales at least somewhat with the number of processes. But locally, pipes seem least problematic for single-process/single-thread processing. Additionally, such a server could easily be written in Java, enabling many additional possibilities.
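    A plain stdio version of such a read loop might look like this (just a sketch; fread_fully is an illustrative name):
    Code:
    #include <stdio.h>

    /* try to fill exactly `len` bytes; a result < len means EOF or error
       (the caller can tell them apart with feof()/ferror()) */
    size_t fread_fully( FILE* f, void* buf, size_t len ) {
      size_t done = 0;
      while( done<len ) {
        size_t r = fread( (char*)buf+done, 1, len-done, f );
        if( r==0 ) break;
        done += r;
      }
      return done;
    }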

  8. #8
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,511
    Thanks
    746
    Thanked 668 Times in 361 Posts
    these funny tests just read a cached file using various buffer sizes:

    C:\>C:\Testing\rep\011\fazip.exe 4x4:b32k:storing dll4g.dll nul
    100%: 4,531,060,447 -> 4,532,166,685: 100.02% Cpu 1649 mb/s (2.621 sec), real 666 mb/s (6.484 sec) = 40%
    C:\>C:\Testing\rep\011\fazip.exe 4x4:b64k:storing dll4g.dll nul
    100%: 4,531,060,447 -> 4,531,613,581: 100.01% Cpu 3148 mb/s (1.373 sec), real 1115 mb/s (3.876 sec) = 35%
    C:\>C:\Testing\rep\011\fazip.exe 4x4:b128k:storing dll4g.dll nul
    100%: 4,531,060,447 -> 4,531,337,030: 100.01% Cpu 5327 mb/s (0.811 sec), real 1888 mb/s (2.289 sec) = 35%
    C:\>C:\Testing\rep\011\fazip.exe 4x4:b256k:storing dll4g.dll nul
    100%: 4,531,060,447 -> 4,531,198,750: 100.00% Cpu 6925 mb/s (0.624 sec), real 2367 mb/s (1.826 sec) = 34%
    C:\>C:\Testing\rep\011\fazip.exe 4x4:b512k:storing dll4g.dll nul
    100%: 4,531,060,447 -> 4,531,129,614: 100.00% Cpu 6756 mb/s (0.640 sec), real 2671 mb/s (1.618 sec) = 40%
    C:\>C:\Testing\rep\011\fazip.exe 4x4:b1m:storing dll4g.dll nul
    100%: 4,531,060,447 -> 4,531,095,044: 100.00% Cpu 6442 mb/s (0.671 sec), real 2997 mb/s (1.442 sec) = 47%
    C:\>C:\Testing\rep\011\fazip.exe 4x4:b8m:storing dll4g.dll nul
    100%: 4,531,060,447 -> 4,531,064,796: 100.00% Cpu 4261 mb/s (1.014 sec), real 3017 mb/s (1.432 sec) = 71%

    btw, from tests with reading from real disks i remember the rule - the read buffer should be no more than half of the disk's built-in cache (that is 2..64 mb for modern hdds). it's probably because the hdd firmware cannot read ahead otherwise. so, 1mb looks like the best fit

    btw, it changes over time. several years ago Christian Martelock measured it on Win XP and found that 32-64 kb was optimal. Now with Win7 small buffers are so slow...

  9. #9
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,424
    Thanks
    223
    Thanked 1,053 Times in 565 Posts

  10. #10
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,475
    Thanks
    26
    Thanked 121 Times in 95 Posts
    So the all-round optimal (i.e. for interprocess pipes, IP-based protocols, RAM disks, HDDs, SSDs etc.) buffer sizes are probably 512k for reading and some small value for writing, though the graphs aren't very detailed for small x values; a 64k buffer for writing should be good. Or perhaps a calibration mode could be added, plus some options to select buffer sizes.

    Anyway, I think the differences are shockingly big. Read-ahead is probably also implemented at the OS level with much bigger buffers; at least reads can be buffered without problems, so the difference should be small, proportional to the function-call overhead. The OS should take care of optimal buffering (unless you're flushing buffers after every read/write, of course).

  11. #11
    The Founder encode's Avatar
    Join Date
    May 2006
    Location
    Moscow, Russia
    Posts
    3,985
    Thanks
    377
    Thanked 353 Times in 141 Posts
    Quote Originally Posted by Bulat Ziganshin View Post
    btw, from tests with reading from real disks i remember the rule - the read buffer should be no more than half of the disk's built-in cache (that is 2..64 mb for modern hdds). it's probably because the hdd firmware cannot read ahead otherwise. so, 1mb looks like the best fit
    I can confirm that. I tested CHK with various buffer sizes on a 240 GB SSD (Corsair Force GT) - with a 16 MB buffer I can see a notable slowdown. So, I've found a 1 MB buffer optimal.

  12. #12
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,424
    Thanks
    223
    Thanked 1,053 Times in 565 Posts
    Actually I think it's more practical to just always use separate threads for i/o.
    Then the optimal buffer size would only depend on the processing type.
    As to blocking i/o, I already said before that the best is a 1-2M input buffer
    and a 4-8k output buffer.

    Also, I think it would be useful to move the file opening out of the main thread too.
    As it is now, most stronger coders have a significant delay at start; it can be seen e.g. in the ccm graph in
    http://encode.su/threads/435-compression-trace-tool

    The coder would first open the input and output files, allocate and initialize some large chunks of memory,
    and only after that start doing something useful.
    It would require some creativity though, because there's usually no work to do before finishing the model init.
    But maybe we can still use that time to preallocate the space for the output file, open and cache the input file,
    do some preprocessing etc.

  13. #13
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,475
    Thanks
    26
    Thanked 121 Times in 95 Posts
    Well, a separate I/O thread sounds reasonable, but have you measured the impact of buffer size on CPU time overhead? Maybe buffer sizes that aren't multiples of 4096 or perhaps 65536 add some measurable overhead?

    Update:
    As to threaded I/O - I have an idea to keep a queue of many 4 KiB buffers, e.g. 256, and then perform atomic locking on those buffers. I suppose that's the common way of doing threaded I/O anyway.
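    A rough sketch of that scheme with a plain mutex + condition variables instead of lock-free atomics (one producer and one consumer thread assumed; io_queue and the slot functions are made-up names, and mutex/cond initialisation and thread creation are omitted):
    Code:
    #include <pthread.h>
    #include <stdio.h>
    #include <stddef.h>

    #define NBUF  256     /* number of slots in the ring */
    #define BUFSZ 4096    /* 4 KiB per slot */

    typedef struct {
      unsigned char   data[NBUF][BUFSZ];
      size_t          fill[NBUF];          /* valid bytes in each slot */
      int             count;               /* filled but not yet released slots */
      int             eof;
      pthread_mutex_t lock;
      pthread_cond_t  not_empty, not_full;
      FILE*           in;
    } io_queue;

    /* producer thread: fills slots round-robin; it is the only writer of slot w */
    void* reader_thread( void* p ) {
      io_queue* q = (io_queue*)p;
      int w = 0;
      for(;;) {
        pthread_mutex_lock( &q->lock );
        while( q->count==NBUF ) pthread_cond_wait( &q->not_full, &q->lock );
        pthread_mutex_unlock( &q->lock );

        size_t n = fread( q->data[w], 1, BUFSZ, q->in );   /* slot w is free here */
        q->fill[w] = n;
        w = (w+1)%NBUF;

        pthread_mutex_lock( &q->lock );
        if( n>0 ) q->count++;
        if( n<BUFSZ ) q->eof = 1;                          /* short read = EOF/error */
        pthread_cond_signal( &q->not_empty );
        pthread_mutex_unlock( &q->lock );
        if( n<BUFSZ ) return 0;
      }
    }

    /* consumer: wait until the oldest unreleased slot is filled; returns 0 at end of stream */
    int acquire_slot( io_queue* q ) {
      pthread_mutex_lock( &q->lock );
      while( q->count==0 && !q->eof ) pthread_cond_wait( &q->not_empty, &q->lock );
      int ok = q->count>0;
      pthread_mutex_unlock( &q->lock );
      return ok;
    }

    /* consumer: hand the slot back to the producer once it has been processed */
    void release_slot( io_queue* q ) {
      pthread_mutex_lock( &q->lock );
      q->count--;
      pthread_cond_signal( &q->not_full );
      pthread_mutex_unlock( &q->lock );
    }
    The consumer then just walks its own index over the ring: while acquire_slot(q) succeeds, process q->data[r] (q->fill[r] bytes), call release_slot(q) and advance r = (r+1)%NBUF.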
    Last edited by Piotr Tarsa; 10th January 2012 at 21:17.

  14. #14
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,424
    Thanks
    223
    Thanked 1,053 Times in 565 Posts
    1. Surely there's a significant overhead when unaligned buffers are used (both address and size; 4k alignment),
    as the OS can only read and write whole clusters.
    2. 4k buffers for MT i/o are likely not a good idea - they would increase the syncing overhead (also the API call overhead).
    Otherwise the buffer size doesn't matter that much imho, with MT.
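    A 4k-aligned buffer with a size that's also a multiple of 4k is easy to get; e.g. on the POSIX side (just a sketch; on Windows _aligned_malloc or VirtualAlloc would be the rough equivalent):
    Code:
    #include <stdlib.h>

    /* somewhere in the setup code: */
    void* buf = 0;
    size_t size = 1<<20;                           /* 1 MiB, a multiple of 4096 */
    if( posix_memalign( &buf, 4096, size )!=0 )    /* address aligned to 4096 */
      buf = 0;                                     /* allocation failed */
    /* ... read/write in multiples of 4096 ..., then free(buf) */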

  15. #15
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,511
    Thanks
    746
    Thanked 668 Times in 361 Posts
    Quote Originally Posted by Shelwien View Post
    it may be outdated nowadays. in particular, on a ram drive i got exactly the same results as from the OS cache on hdd/ssd

    btw, NTFS cluster size may be up to 64 kb, so it's a question whether it will be fast with 8kb writes

  16. #16
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,511
    Thanks
    746
    Thanked 668 Times in 361 Posts
    Quote Originally Posted by Shelwien View Post
    2. 4k buffers for MT i/o are likely not a good idea - they would increase the syncing overhead (also the API call overhead).
    Otherwise the buffer size doesn't matter that much imho, with MT.
    btw, if we consider "Cpu time" in my test as cpu overhead, it looks like the overhead is smallest in the 256kb-1mb range, and then it starts to grow

  17. #17
    Member m^2's Avatar
    Join Date
    Sep 2008
    Location
    Ślůnsk, PL
    Posts
    1,611
    Thanks
    30
    Thanked 65 Times in 47 Posts
    Quote Originally Posted by Bulat Ziganshin View Post
    NTFS cluster size may be up to 64 kb, so it's a question whether it will be fast with 8kb writes
    ZFS block is 128k by default.

  18. #18
    Member Karhunen's Avatar
    Join Date
    Dec 2011
    Location
    USA
    Posts
    91
    Thanks
    2
    Thanked 1 Time in 1 Post
    You may also want to consider using a link-layer protocol like ATA over Ethernet, if your data is to stay local. I know that the TCP stack is not necessarily the best network protocol if your compressor is being starved for data over TCP. I found http://code.google.com/p/ggaoed/ but I am sure other AoE implementations exist.
