6.17.2009

Asynchronous I/O Using Boost

One of the things I've been spending a lot of time on lately is measuring the performance and scalability of asynchronous disk I/O versus synchronous I/O for an application that reads and processes gigabytes (or even terabytes) of data off of various types of filesystems such as NTFS and ext. The current code we have that handles the I/O reading is completely synchronous, and uses a rather naive memory model. Imagine something like this (simplified for the sake of brevity):

const boost::uint16_t chunk_size = 4096; //read 4KB at a time
boost::uint64_t remaining = get_byte_count();
boost::uint64_t offset = 0;

while (remaining > 0)
{
    const boost::uint16_t read_bytes = static_cast<boost::uint16_t>(
        std::min<boost::uint64_t>(chunk_size, remaining));
    std::vector<char> read_buf(read_bytes); //re-allocated every iteration
    read_data(offset, read_bytes, &read_buf[0]);

    //process the data

    offset += read_bytes;
    remaining -= read_bytes;
}


Obviously this is silly for a lot of reasons, most notably the re-allocation of a 4KB buffer every iteration through the loop. It also doesn't allow the processing of one chunk of data to happen while reading the next chunk of data. If the processing is something that takes a long time, like encrypting or compressing the data, you can basically get all of it for free, since disk I/O is usually pretty slow.
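To make the overlap concrete, here is a minimal double-buffering sketch using boost::thread. It assumes the same hypothetical read_data() helper as above, plus a process_data() function standing in for whatever work is done on each chunk; while a worker thread chews on one buffer, the main thread reads the next chunk into the other:

#include <boost/cstdint.hpp>
#include <boost/thread.hpp>
#include <algorithm>
#include <vector>

//Hypothetical helpers, as in the snippet above.
void read_data(boost::uint64_t offset, boost::uint16_t size, char* buf);
void process_data(const char* data, boost::uint16_t size);

void copy_overlapped(boost::uint64_t remaining)
{
    const boost::uint16_t chunk_size = 4096;
    std::vector<char> bufs[2];
    bufs[0].resize(chunk_size);
    bufs[1].resize(chunk_size);

    boost::uint64_t offset = 0;
    int cur = 0;
    boost::thread worker; //not-a-thread until the first assignment

    while (remaining > 0)
    {
        const boost::uint16_t read_bytes = static_cast<boost::uint16_t>(
            std::min<boost::uint64_t>(chunk_size, remaining));

        //read into one buffer while the worker processes the other
        read_data(offset, read_bytes, &bufs[cur][0]);

        if (worker.joinable())
            worker.join(); //previous chunk finished; its buffer is free again

        worker = boost::thread(process_data, &bufs[cur][0], read_bytes);

        offset += read_bytes;
        remaining -= read_bytes;
        cur = 1 - cur; //swap buffers
    }

    if (worker.joinable())
        worker.join();
}

This only hides one chunk's worth of processing behind each read, but it illustrates the principle; the completion-port model discussed below generalizes it.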

Anyway, running the code above in the more complex setting in which I'm actually using it, on a SATA 3.0Gb/s interface with a purported maximum sustained transfer rate of 78MB/s, I get 25MB/s. Pretty god awful.

I tried to do some benchmarking using Intel Thread Profiler and Micro Focus DevPartner Studio, but both of them tanked when profiling my code for more than a few seconds. I think this was in part due to the producer/consumer model in our legacy codebase, where "producing" involved generating a 4KB chunk of data, and "consuming" involved sending that same 4KB chunk across a network connection. Reading this much data happens so fast that we were generating millions and millions of context switches. So many context switches, in fact, that Intel Thread Profiler would die; I guess it couldn't deal with context switches happening every few microseconds.

I decided to try an asynchronous approach. I didn't think it would buy much in this limited setting, because the processing of the data had been left commented out just to get a theoretical upper bound on how fast I could read. My initial attempt was to use I/O Completion Ports directly through the Windows API. The I/O Completion Port model is basically Windows' answer to high-performance, scalable asynchronous I/O. There are ultimately 4 different categories of I/O:

* synchronous blocking I/O

* synchronous non-blocking I/O

* asynchronous blocking I/O

* asynchronous non-blocking I/O

I/O Completion Ports are the 4th type. For details on how to support I/O Completion Ports in your own application, please refer to the MSDN documentation or to Jeffrey Richter's excellent book Windows via C/C++. At a high level, the programming model using IOCP can be expressed with the following pseudo-code:

//Define constants for the events that we care about.
//We use this code when beginning an operation on an
//IOCP, and IOCP uses this code to notify us of what
//type of operation completed.
typedef enum {
    CustomEvent1,
    CustomEvent2,
    /*...*/
} IocpEventType;

IOCP port = CreateIoCompletionPort(/*...*/);
HANDLE device = CreateFile(/*...*/);

IocpAssociateDeviceEvent(port, device, CustomEvent1);
IocpAssociateDeviceEvent(port, device, CustomEvent2);

/*...*/

while (!done)
{
    IocpEventType what = GetQueuedCompletionStatus(port, /*...*/);
    switch (what)
    {
    case CustomEvent1:
        /*If this was, for example, a read, we should begin
          an asynchronous write to save the data*/
        break;
    case CustomEvent2:
        //Perhaps a write completed, in which case we can read
        //new data.
        break;
    }
}

The beauty of this is that non-I/O-related tasks can easily fit into this model. In the handling of CustomEvent1, maybe we want to encrypt the data. Before the loop begins we can initialize a thread pool (Windows even provides built-in worker thread pool support) and post a message to the pool telling it there is new data to work on. When a worker is done with the data, it simply calls PostQueuedCompletionStatus() with a different value of the enumeration, such as EncryptionEvent.
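Here is a sketch of the worker side of that idea. PostQueuedCompletionStatus() is the real Windows API call; the EncryptionEvent code and the EncryptJob struct are illustrative names of my own, not part of the API:

#include <windows.h>

//Illustrative event code; like CustomEvent1/2 above, it is just a value we
//choose, carried as the completion key so the main loop can dispatch on it.
const ULONG_PTR EncryptionEvent = 3;

//Hypothetical description of one unit of work handed to the pool.
struct EncryptJob
{
    HANDLE port; //completion port to notify when done
    char*  data; //chunk to encrypt in place
    DWORD  size; //chunk size in bytes
};

DWORD WINAPI encrypt_worker(LPVOID param)
{
    EncryptJob* job = static_cast<EncryptJob*>(param);

    // ... encrypt job->data in place ...

    //Wake the main loop: its GetQueuedCompletionStatus() call returns
    //with EncryptionEvent as the completion key.
    PostQueuedCompletionStatus(job->port, job->size, EncryptionEvent, NULL);
    return 0;
}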

Benchmarking this solution showed that I was now reading off the disk at a sustained speed of about 35MB/s, with occasional burst speeds of 150-200MB/s. This was exciting! But not good enough. There are still a few concerns:

1. Portability is a requirement of this application. It must work on Linux.

2. Although I'm not always reading from the disk sequentially, I am reading sequentially more often than not. If I can determine that the next 1MB of data will all be sequential, why should I read 4KB at a time? Or if only the next 8KB of data is sequential, I can still read 2 chunks at a time. How should I determine exactly how much to read? (A small sketch of the idea follows this list.)

3. Why was the asynchronous model faster? In neither benchmark was there a significant amount of computation going on with the data. There was the wasteful per-iteration buffer allocation in the first example, but that certainly doesn't account for 10MB/s of lost throughput.
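On the second point, the shape of the answer is simple even before the exact numbers are tuned. Here is a small hypothetical helper; next_sequential_bytes() is an assumed function standing in for whatever file-layout knowledge the caller has, and the 1MB cap is an arbitrary illustrative value:

#include <boost/cstdint.hpp>
#include <algorithm>

const boost::uint32_t chunk_size = 4096;

//Assumed helper: how many contiguous bytes lie ahead of 'offset',
//based on whatever the caller knows about the file layout.
boost::uint64_t next_sequential_bytes(boost::uint64_t offset);

//How many bytes to request in the next read: as many whole 4KB chunks
//as are known to be sequential, but at least one chunk, capped at 1MB.
boost::uint32_t next_read_size(boost::uint64_t offset)
{
    const boost::uint64_t max_batch = 1024 * 1024;
    const boost::uint64_t sequential = next_sequential_bytes(offset);
    const boost::uint64_t chunks =
        std::max<boost::uint64_t>(1, sequential / chunk_size);
    return static_cast<boost::uint32_t>(
        std::min<boost::uint64_t>(chunks * chunk_size, max_batch));
}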

To address the first concern I began looking into Boost.Asio. Although all of the documentation, samples, and pre-built classes are geared towards network usage, I saw no reason I could not fit the model to work with disk I/O, or for that matter arbitrary computations.
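As a taste of why that seemed plausible: nothing about Boost.Asio's core event loop is network-specific. A minimal sketch, using io_service (the name of the dispatcher class in Boost versions of this era):

#include <boost/asio.hpp>
#include <iostream>

//An arbitrary computation; nothing about it involves sockets.
void process_chunk()
{
    std::cout << "processing a chunk" << std::endl;
}

int main()
{
    boost::asio::io_service service;

    //Queue up arbitrary work...
    service.post(&process_chunk);
    service.post(&process_chunk);

    //...and run() dispatches the queued handlers on this thread. With a
    //pool of threads each calling run(), the same object becomes a
    //general-purpose work queue.
    service.run();
    return 0;
}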

In my next post I will give a brief tutorial on Boost.Asio (the documentation leaves a little to be desired), describe what I had to do to fit the model I needed around the Boost.Asio model, and report the performance I was able to get out of it. After that, I will discuss how I addressed the second issue, as well as a nice use of inline assembly that, combined with the asynchronous model, allowed me to achieve a massive performance increase for these reads. Finally, I will offer some thoughts on the third point.
