trunk/ just got support for Linux Native AIO.
I implemented async IO based on libaio, which is a minimal wrapper around the AIO syscalls of the 2.6.x kernels.
Implementation
It was a bit tricky to get it working as libaio is basically undocumented, but hey … that’s why we are hackers :)
The async file IO support is part of Linux 2.6.9 and later and should be on every recent Linux box. A separate library called libaio provides very simple wrappers and is used as the base for the new network backend.
The idea is:
- create a buffer in /dev/shm and mmap() it
- start an async read() from the source file to the mmap() buffer
- wait until the data is ready
- use sendfile() to send the data from /dev/shm to the network socket
Important for the performance: the data is never copied into user space. We only move it from one side of the kernel to the other side.
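The four steps above can be sketched roughly like this. It is only a minimal illustration, not the code in trunk/: the function name `aio_sendfile_sketch` and the buffer file name are made up for the example, and it calls the raw AIO syscalls through syscall() so that it builds without libaio (libaio's io_setup(), io_submit() and io_getevents() wrap the same syscalls). It also opens the source file without O_DIRECT, which is an assumption made to keep the sketch short.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/aio_abi.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/sendfile.h>
#include <sys/socket.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Read up to `len` bytes of `src_path` into a /dev/shm buffer with native
 * AIO, then sendfile() that buffer to `sock_fd`.
 * Returns the number of bytes sent, or -1 on error.
 * Hypothetical helper for illustration only. */
static ssize_t aio_sendfile_sketch(const char *src_path, int sock_fd, size_t len)
{
    char shm_path[] = "/dev/shm/aio-sketch-XXXXXX"; /* made-up name */
    aio_context_t ctx = 0;          /* must be 0 before io_setup() */
    struct iocb cb, *cbs[1] = { &cb };
    struct io_event ev;
    void *buf = MAP_FAILED;
    off_t off = 0;
    ssize_t sent = -1;
    int shm_fd = -1;

    int src_fd = open(src_path, O_RDONLY);
    if (src_fd < 0) return -1;

    /* 1. create a buffer in /dev/shm and mmap() it */
    shm_fd = mkstemp(shm_path);
    if (shm_fd < 0) goto out;
    unlink(shm_path);               /* the open fd keeps the buffer alive */
    if (ftruncate(shm_fd, (off_t)len) < 0) goto out;
    buf = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, shm_fd, 0);
    if (buf == MAP_FAILED) goto out;

    /* 2. start an async read() from the source file into the buffer */
    if (syscall(SYS_io_setup, 1L, &ctx) < 0) goto out;
    memset(&cb, 0, sizeof(cb));
    cb.aio_lio_opcode = IOCB_CMD_PREAD;
    cb.aio_fildes = (uint32_t)src_fd;
    cb.aio_buf = (uint64_t)(uintptr_t)buf;
    cb.aio_nbytes = len;
    cb.aio_offset = 0;
    if (syscall(SYS_io_submit, ctx, 1L, cbs) != 1) goto out;

    /* 3. wait until the data is ready (blocks until one event completes) */
    if (syscall(SYS_io_getevents, ctx, 1L, 1L, &ev, NULL) != 1) goto out;
    if (ev.res <= 0) goto out;      /* ev.res is the byte count or -errno */

    /* 4. sendfile() from the /dev/shm file to the socket.
     * A real server would loop here to handle short writes. */
    sent = sendfile(sock_fd, shm_fd, &off, (size_t)ev.res);

out:
    if (ctx) syscall(SYS_io_destroy, ctx);
    if (buf != MAP_FAILED) munmap(buf, len);
    if (shm_fd >= 0) close(shm_fd);
    close(src_fd);
    return sent;
}
```

The mapping is only needed as the target address of the read; the data itself goes out through the /dev/shm file descriptor, so the server never touches the bytes.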
Hack ahead
Sadly I had to add pthread to the dependencies. Having threads in a single-threaded server is a bit strange, but it is necessary.
fdevent_poll() waits up to 1s for fd-events, and while it is waiting the whole server is waiting. Handling the async notifications is blocking as well, and we can’t make either call return as soon as the other one is done.
If necessary we start an io-getevent thread which runs in parallel to the fdevent_poll() call. The call which returns first interrupts the other one by sending a SIGUSR1 to the process. This makes the waiting call (poll() or io_getevents()) return with EINTR, and we can continue handling the result of whichever call finished.
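A rough sketch of the interruption trick, showing one direction only (the helper thread waking up poll()). All names here are invented for the example. The helper thread just sleeps for 100ms as a stand-in for a blocking io_getevents() call, and it uses pthread_kill() to aim the signal at the waiting thread, which is an assumption made to keep the sketch deterministic rather than the way the server sends SIGUSR1.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <poll.h>
#include <pthread.h>
#include <signal.h>
#include <string.h>
#include <unistd.h>

/* The handler does nothing; it only exists so that SIGUSR1 interrupts a
 * blocking call instead of killing the process. */
static void on_sigusr1(int sig) { (void)sig; }

static pthread_t waiting_thread; /* the thread that sits in poll() */

/* Stand-in for the io-getevent thread: pretend a completion arrives after
 * 100ms, then interrupt the thread that is blocked in poll(). */
static void *getevents_thread(void *arg)
{
    (void)arg;
    usleep(100 * 1000);
    pthread_kill(waiting_thread, SIGUSR1);
    return NULL;
}

/* Returns 1 if poll() was interrupted by the signal, 0 if it ran into its
 * timeout or returned normally, -1 on setup errors. */
static int wait_with_interrupt(int timeout_ms)
{
    struct sigaction sa;
    pthread_t tid;
    int rc, interrupted;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_sigusr1;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;             /* no SA_RESTART: we want EINTR */
    if (sigaction(SIGUSR1, &sa, NULL) < 0) return -1;

    waiting_thread = pthread_self();
    if (pthread_create(&tid, NULL, getevents_thread, NULL) != 0) return -1;

    /* Stand-in for fdevent_poll(): no fds, just the timeout. */
    rc = poll(NULL, 0, timeout_ms);
    interrupted = (rc < 0 && errno == EINTR);

    pthread_join(tid, NULL);
    return interrupted;
}
```

Calling `wait_with_interrupt(1000)` should come back after roughly 100ms instead of a full second. The sketch ignores the case where the signal arrives before poll() has started waiting.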
Benchmarks
As testbed we have:
- RAID1 (Linux md) across two ST3160827AS (SATA, 120Mb each)
- nVidia Corporation CK8S as SATA controller
- AMD Athlon™ 64 Processor 3000+
- Linux 2.6.16.21-0.25-xen (SuSE 10.1)
siege, 700Mb
I’ll compare linux-sendfile vs. linux-aio-sendfile.
|conc|non-aio|aio [512k]|aio [1M]|
|1|52.38 MB/sec [9% idle]|89.85 MB/sec [70% idle]|107.50 MB/sec [67% idle]|
|2|39.94 MB/sec [8% idle]|94.52 MB/sec [70% idle]|92.74 MB/sec [70% idle]|
|5|35.45 MB/sec [7% idle]|31.81 MB/sec [86% idle]|72.84 MB/sec [70% idle]|
|10|…|25.22 MB/sec [82% idle]|32.87 MB/sec [90% idle]|
More important than the throughput is the CPU time that can now be spent on other tasks.
What’s next?
Next is bug fixing, load testing (more parallel connections), random load, …