Async IO on Linux
trunk/ just got support for Linux native AIO.
I implemented async IO based on libaio, which is a minimal wrapper around the AIO syscalls of the 2.6.x kernels.
Implementation
It was a bit tricky to get it working as libaio is basically undocumented, but hey … that’s why we are hackers :)
The async file IO support is part of Linux 2.6.9 and later and should be on every recent Linux box. A separate library called libaio provides very simple wrappers and is used as the base for the new network backend.
The idea is:
- create a buffer in /dev/shm and mmap() it
- start an async read() from the source file to the mmap() buffer
- wait until the data is ready
- use sendfile() to send the data from /dev/shm to the network socket
Important for the performance: the data is never copied into user space. We only move it from one side of the kernel to the other side.
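The four steps above can be sketched in C roughly as follows. This is a minimal sketch, not the code from trunk/: it calls the raw AIO syscalls directly instead of going through libaio so that it builds without extra libraries, the function name aio_shm_send() and the /dev/shm file name are made up for illustration, and it opens the source file without O_DIRECT, so the kernel is free to complete the read synchronously inside io_submit.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <linux/aio_abi.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/sendfile.h>
#include <sys/socket.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Send the first `len` bytes of src_path to out_fd through a /dev/shm buffer.
 * Returns the number of bytes passed to sendfile(), or -1 on error.
 * Hypothetical helper, for illustration only. */
ssize_t aio_shm_send(const char *src_path, int out_fd, size_t len)
{
    char shm_path[] = "/dev/shm/aio-sketch-XXXXXX"; /* made-up name */
    aio_context_t ctx = 0;
    struct iocb cb;
    struct iocb *cbs[1] = { &cb };
    struct io_event ev;
    void *buf = MAP_FAILED;
    int shm_fd, src_fd = -1, have_ctx = 0;
    off_t off = 0;
    ssize_t sent = -1;

    assert(src_path != NULL);

    /* 1. create a buffer in /dev/shm and mmap() it */
    shm_fd = mkstemp(shm_path);
    if (shm_fd < 0) return -1;
    unlink(shm_path);
    if (ftruncate(shm_fd, (off_t)len) != 0) goto out;
    buf = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, shm_fd, 0);
    if (buf == MAP_FAILED) goto out;

    /* 2. start an async read from the source file into the buffer */
    src_fd = open(src_path, O_RDONLY);
    if (src_fd < 0) goto out;
    if (syscall(SYS_io_setup, 1L, &ctx) != 0) goto out;
    have_ctx = 1;
    memset(&cb, 0, sizeof cb);
    cb.aio_fildes = (uint32_t)src_fd;
    cb.aio_lio_opcode = IOCB_CMD_PREAD;
    cb.aio_buf = (uint64_t)(uintptr_t)buf;
    cb.aio_nbytes = len;
    cb.aio_offset = 0;
    if (syscall(SYS_io_submit, ctx, 1L, cbs) != 1) goto out;

    /* 3. wait until the data is ready (blocks, no timeout) */
    if (syscall(SYS_io_getevents, ctx, 1L, 1L, &ev, NULL) != 1) goto out;
    if (ev.res < 0) goto out; /* negative res is -errno of the read */

    /* 4. sendfile() from the /dev/shm file to the socket */
    while (off < (off_t)ev.res) {
        ssize_t n = sendfile(out_fd, shm_fd, &off, (size_t)(ev.res - off));
        if (n <= 0) goto out;
    }
    sent = (ssize_t)off;

out:
    if (have_ctx) syscall(SYS_io_destroy, ctx);
    if (src_fd >= 0) close(src_fd);
    if (buf != MAP_FAILED) munmap(buf, len);
    close(shm_fd);
    return sent;
}
```

The mapping only serves as the target address for the read; the process never touches the bytes itself, which is the point of the exercise. A real backend would keep the AIO context and the buffer around between requests instead of creating them per call.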
Hack ahead
Sadly I had to add pthread to the dependencies. Having threads in a single-threaded server is a bit strange, but it is necessary.
fdevent_poll() was waiting for fd-events for up to 1s, and while it was waiting the whole server was waiting. Fetching the async notifications is a blocking call too, and we can’t make either of the two return as soon as the other one has something to report.
If necessary we start an io-getevent thread which runs in parallel to the fdevent_poll() call. Whichever call returns first interrupts the other one by sending SIGUSR1 to the process. That makes the waiting call (poll() or io_getevents()) return with EINTR, and we can continue with the result of the call that finished.
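To illustrate the interruption trick, here is a minimal sketch of one direction of it: a helper thread stands in for the io-getevent thread and wakes the thread that is sitting in poll(). It deviates from the description above in two labeled ways: it uses pthread_kill() to aim the signal at a specific thread rather than signalling the whole process, and it uses ppoll() with a signal mask so that a SIGUSR1 arriving just before the wait starts is not lost. The names wait_with_helper() and helper_main() are made up, and a nanosleep() takes the place of the blocking io_getevents() call.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <pthread.h>
#include <signal.h>
#include <stddef.h>
#include <time.h>

/* The handler does nothing; it only exists so SIGUSR1 causes EINTR. */
static void on_sigusr1(int signo) { (void)signo; }

struct helper_args {
    pthread_t target; /* thread to wake up */
    long delay_ms;    /* stand-in for the time io_getevents() would block */
};

static void *helper_main(void *p)
{
    struct helper_args *a = p;
    struct timespec ts = { a->delay_ms / 1000, (a->delay_ms % 1000) * 1000000L };
    nanosleep(&ts, NULL);             /* "the async event arrived" */
    pthread_kill(a->target, SIGUSR1); /* interrupt the poller */
    return NULL;
}

/* Returns 1 if the wait was cut short by SIGUSR1, 0 on timeout, -1 on error. */
int wait_with_helper(long helper_delay_ms, long poll_timeout_ms)
{
    struct sigaction sa;
    sigset_t block, orig, during_poll;
    struct helper_args args;
    struct timespec to = { poll_timeout_ms / 1000,
                           (poll_timeout_ms % 1000) * 1000000L };
    pthread_t helper;
    int rc, result;

    sa.sa_handler = on_sigusr1;
    sa.sa_flags = 0;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGUSR1, &sa, NULL) != 0) return -1;

    /* Keep SIGUSR1 blocked except while we are inside ppoll(). */
    sigemptyset(&block);
    sigaddset(&block, SIGUSR1);
    if (pthread_sigmask(SIG_BLOCK, &block, &orig) != 0) return -1;
    during_poll = orig;
    sigdelset(&during_poll, SIGUSR1);

    args.target = pthread_self();
    args.delay_ms = helper_delay_ms;
    if (pthread_create(&helper, NULL, helper_main, &args) != 0) return -1;

    /* No fds here; a real server would pass its socket set. */
    rc = ppoll(NULL, 0, &to, &during_poll);
    result = (rc == -1 && errno == EINTR) ? 1 : (rc == 0 ? 0 : -1);

    pthread_join(helper, NULL);
    pthread_sigmask(SIG_SETMASK, &orig, NULL); /* drains a late SIGUSR1 */
    return result;
}
```

With a short helper delay and a long poll timeout the call returns 1 well before the timeout expires; with the values swapped it returns 0. The opposite direction (the poller interrupting a thread blocked in io_getevents()) would work the same way with the roles reversed.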
Benchmarks
The testbed:
- RAID1 (Linux md) over two ST3160827AS disks (SATA, 120Mb each)
- nVidia Corporation CK8S as SATA controller
- AMD Athlon™ 64 Processor 3000+
- Linux 2.6.16.21-0.25-xen (SuSE 10.1)
siege, 700Mb
I’ll compare linux-sendfile vs. linux-aio-sendfile.
$ siege --reps=1 -c 1 --benchmark http://127.0.0.1:1025/file-700M
------ -------------------------- --------------------------- ----------------------------
conc   non-aio                    aio [512k]                  aio [1M]
1      52.38 MB/sec [9% idle]     89.85 MB/sec [70% idle]     107.50 MB/sec [67% idle]
2      39.94 MB/sec [8% idle]     94.52 MB/sec [70% idle]     92.74 MB/sec [70% idle]
5      35.45 MB/sec [7% idle]     31.81 MB/sec [86% idle]     72.84 MB/sec [70% idle]
10     ..                         25.22 MB/sec [82% idle]     32.87 MB/sec [90% idle]
------ -------------------------- --------------------------- ----------------------------
More important than the throughput is the CPU time that can now be spent on other tasks.
What’s next?
Next is bug fixing, load testing (more parallel connections), random load, …