lighty's life

lighty developer blog

Async IO on Linux

trunk/ just got support for Linux native AIO.

I implemented async IO based on libaio, which is a minimal wrapper around the AIO syscalls of the 2.6.x kernels.

Implementation

It was a bit tricky to get it working as libaio is basically undocumented, but hey … that’s why we are hackers :)

Async file IO support is part of Linux 2.6.9 and later and should be available on every recent Linux box. A separate library called libaio provides very simple wrappers and is used as the base for the new network backend.

The idea is:

  1. create a buffer in /dev/shm and mmap() it
  2. start an async read() from the source file into the mmap() buffer
  3. wait until the data is ready
  4. use sendfile() to send the data from /dev/shm to the network socket

Important for the performance: the data is never copied into user space. We only move it from one side of the kernel to the other side.
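The four steps can be sketched roughly like this. It is only a minimal sketch under a few assumptions: it calls the raw AIO syscalls directly instead of going through libaio (so it builds without the library), it handles a single chunk starting at offset 0, and the function name `aio_chunk_to_socket()` and the `/dev/shm` file name are made up for illustration. The source file is not opened with O_DIRECT here, so the kernel may well complete the read synchronously inside the submit call.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <linux/aio_abi.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/sendfile.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Raw-syscall stand-ins for the libaio wrappers with similar names. */
static long sys_io_setup(unsigned nr, aio_context_t *ctx)
{ return syscall(SYS_io_setup, (long)nr, ctx); }
static long sys_io_submit(aio_context_t ctx, long n, struct iocb **cbs)
{ return syscall(SYS_io_submit, ctx, n, cbs); }
static long sys_io_getevents(aio_context_t ctx, long min_nr, long nr,
                             struct io_event *ev)
{ return syscall(SYS_io_getevents, ctx, min_nr, nr, ev, NULL); }
static long sys_io_destroy(aio_context_t ctx)
{ return syscall(SYS_io_destroy, ctx); }

/* Hypothetical helper: reads up to `len` bytes from the start of src_fd
 * and sends them to out_fd. Returns the number of bytes sent, or -1. */
ssize_t aio_chunk_to_socket(int src_fd, int out_fd, size_t len)
{
    char path[] = "/dev/shm/aio-sketch-XXXXXX"; /* made-up name */
    aio_context_t ctx = 0;
    struct iocb cb, *list[1] = { &cb };
    struct io_event ev;
    void *buf = MAP_FAILED;
    ssize_t sent = -1;
    off_t off = 0;
    int shm_fd;

    assert(len > 0);

    /* 1. create a buffer in /dev/shm and mmap() it */
    shm_fd = mkstemp(path);
    if (shm_fd < 0) return -1;
    unlink(path); /* the fd keeps the buffer alive */
    if (ftruncate(shm_fd, (off_t)len) != 0) goto out;
    buf = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, shm_fd, 0);
    if (buf == MAP_FAILED) goto out;

    /* 2. start an async read from the source file into the buffer */
    if (sys_io_setup(1, &ctx) != 0) goto out;
    memset(&cb, 0, sizeof(cb));
    cb.aio_lio_opcode = IOCB_CMD_PREAD;
    cb.aio_fildes = (uint32_t)src_fd;
    cb.aio_buf = (uint64_t)(uintptr_t)buf;
    cb.aio_nbytes = len;
    cb.aio_offset = 0;
    if (sys_io_submit(ctx, 1, list) != 1) goto out;

    /* 3. wait until the data is ready */
    if (sys_io_getevents(ctx, 1, 1, &ev) != 1 || ev.res < 0) goto out;

    /* 4. sendfile() from the /dev/shm file to the socket */
    sent = sendfile(out_fd, shm_fd, &off, (size_t)ev.res);

out:
    if (ctx) sys_io_destroy(ctx);
    if (buf != MAP_FAILED) munmap(buf, len);
    close(shm_fd);
    return sent;
}
```

A caller would open the source file, pass its descriptor together with the client socket and a chunk size such as 512k, and repeat with increasing offsets; the offset handling is left out to keep the sketch short.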

Hack ahead

Sadly I had to add pthread to the dependencies. Having threads in a single-threaded server is a bit strange, but it is necessary.

fdevent_poll() waits up to 1s for fd events, and while it waits the whole server waits. Waiting for the async notifications is blocking as well, and neither call can be made to return as soon as the other one has something to report.

If necessary we start an io-getevent thread which runs in parallel to the fdevent_poll() call. Whichever call returns first interrupts the other one by sending a SIGUSR1 to the process. This makes the waiting call (poll() or io_getevents()) return with EINTR, and we can continue by handling the result of the call that finished.
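One direction of that wake-up can be sketched as follows. This is an illustration only, not the code in trunk/: the helper thread merely sleeps for 50 ms to stand in for a blocking io_getevents() call, and the names `completion_thread()` and `poll_until_interrupted()` are hypothetical. The helper blocks SIGUSR1 for itself so that the process-wide signal is delivered to the thread sitting in poll().

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <pthread.h>
#include <signal.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

/* Empty on purpose: the handler only exists so that SIGUSR1 interrupts
 * the blocking call instead of terminating the process. */
static void on_sigusr1(int signo) { (void)signo; }

/* Hypothetical stand-in for the io-getevent thread: pretend that an AIO
 * completion shows up after 50 ms, then interrupt the poller. */
static void *completion_thread(void *arg)
{
    sigset_t set;
    struct timespec delay = { 0, 50 * 1000 * 1000 };

    (void)arg;
    /* Block SIGUSR1 here so the signal lands in the main thread. */
    sigemptyset(&set);
    sigaddset(&set, SIGUSR1);
    pthread_sigmask(SIG_BLOCK, &set, NULL);

    nanosleep(&delay, NULL);  /* "waiting for AIO events" */
    kill(getpid(), SIGUSR1);  /* wake up whoever sits in poll() */
    return NULL;
}

/* Returns 1 if poll() was interrupted (EINTR), 0 on timeout, -1 on error. */
int poll_until_interrupted(int timeout_ms)
{
    struct sigaction sa;
    pthread_t tid;
    int rc, err;

    assert(timeout_ms >= 0);

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_sigusr1;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGUSR1, &sa, NULL) != 0) return -1;

    if (pthread_create(&tid, NULL, completion_thread, NULL) != 0) return -1;

    rc = poll(NULL, 0, timeout_ms); /* a real server would pass its fds */
    err = errno;
    pthread_join(tid, NULL);

    if (rc == -1 && err == EINTR) return 1;
    return rc == 0 ? 0 : -1;
}
```

With a 1s timeout, as in fdevent_poll(), the call comes back after roughly 50 ms instead of a full second. One caveat of this simple form: a signal that arrives just before poll() is entered does not interrupt it, so the wait runs into the timeout; ppoll() with a signal mask is the usual way to close that window.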

Benchmarks

As a testbed we have a RAID1 (Linux md) across two disks, on the following hardware:

  • ST3160827AS (SATA, 120Mb each)
  • nVidia Corporation CK8S as SATA controller
  • AMD Athlon™ 64 Processor 3000+
  • Linux 2.6.16.21-0.25-xen (SuSE 10.1)

siege, 700MB

I’ll compare linux-sendfile vs. linux-aio-sendfile.

$ siege --reps=1 -c 1 --benchmark http://127.0.0.1:1025/file-700M


|conc|non-aio|aio [512k]|aio [1M]|
|1|52.38 MB/sec [9% idle]|89.85 MB/sec [70% idle]|107.50 MB/sec [67% idle]|
|2|39.94 MB/sec [8% idle]|94.52 MB/sec [70% idle]|92.74 MB/sec [70% idle]|
|5|35.45 MB/sec [7% idle]|31.81 MB/sec [86% idle]|72.84 MB/sec [70% idle]|
|10|…|25.22 MB/sec [82% idle]|32.87 MB/sec [90% idle]|

More important than the throughput is the CPU time that can now be spent on other tasks.

What’s next?

Next is bug fixing, load testing (more parallel connections), random load, …