Threaded stat()
Just as a proof of concept I implemented a threaded stat() call. It is a bit of a hack currently, but it looks promising when I look at the performance data:
avg-cpu: %user %nice %system %iowait %steal %idle
5.00 0.00 26.60 68.40 0.00 0.00
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s rMB/s wMB/s avgrq-sz avgqu-sz await svctm %util
sda 0.00 0.60 66.90 1.60 13019.20 22.40 6.36 0.01 190.39 6.10 88.20 14.49 99.28
sdb 0.00 0.60 66.60 1.60 13061.60 22.40 6.38 0.01 191.85 14.09 208.82 14.67 100.04
In http://blog.lighttpd.net/articles/2007/01/27/accelerating-small-file-transfers we tried the same without an async stat() and with fcgi-stat-accel. With the threaded stat() I moved the code into lighttpd itself, which removes the external communication and lets lighttpd manage everything on its own.
name               Throughput    util%  iowait%
-----------------  ------------  -----  -------
no stat-accel      12.07MByte/s  81%
stat-accel (tcp)   13.64MByte/s  99%    45.00%
stat-accel (unix)  13.86MByte/s  99%    53.25%
threaded-stat      14.32MByte/s  99%    68.40%
(larger is better)
Implementation
In stat_cache.c I started a separate thread for handling the stat() call, 4 threads to be exact.
stat_cache_get_entry() checks its cache to see if the file is already known. If not, it pushes the filename into the stat_cache_queue and returns HANDLER_WAIT_FOR_EVENT. On the other end of the stat_cache_queue is one of the 4 stat()-threads, which runs the stat() and pushes the connection back into the joblist_queue. In the main loop, right where the poll() call is started, there is now a handler for this queue which simply activates all connections that are in it.
This way we made the stat() call itself async and can leave the rest of the code as is. Up to now we only get the inode into the fs-buffers, as in the other examples; we are not handling the full stat-cache updates in the thread.
gpointer stat_cache_thread(gpointer _srv) {
    server *srv = (server *)_srv;
    stat_job *sj = NULL;

    /* take a reference on the stat-job queue and on the joblist queue */
    GAsyncQueue *inq = g_async_queue_ref(srv->stat_queue);
    GAsyncQueue *outq = g_async_queue_ref(srv->joblist_queue);

    /* get the jobs from the queue; the pop blocks until one arrives */
    while ((sj = g_async_queue_pop(inq))) {
        /* let's see what we have to stat */
        struct stat st;
        /* remember the connection, sj is freed before we push it */
        gpointer con = sj->con;

        /* don't care about the return code for now */
        stat(sj->name->ptr, &st);
        stat_job_free(sj);

        /* hand the connection back to the main loop */
        g_async_queue_push(outq, con);
    }

    return NULL;
}
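The control flow around the thread (cache lookup, queueing, reactivation in the main loop) can be sketched without any threading. This is a minimal single-threaded sketch, not lighttpd's code: the names sketch_server, job_fifo, cache_get_entry(), stat_worker_step() and drain_joblist() are hypothetical, a fixed-size FIFO stands in for the GAsyncQueue, and recording the name as known when the connection is reactivated is an assumption made so that the second lookup succeeds.

```c
#include <stddef.h>
#include <string.h>
#include <sys/stat.h>

/* Hypothetical names throughout; this is not lighttpd's actual code. */
typedef enum { HANDLER_GO_ON, HANDLER_WAIT_FOR_EVENT } handler_t;

#define MAX_ENTRIES 16

typedef struct { const char *name; int con_id; } stat_job;

/* fixed-size FIFO; stands in for a GAsyncQueue in this single-threaded sketch */
typedef struct { stat_job items[MAX_ENTRIES]; size_t head, tail; } job_fifo;

typedef struct {
    const char *known[MAX_ENTRIES]; /* names the cache already knows */
    size_t n_known;
    job_fifo stat_queue;            /* lookups waiting for a stat() */
    job_fifo joblist_queue;         /* connections ready to be resumed */
} sketch_server;

static int fifo_push(job_fifo *q, stat_job j) {
    if (q->tail - q->head == MAX_ENTRIES) return -1; /* full */
    q->items[q->tail++ % MAX_ENTRIES] = j;
    return 0;
}

static int fifo_pop(job_fifo *q, stat_job *out) {
    if (q->head == q->tail) return -1; /* empty */
    *out = q->items[q->head++ % MAX_ENTRIES];
    return 0;
}

/* Cache hit: carry on. Cache miss: queue the name and park the connection. */
static handler_t cache_get_entry(sketch_server *srv, const char *name, int con_id) {
    for (size_t i = 0; i < srv->n_known; i++) {
        if (strcmp(srv->known[i], name) == 0) return HANDLER_GO_ON;
    }
    stat_job j = { name, con_id };
    fifo_push(&srv->stat_queue, j); /* a full queue is ignored in this sketch */
    return HANDLER_WAIT_FOR_EVENT;
}

/* What one stat()-thread does per job, run inline here. Returns 1 if a job was handled. */
static int stat_worker_step(sketch_server *srv) {
    stat_job j;
    struct stat st;
    if (fifo_pop(&srv->stat_queue, &j) != 0) return 0;
    stat(j.name, &st); /* result ignored: the call only warms the fs-buffers */
    fifo_push(&srv->joblist_queue, j);
    return 1;
}

/* Main-loop side: drain the joblist and collect the connections to reactivate. */
static int drain_joblist(sketch_server *srv, int *con_ids, int max) {
    stat_job j;
    int n = 0;
    while (n < max && fifo_pop(&srv->joblist_queue, &j) == 0) {
        /* assumption: the main loop, which owns the cache, records the entry */
        if (srv->n_known < MAX_ENTRIES) srv->known[srv->n_known++] = j.name;
        con_ids[n++] = j.con_id;
    }
    return n;
}
```

In the threaded version, stat_worker_step() is the body of the loop in each of the 4 threads and both queues have to be thread-safe; the cache itself is only touched from the main loop, so it needs no locking.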
Accelerating Small File-Transfers
Thanks to some help from an IRC channel (#lighttpd at irc.freenode.net) we solved another long-standing problem:
As lighttpd is an event-based web server, we have problems when it comes to blocking operations. In 1.5.0 we added async sendfile() operations, which help a lot for large files. For small files, most of the time is spent on the initial stat() call, which has no async interface.
Fobax submitted a nice solution for this problem: move the stat() to a fastcgi app which returns with X-LIGHTTPD-send-file: and hands the request back to lighttpd. The fastcgi app can block and spend some time while lighttpd moves on with the other requests. When the fastcgi app returns, the information for the stat() call is in the fs-buffers and lighttpd doesn’t block on the stat() anymore.
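The core of such a backend is small. The following is a minimal sketch of the per-request logic only, not the actual fcgi-stat-accel source: build_stat_accel_response() is a hypothetical helper, the FastCGI transport around it is left out, and answering a failed stat() with a 404 status is an assumption.

```c
#include <stdio.h>
#include <sys/stat.h>

/*
 * Build the response for one request. Hypothetical helper, not taken from
 * fcgi-stat-accel. Returns the number of bytes written to out, or -1 if
 * the buffer is too small.
 */
static int build_stat_accel_response(const char *path, char *out, size_t outlen) {
    struct stat st;
    int n;

    /* the blocking call happens here, in the backend, not in lighttpd */
    if (stat(path, &st) == 0) {
        /* the inode is in the fs-buffers now; hand the file back to lighttpd */
        n = snprintf(out, outlen, "X-LIGHTTPD-send-file: %s\r\n\r\n", path);
    } else {
        /* assumption: report a file we cannot stat as a 404 */
        n = snprintf(out, outlen, "Status: 404\r\n\r\n");
    }

    /* snprintf reports truncation by returning a length >= outlen */
    return (n < 0 || (size_t)n >= outlen) ? -1 : n;
}
```

The backend never opens or reads the file. Its only job is to absorb the blocking stat() so that the file transfer itself stays in lighttpd.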
All this is documented by darix in the wiki at HowtoSpeedUpStatWithFastcgi
This works with mod_fastcgi in 1.4.0 or with mod-proxy-core in 1.5.0 + aio.
For 1.5.0 I added fcgi-stat-accel to svn and to the cmake build.
I start it on port 1029 as a first test round. The -C 1 is to start only one thread in the back, to see the impact later.
$ ./build/spawn-fcgi -f ./build/fcgi-stat-accel -p 1029 -C 1
On the lighttpd side, the config has to enable X-Sendfile and keep a few connections open in the pool.
$SERVER["socket"] == ":1025" {
  $HTTP["url"] =~ "^/seek-bound/" {
    proxy-core.protocol = "fastcgi"
    proxy-core.backends = ( "127.0.0.1:1029" )
    proxy-core.allow-x-sendfile = "enable"
    proxy-core.max-pool-size = 20
  }
}
As test environment I used 100k files as in the other tests (10G of data overall).
$ http_load -parallel 200 -seconds 60 urls.100k
iostat said:
$ iostat -xm 5
avg-cpu: %user %nice %system %iowait %steal %idle
9.20 0.00 45.80 45.00 0.00 0.00
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s rMB/s wMB/s avgrq-sz avgqu-sz await svctm %util
sda 0.00 0.00 73.00 0.00 13278.40 0.00 6.48 0.00 181.90 7.09 98.30 13.71 100.08
sdb 0.00 0.00 69.20 0.00 12625.60 0.00 6.16 0.00 182.45 13.63 194.71 14.46 100.08
We are limited by the disks now. Perhaps we can reduce the CPU usage a bit more by using unix domain sockets instead of TCP:
avg-cpu: %user %nice %system %iowait %steal %idle
8.19 0.00 38.56 53.25 0.00 0.00
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s rMB/s wMB/s avgrq-sz avgqu-sz await svctm %util
sda 0.00 1.00 67.63 4.30 12533.07 47.95 6.12 0.02 174.91 10.28 144.44 13.89 99.90
sdb 0.00 1.00 66.13 4.30 12442.76 47.95 6.08 0.02 177.35 11.92 168.46 14.18 99.90
The system time drops by about 7 points (45.80% to 38.56%), good enough.
Summary
Thanks to Fobax's great idea I can finally max out my two disks. If you have more disks the impact will be a lot larger. Give it a try.
name               Throughput    util%
-----------------  ------------  -----
no stat-accel      12.07MByte/s  81%
stat-accel (tcp)   13.64MByte/s  99%
stat-accel (unix)  13.86MByte/s  99%