reducing Requests-Setup-Costs

Back in the times of the first implementations of mod-cml I took the request setup costs as the root of all evil. They were the problem I wanted to fix with mod-cml.

But what is the request setup cost ? What is influencing the request-time ? Where can you influence it ?

When you send a request to the browser it has to:

receive the request
parse the request
connect the a back-end
send the request to the back-end
wait for a response
receive the response and send it to the client

Using ab to fire the same request to the back-ends we should hit the caches very well and get a good idea what is left when we have hot caches:

$ ab -n 100 -c 1 http://127.0.0.1:1025/123.php
...
Time per request:       7.515 [ms] (mean) [strace]
...

ab was run against a lighty 1.4.13-r1385 running in strace. We will run all tests with this setup.

$ strace -o costs.txt -tt -s 512 lighttpd -D -f ./lighttpd.conf
...:26.574274 ... (last syscall of the previous request)
...:26.575448 accept(...) = 8
...:26.576006 read(8, ...)
...:26.576702 connect(9, ... )
...:26.577239 writev(9, ... )
...:26.579688 read(9, ...)
...:26.580459 writev(8, ... )
...:26.581128 close(8)

1.2ms for accept()ing the connection
0.5ms for reading the request from the client.
0.7ms for connecting to the backend over a unix-socket
0.5ms to writev() the request data to the backend
2.4ms to wait for the backend and reading the response
0.8ms for writev()ing the response to the client
0.5ms to close the connection again

(sum is 6.8ms)

Without strace the response-time is:



Time per request:       2.946 [ms] (mean)

and goes down to 0.5ms for the request in lighty. The 2.4ms in the backend stay the same.

Keep-Alive

The next attempt is using keep-alive to get rid of the accept() and the close() at the end. In the strace-timing it costed 1.7ms to execute those two calls.

$ ab -n 100 -c 1 -k http://127.0.0.1:1025/foo.php
...
Time per request:       5.564 [ms] (mean) [strace]
...

strace tells us:



…:03.242201 … (the last syscall of the previous request)

…:03.242903 read(8, …)

…:03.243703 connect(9, …)

…:03.244144 writev(9, …)

…:03.246396 read(9, …)

…:03.246969 writev(8, …)

…:03.247261 … (last syscall of this request)

0.5ms for the read() of the request
0.8ms for the connect()
0.4ms for the writev() to the backend
2.2ms waiting + read()ing the response
0.7ms writev()ing the response to the client

(sum is 4.6ms)

Time per request:       2.394 [ms] (mean)

The 2.2ms seen here are spent in the backend and are not affected by the strace. It is the time spent in the backend. If you take them out of the calculation you get:

  100 * (1 - ((2.4ms - 2.2ms) / (2.9ms - 2.5ms))) = 50%
  100 * (1 - ((4.6ms - 2.2ms) / (6.8ms - 2.5ms))) = 44%

50% saving the lighty internal costs. But those 50% saving are only 10% in reality as most of the request-time is spent in the backend.

The Backend

The limiting factor for low request-times is a backend with as little overhead as possible. For the above timings I used:

PHP 5.1.4
XCache 1.0.x

which is executing the script:

<?php echo "123" ?>

What is PHP doing in those 2.2-2.5ms ? strace will help us again.

...:58.351886 ... (last syscall of the previous request)
...:58.352020 accept(0, ...) = 4
...:58.353110 read(4, ...)
...:58.353260 stat() 
...:58.354431 open(...)
...:58.355489 read(5, ...)
...:58.357595 write(4, ...)
...:58.359911 close(4)

0.2ms for the accept()
1.1ms to read the whole fastcgi-request
0.1ms for the stat() to see of the file exists
1.2ms to open the file
1.0ms to read it
2.1ms to execute the script and send the response
2.4ms to close the connection

We have 2 options:

2.6ms are spend to bring up and shutdown the connection. FastCGI has the possibility to use keep-alive. This will be supported lighty 1.5.x
2.2ms are spent in getting the script-file into PHP.

PHP without a byte-code cache is reading twice

I wonder why there is a open()+read() on the file at all as I used XCache 1.0.x here as code-cache and the file is on the cache and is getting cache-hits.

If I remove xcache, the same php-file is read twice by PHP:

open("..../foo.php", O_RDONLY) = 4
fstat64(4, {st_mode=S_IFREG|0664, st_size=26, ...}) = 0
...
fstat64(4, {st_mode=S_IFREG|0664, st_size=26, ...}) = 0
read(4, "<?php\n    print \"123\";\n?>\n", 4096) = 26
_llseek(4, 0, [0], SEEK_SET) = 0
read(4, "<?php\n    print \"123\";\n?>\n", 8192) = 26
read(4, "", 4096) = 0
read(4, "", 8192) = 0
close(4)          = 0

The php-file is open()ed, stat()ed twice [why ?] and read() once, all 26 bytes.

Afterwards we seek to the start (_llseek()), read() the 26 bytes again (the same size that the fstat() told us) and have to call read() two times to really understand that there is really no more data.

update After discussing this with the php core devs it is a feature of the FastCGI/CGI SAPI. As old CLI scripts can be executed with the CGI SAPI it has to support the “#!/usr/bin/php” sequence for shell scripts. For webapps this line has to be skipped. Otherwise it would be printed to the output.

Removing the code gives us:



Time per request:       2.150 [ms] (mean)

Getting Raw

Perhaps it is better to stay away from PHP and use a language which is not providing use my with all the nice features we like about PHP:

the automatic parsing of GET parameters
the mapping from var[] into arrays
file-upload support in the back
output buffering, compression
internal variables like $PHP_SELF, …

As I already wrote a byte-code cache for mod-magnet I stretched the idea a bit and took the byte-code cache, wrapped a FCGI_accept() from the fastcgi-lib around it.

The lua-fastcgi-magnet is really simple and just does:

creates a global lua environment
calls FCGI_accept()
loads the script from the script-cache
creates a empty-script env for the script
registers the print() function to use the fastcgi stdio wrapper
executes the script
goes to FCGI_accept() again

I want to execute the same script and see how the response time is now:

Time per request:       1.594 [ms] (mean)

We saved 0.8ms or 30% of the whole request time.

The strace for magnet is alot simpler:



…:28.541361 … (last call of previous request) 

…:28.541507 accept(0, …) = 3

…:28.541916 read(3, …)

…:28.542397 stat64(…)

…:28.542809 write(3, …)

…:28.544807 close(3)

0.2ms for the accept()
0.4ms for the read()
0.4ms for the stat()
0.5ms to execute the script and write the response
2.0ms to wait for the final packet and closing the connection

If we now had keep-alive in our FastCGI implementation …

We already saw, that the script executing is 0.8ms faster in real-time. If we attach strace to lighty again and run ab, we get what we expect:

Time per request:       4.800 [ms] (mean) [strace]

The 5.6ms from PHP in strace with keep-live is 0.8ms more.

a core-magnet

If you really want spend even less time and don’t have to wait on any external sources you can also try to use mod-magnet to execute the script.

Time per request:       0.584 [ms] (mean)

As I save the connect to the backend and transferring the data to it I can response blazingly fast.

Conclusion

Let’s ask the final question: why do I care about those 0.8ms at all ?

PHP without keep-alive	2.9ms	345 req/s
PHP	2.4ms	416 req/s
PHP patched	2.1ms	465 req/s
lua-fastcgi	1.6ms	625 req/s
mod-magnet	0.6ms	1700 req/s

As the print “123” is the smallest script which really generates output it should show that this is highest request count you can get. You can’t get more than those 416 req/s with PHP from my test machine:

AMD Duron 1.3GHz
640MB RAM DDR
Disk and Network don’t matter as the benchmark is RAM and loopback based

If you need a higher throughput use mod-magnet to offload some work from the backend. Perhaps you can cache some content and handle the response directly in mod-magnet and only have to escalate in a few percent of the caches.

Or use another scripting language which allows you to work more directly on the FastCGI level like:

lighty's life

lighty developer blog

Reducing Requests-Setup-Costs