<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/css" href="/stylesheets/rss.css"?>
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:trackback="http://madskills.com/public/xml/rss/module/trackback/">
  <channel>
    <title>lighty's life: delay request handling for stupid crawlers</title>
    <link>http://blog.lighttpd.net/articles/2008/08/22/delay-request-handling-for-stupid-crawlers</link>
    <language>en-us</language>
    <ttl>40</ttl>
    <description></description>
    <item>
      <title>delay request handling for stupid crawlers</title>
      <description>&lt;p&gt;I&amp;#8217;m sure you know what &amp;#8220;Crawl-Delay&amp;#8221; is, but you may not know that not all search engine crawlers support this nice stuff.&lt;/p&gt;


	&lt;p&gt;What to do about those that don&amp;#8217;t obey the instruction? They&amp;#8217;ll eat all your Mbits/month or slow your webserver down. OK, ban them with url.access-deny. Until now, this was the only option u could choose. But you don&amp;#8217;t want to remove your pages from the stupid search engine&amp;#8217;s index, do you?&lt;/p&gt;
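

	&lt;p&gt;For reference, a minimal sketch of such a ban, assuming mod_access is loaded. The pattern &amp;#8220;stupid-crawler&amp;#8221; is only a placeholder. The empty string is a suffix of every URL, so every request from a matching crawler should be refused:&lt;/p&gt;


	&lt;pre&gt;&lt;code&gt;# assumption: "stupid-crawler" is a placeholder pattern, not a real bot name
$HTTP["user-agent"] =~ "stupid-crawler" {
    # an empty suffix matches every URL, so all requests are denied
    url.access-deny = ( "" )
}&lt;/code&gt;&lt;/pre&gt;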


	&lt;p&gt;Here comes another option for you: with this patch, u can delay handling of a specified request for some seconds. Example configuration:&lt;/p&gt;


	&lt;pre&gt;&lt;code&gt;$HTTP["user-agent"] =~ "stupid-crawler" {
    connection.delay-seconds = 2
}&lt;/code&gt;&lt;/pre&gt;
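

	&lt;p&gt;The delay can also be tuned per crawler. As a rough sketch, assuming the patch is applied: a bot that keeps c connections open and is delayed d seconds per request averages about c/d requests per second, ignoring the time spent actually serving the request. The names &amp;#8220;bot-a&amp;#8221; and &amp;#8220;bot-b&amp;#8221; and their connection counts are hypothetical; take the real ones from your access log.&lt;/p&gt;


	&lt;pre&gt;&lt;code&gt;# hypothetical bot holding about 20 connections: 20 / 20 s = about 1 req/s
$HTTP["user-agent"] =~ "bot-a" {
    connection.delay-seconds = 20
}
# hypothetical bot holding about 3 connections: 3 / 6 s = about 0.5 req/s
$HTTP["user-agent"] =~ "bot-b" {
    connection.delay-seconds = 6
}&lt;/code&gt;&lt;/pre&gt;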


	&lt;p&gt;OK, here&amp;#8217;s the link to the &lt;a href="http://linuxfire.com.cn/~moo/files/code/lighttpd-2296-request-handle-delay.patch"&gt;lighttpd-2296-request-handle-delay.patch&lt;/a&gt; which applies to &lt;a href="http://trac.lighttpd.net/trac/browser/branches/lighttpd-1.4.x?rev=2296"&gt;branches/lighttpd-1.4.x@2296&lt;/a&gt;&lt;/p&gt;


	&lt;p&gt;Be aware that this patch still has to be reviewed before it is committed to the repo.&lt;/p&gt;
</description>
      <pubDate>Fri, 22 Aug 2008 11:50:00 +0200</pubDate>
      <guid isPermaLink="false">urn:uuid:e6d631a1-ae9f-4c95-b369-960309e743f7</guid>
      <author>moo</author>
      <link>http://blog.lighttpd.net/articles/2008/08/22/delay-request-handling-for-stupid-crawlers</link>
      <category>lighttpd</category>
      <trackback:ping>http://blog.lighttpd.net/articles/trackback/5565</trackback:ping>
    </item>
    <item>
      <title>"delay request handling for stupid crawlers" by Guy</title>
      <description>mOo,&lt;br /&gt;
Thanks for creating lighttpd.  It's really nice.
Typing 'u' instead of 'you' isn't endearing you to anyone and I fear that it would give the impression to some folks that this was put together by an inexperienced, immature kid who couldn't possibly write anything of value.&lt;br /&gt;
&lt;br /&gt;
 i.e. - it could needlessly hinder uptake and I'd hate to see that happen to such a fine software package.</description>
      <pubDate>Sat, 13 Sep 2008 00:16:19 +0200</pubDate>
      <guid isPermaLink="false">urn:uuid:baa1f6c1-e849-4e40-9bf4-157651da0666</guid>
      <link>http://blog.lighttpd.net/articles/2008/08/22/delay-request-handling-for-stupid-crawlers#comment-5635</link>
    </item>
    <item>
      <title>"delay request handling for stupid crawlers" by icy</title>
      <description>Sorry but I don't get it.&lt;br&gt;
m0o has done a great job creating a nice open source application that you get without paying. And you complain about his usage of "u" instead of "you"?&lt;br&gt;
What about saying "thank you" instead? :)</description>
      <pubDate>Wed, 10 Sep 2008 23:12:09 +0200</pubDate>
      <guid isPermaLink="false">urn:uuid:f4bea751-0f1c-4b88-8200-068ef6a67a3c</guid>
      <link>http://blog.lighttpd.net/articles/2008/08/22/delay-request-handling-for-stupid-crawlers#comment-5632</link>
    </item>
    <item>
      <title>"delay request handling for stupid crawlers" by master@masters.com</title>
      <description>Would it kill you to type out "you" as opposed to "u"?  You come across as a young texting ninja who is out of touch with the English language.</description>
      <pubDate>Wed, 10 Sep 2008 19:42:30 +0200</pubDate>
      <guid isPermaLink="false">urn:uuid:36d54ce4-c49f-4157-8b80-dc55f515aa75</guid>
      <link>http://blog.lighttpd.net/articles/2008/08/22/delay-request-handling-for-stupid-crawlers#comment-5631</link>
    </item>
    <item>
      <title>"delay request handling for stupid crawlers" by mOo</title>
      <description>it is not using a hash table or so, but neither is the handling of the timer timeout and the per-connection timers. see trunk/src/server.c: if (handle_sig_alarm) { ... for (ndx = 0; ndx &amp;lt; conns-&amp;gt;used; ndx++) {</description>
      <pubDate>Thu, 28 Aug 2008 02:37:46 +0200</pubDate>
      <guid isPermaLink="false">urn:uuid:0882f87b-d844-46ce-b98c-7cd7dfb87c25</guid>
      <link>http://blog.lighttpd.net/articles/2008/08/22/delay-request-handling-for-stupid-crawlers#comment-5586</link>
    </item>
    <item>
      <title>"delay request handling for stupid crawlers" by fcicq</title>
      <description>I found that mod_evasive (in lighty) is not using a hashtable. I ran a benchmark of apache 2.0 + mod_evasive against lighty with mod_evasive: with few concurrent connections both work well, but with many connections apache works great while lighty performs poorly.

@mOo: http code 503 is only for robots :) &amp; if a robot receives an http 503, it will know the page exists and crawl it later.</description>
      <pubDate>Mon, 25 Aug 2008 16:41:44 +0200</pubDate>
      <guid isPermaLink="false">urn:uuid:81f99d56-07bf-4647-a377-1ed5063a4ab4</guid>
      <link>http://blog.lighttpd.net/articles/2008/08/22/delay-request-handling-for-stupid-crawlers#comment-5578</link>
    </item>
    <item>
      <title>"delay request handling for stupid crawlers" by Jan Berger</title>
      <description>awesome hint to stop the stupid MSN crawlers as they are still running amok on my site. they produce more hits than my normal users :) </description>
      <pubDate>Sun, 24 Aug 2008 23:04:42 +0200</pubDate>
      <guid isPermaLink="false">urn:uuid:b24ab992-1e0a-4727-b7dd-df46025f8909</guid>
      <link>http://blog.lighttpd.net/articles/2008/08/22/delay-request-handling-for-stupid-crawlers#comment-5575</link>
    </item>
    <item>
      <title>"delay request handling for stupid crawlers" by mOo</title>
      <description>"or if the rate of the same IP / Useragent is too high, we can randomly return http code 503? I think it won't hurt. " - u may try mod_evasive.c, which does not support "randomly" however. yet i don't think "randomly" is meaningful here.</description>
      <pubDate>Fri, 22 Aug 2008 15:43:41 +0200</pubDate>
      <guid isPermaLink="false">urn:uuid:59542d3f-411c-418a-a1ef-4a05b76fda7b</guid>
      <link>http://blog.lighttpd.net/articles/2008/08/22/delay-request-handling-for-stupid-crawlers#comment-5574</link>
    </item>
    <item>
      <title>"delay request handling for stupid crawlers" by mOo</title>
      <description>reply to "It seems to me that delaying the request by n seconds will not change the overall rate of crawling - it'll just offset the responses by n seconds, and increase the concurrent connections (and thus memory usage) to the server during that time"

this may not be true. i don't think any crawler will open 1024 connections to you. well, i mean, they can make fewer connections if you're fast, or more if you're slow, but they'll open __at most__ like 10 or 20 connections no matter how slowly your server responds. or they even open 20 connections to you no matter how fast it is. who knows.

a connection is a cheap resource in lighttpd itself because it's just light. this is not true for apache/php/mysql. i once had usleep(200) or so on the php side for specified bots; it helped a lot in reducing system load, but yes, it wastes some memory by idling, as php is heavy.

as i said, this patch is to deal with bots, not evil guys who use a bot as a tool and adjust the concurrent connection count. u can analyze most of the crawler bots that hit your server and figure out their pattern. limit them each individually, e.g.: for bot a, which opens 20 connections, delay each request for 20 seconds, so it is 1 req/s on average. for a bot that opens 3 connections, delay 6 seconds and it will be 0.5 req/s. perhaps one day i can implement a score table so requests can be summed up for each ip, and u can just set n req per m seconds in config.</description>
      <pubDate>Fri, 22 Aug 2008 15:42:21 +0200</pubDate>
      <guid isPermaLink="false">urn:uuid:3377bea3-538c-4fe4-929e-2b7695cba67c</guid>
      <link>http://blog.lighttpd.net/articles/2008/08/22/delay-request-handling-for-stupid-crawlers#comment-5573</link>
    </item>
    <item>
      <title>"delay request handling for stupid crawlers" by mOo</title>
      <description>in reply to "Or is there some aspect of the patch that I'm missing": this patch simply delays each individual request by n seconds, regardless of ip or connection (keep-alive). i know it's far from graceful against the dirty clients of the world.

if you need a more advanced version, something like request priority may have to be implemented, and the score algorithm can be made configurable or moved to plugins, blah blah. the hard part is not implementing the "idea" part in c code, but designing the algorithm, writing support functions, finding code injection points, or even rewriting stuff, trying to make it the best practical framework.

however i just made a simple version and i hope it's somehow effective
</description>
      <pubDate>Fri, 22 Aug 2008 15:35:37 +0200</pubDate>
      <guid isPermaLink="false">urn:uuid:db3c56d7-2adc-4470-88bc-33afc0ca8c69</guid>
      <link>http://blog.lighttpd.net/articles/2008/08/22/delay-request-handling-for-stupid-crawlers#comment-5572</link>
    </item>
    <item>
      <title>"delay request handling for stupid crawlers" by fcicq</title>
      <description>I love the idea too. can we have a first-byte.delay-seconds?
or if the rate of the same IP / Useragent is too high, we can randomly return http code 503? I think it won't hurt.</description>
      <pubDate>Fri, 22 Aug 2008 13:47:45 +0200</pubDate>
      <guid isPermaLink="false">urn:uuid:2da21b9b-e2b8-4c9b-b86d-897778ab52fb</guid>
      <link>http://blog.lighttpd.net/articles/2008/08/22/delay-request-handling-for-stupid-crawlers#comment-5568</link>
    </item>
    <item>
      <title>"delay request handling for stupid crawlers" by Paul Annesley</title>
      <description>&lt;p&gt;It seems to me that delaying the request by n seconds will not change the overall rate of crawling - it'll just offset the responses by n seconds, and increase the concurrent connections (and thus memory usage) to the server during that time.&lt;/p&gt;

&lt;p&gt;Or is there some aspect of the patch that I'm missing?&lt;/p&gt;

&lt;p&gt;Also - I seem to remember that some crawlers take site response time into account when ranking results.  If that is the case, this could make the site look awfully slow to the crawler.&lt;/p&gt;

&lt;p&gt;That said - I like the idea :)  But perhaps more of a rate-limiting approach would work better?&lt;/p&gt;</description>
      <pubDate>Fri, 22 Aug 2008 13:14:14 +0200</pubDate>
      <guid isPermaLink="false">urn:uuid:6388ecd6-7ca2-453b-8090-9e985d0080c2</guid>
      <link>http://blog.lighttpd.net/articles/2008/08/22/delay-request-handling-for-stupid-crawlers#comment-5566</link>
    </item>
  </channel>
</rss>
