httpd-dev mailing list archives

From André Warnier
Subject Re: URL scanning by bots
Date Wed, 01 May 2013 00:08:54 GMT
Ben Reser wrote:
> On Tue, Apr 30, 2013 at 3:03 AM, André Warnier <> wrote:
>> Let us imagine for a moment that this suggestion is implemented in the
>> Apache webservers,
>> and is enabled in the default configuration.  And let's imagine that after a
>> while, 20% of
>> the Apache webservers deployed on the Internet have this feature enabled,
>> and are now
>> delaying any 404 response by an average of 1000 ms.
>> And let's re-use the numbers above, and redo the calculation.
>> The same "botnet" of 10,000 bots is thus still scanning 300 Million
>> webservers, each bot
>> scanning 10 servers at a time for 20 URLs per server.  Previously, this took
>> about 6000
>> seconds.
>> However now, instead of an average delay of 10 ms to obtain a 404 response,
>> in 20% of the
>> cases (60 Million webservers) they will experience an average 1000 ms
>> additional delay per
>> URL scanned.
>> This adds (60,000,000 / 10 * 20 URLs * 1000 ms) 120,000,000 seconds to the
>> scan.
>> Divided by 10,000 bots, this is 12,000 additional seconds per bot (roughly 3
>> 1/2 hours).
> Let's assume that such a feature gets added, however it's not likely
> going to be the default feature.  There are quite a few places that
> serve a lot of legitimate soft 404s for reasons that I'm not going to
> bother to get into here.

Could you actually give an example of such a "legitimate" use case?
(I am not saying that you are wrong, it's just that I genuinely cannot think of such a case)

One comment apart from that: if there are indeed such sites, I would imagine that they
are of the kind which is professionally managed, and that in that case it would not be
difficult for the administrator to disable (or tune) the feature.

> Any site that goes to the trouble of enabling such a feature is
> probably not going to be a site that is vulnerable to what these
> scanners are looking for.  So if I was a bot writer I'd wait for some
> amount of time and if I didn't have a response I'd move on.  I'd also
> not just move along with the next scan on your web server, I'd
> probably just move on to a different host.  If nothing else a sever
> that responds to request slowly is not likely to be interesting to me.
> As a result I'd say your suggestion if wildly practiced actually helps
> the scanners rather than hurting them, because they can identify hosts
> that are unlikely to worth their time scanning with a single request.

Assuming that you meant "widely"...

Allow me to reply to that (worthy) objection:

In the simple calculations which I indicated initially, I omitted the impact of network
latency, and I used a single figure of 10 ms to estimate the average response time of a
server (for a 404 response).

According to my own experiments, the average network latency to reach Internet servers
(even with standard pings) is on the order of at least 50 ms, and that is for
well-connected servers.
So from the bot client's point of view, you would have to add at least 50 ms on average
to the basic server response time for a single request.
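
To put a rough number on that, here is the same back-of-the-envelope arithmetic as in the
calculation quoted above, applied to the ~50 ms of latency alone (a small Python sketch;
all the figures are the illustrative ones from that calculation):

# Illustrative figures from the calculation quoted above.
bots            = 10_000        # bots in the botnet
servers         = 300_000_000   # webservers being scanned
parallel        = 10            # servers each bot scans at a time
urls_per_server = 20            # URLs probed on each server
latency_ms      = 50            # average network round-trip per request

# Same formula as the quoted calculation, applied to the latency term
# that the original 10 ms estimate left out.
extra_ms      = servers / parallel * urls_per_server * latency_ms
extra_per_bot = extra_ms / 1000 / bots
print(f"about {extra_per_bot:,.0f} extra seconds per bot "
      f"({extra_per_bot / 3600:.1f} hours) from latency alone")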

On the other hand, let me digress a bit to introduce the rest of the answer.

My professional specialty is information management, and many of my customers have
databases containing URL links to reference pages on the WWW, which they maintain and
provide to their own internal users. From time to time we need to go through their
databases, and verify that the links which they have stored are still current.
So for these customers we are regularly running programs of the "URL checker" type. These
are in a way similar to URL-scanning bots, except that they target a longer list of URLs
(usually several hundred or thousand), usually distributed over many servers, and these
are real URLs that work (or worked at some point in time).

So anyway, these programs try a long list of WWW URLs and check the type of response
that they get: if they get a 200, then the link is ok; if they get almost anything else,
then the link is flagged as "dubious" in the database, for further manual inspection.
Since the program needs to scan many URLs in a reasonable time, it has to use a timeout
for each URL that it is trying to check. For example, it will issue a request to a
server, and if it does not receive a response within (say) 5 seconds, it gives up and
flags the link as dubious.
Over many runs of these programs, I have noticed that if I set this timeout much below
5 seconds (say 2 seconds), then I get on the order of 30% or more "false dubious" links.
In reality most of these are working links; it just so happens that many servers
occasionally do not respond within 2 seconds. (And if I re-run the same program with the
same parameters immediately afterward, I will again get about 30% slow links, and many
of them will be different from the previous run.)
Obviously I cannot use such a short timeout, because it would mean that my customer has
to check hundreds of URLs by hand afterward. So the timeout is set at 5 seconds, a value
obtained empirically after many, many runs.
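
To make the above concrete, here is a minimal sketch of that kind of checker (Python,
standard library only; the function name and the example URLs are mine, and a real run
would of course iterate over the customer's database rather than a hard-coded list):

# Minimal URL-checker sketch: anything that is not a clean 200, or that does not
# answer within the timeout, is flagged as "dubious" for manual inspection.
import socket
import urllib.error
import urllib.request

TIMEOUT_S = 5  # empirically, shorter timeouts flag far too many working links

def check_link(url, timeout=TIMEOUT_S):
    """Return 'ok' for a 200 response, 'dubious' for anything else."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return "ok" if resp.status == 200 else "dubious"
    except (urllib.error.URLError, socket.timeout, ValueError):
        # 4xx/5xx responses, DNS failures, timeouts, malformed URLs, ...
        return "dubious"

if __name__ == "__main__":
    for url in ("https://example.com/", "https://example.com/no-such-page"):
        print(url, "->", check_link(url))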

What I am leading to is this: if the time by which each 404 response is delayed is
randomly variable, for example between 50 ms and 2 seconds, then it is very difficult
for a bot to determine whether this is a "normal" delay simply due to the load on the
webserver at that particular time, whether it is deliberate, or whether this server is
just slow in general.
And if the bot gets a first response which is fast (or slow), that doesn't really say
anything about how fast or slow the next response will be.

That's what I meant when I stated that this scheme would be hard for a bot to circumvent.

I am not saying that it is impossible, but any scheme to circumvent this would need at 
least a certain level of sophistication, which again raises the cost.
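
As a purely illustrative sketch of the behaviour I have in mind (a toy Python server,
not an httpd patch; the handler name and the KNOWN_PATHS set are made up for the
example):

# Toy demonstration: every request for a path that does not exist gets a 404
# delayed by a random amount between 50 ms and 2 seconds.
import random
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

KNOWN_PATHS = {"/"}  # stand-in for "URLs that really exist on this server"

class DelayedNotFoundHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path in KNOWN_PATHS:
            body = b"hello\n"
            self.send_response(200)
        else:
            # Fresh random delay per request: a single sample tells the bot
            # nothing about whether the server is loaded, slow in general,
            # or deliberately stalling, nor about the next response time.
            time.sleep(random.uniform(0.05, 2.0))
            body = b"not found\n"
            self.send_response(404)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("127.0.0.1", 8080), DelayedNotFoundHandler).serve_forever()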

And now, facetiously again: if what you are writing above about bots detecting this
anyway and consequently avoiding my own websites were correct, then I would be very
happy too, since I would have found a very simple way to make the bots avoid my servers.
