httpd-users mailing list archives

From Michael <mich...@asstr.org>
Subject Re: Wget
Date Mon, 26 Aug 2002 13:34:49 GMT
	This is a good approach :) You don't want to block wget itself
(because there are far worse things out there, most of which
you will never have heard of). What you want to block is the
particular behavior that all web mirroring programs exhibit:
harvesting large blocks of pages (real pages, not
"supplementals" like images, etc.) in a very short period of time.
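
	A minimal sketch of that kind of detection, in Python: count only
real page requests per client IP inside a sliding time window and flag
the client once it exceeds a limit. The class name, the thresholds and
the list of "supplemental" file extensions are illustrative assumptions,
not something taken from mod_throttle or robotcop.

```python
import re
from collections import defaultdict, deque

# Hypothetical thresholds; tune them for your own traffic.
WINDOW_SECONDS = 60
MAX_PAGES_PER_WINDOW = 30

# Assumed list of "supplemental" extensions that are not counted as pages.
SUPPLEMENTAL = re.compile(r"\.(gif|jpe?g|png|css|js|ico)$", re.IGNORECASE)


class MirrorDetector:
    """Flags clients that fetch many real pages in a short period of time."""

    def __init__(self, window=WINDOW_SECONDS, limit=MAX_PAGES_PER_WINDOW):
        self.window = window
        self.limit = limit
        self.hits = defaultdict(deque)  # ip -> timestamps of recent page hits

    def record(self, ip, path, now):
        """Record one request; return True if ip now exceeds the page limit."""
        # Ignore the query string when deciding whether this is a page.
        if SUPPLEMENTAL.search(path.split("?", 1)[0]):
            return False
        recent = self.hits[ip]
        recent.append(now)
        # Drop page hits that have fallen out of the time window.
        while recent and recent[0] <= now - self.window:
            recent.popleft()
        return len(recent) > self.limit
```

	Fed from a log tail or a request hook, record() would be called
once per request with the client IP, the requested path and a timestamp
in seconds.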

	You can actually use mod_throttle to do that, kind of. The problem
is that if you're seriously interested in blocking that behavior
then chances are you have a lot of traffic, and mod_throttle is
not the best-written module algorithm-wise, nor does it have
some of the features you'll need (like the ability to permanently
exempt certain hosts, such as cache-*.aol.com, from blocking).
robotcop may be a good alternative...
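
	The permanent exemption mentioned above could look roughly like
this in Python: a hostname allow-list with shell-style wildcards. The
names NEVER_BLOCK and is_exempt are hypothetical, and the only pattern
shown is the one from the text. This assumes the hostname comes from a
reverse DNS lookup, which a client can fake unless the name is also
confirmed with a forward lookup.

```python
from fnmatch import fnmatchcase

# Hypothetical allow-list; the AOL cache pattern is the example from the text.
NEVER_BLOCK = ["cache-*.aol.com"]


def is_exempt(hostname, patterns=NEVER_BLOCK):
    """Return True if hostname matches any never-block pattern."""
    # Normalize case and a trailing dot before matching.
    host = hostname.lower().rstrip(".")
    return any(fnmatchcase(host, pattern) for pattern in patterns)
```

	A blocking decision would then check is_exempt() first and skip
the throttle entirely for matching hosts.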

	About anti-mirroring software in general... If you had a site
with, say, 165,000 pieces of erotic literature from the benign to
the bizarre you'd do your best to block mirroring (even though
the site is free and ad-free) because the people doing mirroring either
don't understand that they'll dislike 98% of what they download
or they're trying to set up a mirror. Since many of the authors
who contribute content to my site have specifically requested
that they not be published anywhere else, it's part of my job
to try to ensure that their wishes are fulfilled, so mirrors should
ask for permission - not just attempt to download the entire
site willy-nilly.

	The other thing that really gets me is that none of the mirroring
software I've come across even uses the transparent gzip
compression feature that manages to keep our bandwidth bills
in a reasonable range, so they're "stealing" twice (once for
downloading stuff they'll never read and once for using more
bandwidth to do it than they need to) from legitimate users of
the site.

- Michael

On Mon, 26 Aug 2002, Wolter Kamphuis wrote:

> Hi,
>
> I also had some problems with webspiders. A website I’m running consists
> of many (1500) pages, each showing one image, like a gallery. People who
> wanted to have all the images just let wget do a recursive download of the
> complete website. The result was that almost half of my traffic went to
> those webspiders.
>
> I now use robotcop (http://www.robotcop.org/) to block webspiders. On some
> of my pages (especially dynamic ones) I include a one-pixel image link.
> Everyone following this link will be blocked for two days. Normal browsers
> won't follow this link, so they are unaffected. I catch about 10 to 20
> people a day using wget, Teleport Pro and other such spiders.
>
> However, there are some issues using robotcop. There is always a chance
> you will block innocent users: about one or two of the spiders I catch
> daily are innocent users. There’s not much I can do about it since I don’t
> know why they follow the ‘invisible link’. Still, one or two out of 30k
> visitors isn’t that much.
>
> Also, if you have robotcop behave like a tarpit (very slowly serving crap
> to the clients), every caught spider will occupy one (or more) Apache
> processes. In that case it's easy for someone with the right tools to
> mount a DoS attack against your server. I solved this by building a
> special ‘tarpitd’ daemon that handles the ‘crap serving’. It also helps
> against worms and people trying to scan Apache: scanning my webserver
> takes hours to complete.
>
> mzzl
>   Wolter
>
>
> > Is there a way to protect the websites on my server from someone using
> > Wget??
> >
> > Any help is appreciated.
> >
> > TIA.
> >
> > Tom
> >
> >
> > ---------------------------------------------------------------------
> > The official User-To-User support forum of the Apache HTTP Server
> > Project. See <URL:http://httpd.apache.org/userslist.html> for more info.
> > To unsubscribe, e-mail: users-unsubscribe@httpd.apache.org
> >    "   from the digest: users-digest-unsubscribe@httpd.apache.org
> > For additional commands, e-mail: users-help@httpd.apache.org



