hadoop-common-user mailing list archives

From John Clarke <clarke...@gmail.com>
Subject avoid custom crawler getting blocked
Date Wed, 27 May 2009 08:58:55 GMT
My current project is to gather stats from a lot of different documents.
We are not indexing, just gathering quite specific stats for each document.
We gather 12 different stats from each document.

Our requirements have changed somewhat. Originally it worked on documents
from our own servers, but now it needs to fetch others from quite a large
variety of sources.

My approach up to now was to have the map function simply take each filepath
(or now URL) in turn, fetch the document, calculate the stats and output
those stats.

My new problem is that some of the locations we are now visiting don't like
their IP being hit multiple times in a row.

Is it possible to check a URL against a list of recently visited IPs and, if
its IP was visited recently, either wait for a certain amount of time or push
the URL back onto the input stack so it is processed later in the queue?
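
To make the idea concrete, here is a minimal sketch of the check-and-requeue
logic in plain Java, outside of any Hadoop API. Everything in it is an
assumption for illustration: the class name PoliteScheduler, the fixed
per-host delay, and keying on the host name rather than the resolved IP
(resolving with InetAddress would need network access). Time is passed in
explicitly and each fetch is simulated, so the sketch only shows the ordering
logic, not a real crawler.

```java
import java.net.URI;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Hadoop: tracks when each host was last hit.
public class PoliteScheduler {
    private final long minDelayMillis;
    private final Map<String, Long> lastVisit = new HashMap<>();

    public PoliteScheduler(long minDelayMillis) {
        this.minDelayMillis = minDelayMillis;
    }

    // Milliseconds still to wait before the host may be hit again; 0 if free.
    public long remainingDelay(String host, long nowMillis) {
        Long last = lastVisit.get(host);
        if (last == null) {
            return 0L;
        }
        return Math.max(0L, minDelayMillis - (nowMillis - last));
    }

    public void recordVisit(String host, long nowMillis) {
        lastVisit.put(host, nowMillis);
    }

    // Returns the order in which the URLs would be fetched. A URL whose host
    // was hit too recently is pushed to the back of the queue; if every
    // remaining URL is blocked, the clock is advanced (the "wait" case).
    // Each fetch is simulated as taking fetchMillis.
    public List<String> schedule(List<String> urls, long startMillis, long fetchMillis) {
        Deque<String> queue = new ArrayDeque<>(urls);
        List<String> order = new ArrayList<>();
        long now = startMillis;
        int deferredInARow = 0;
        while (!queue.isEmpty()) {
            String url = queue.pollFirst();
            String host = URI.create(url).getHost();
            long wait = remainingDelay(host, now);
            if (wait > 0 && deferredInARow < queue.size()) {
                queue.addLast(url);      // push back, try another URL first
                deferredInARow++;
                continue;
            }
            now += wait;                 // nothing else left to try: wait it out
            recordVisit(host, now);
            order.add(url);
            now += fetchMillis;
            deferredInARow = 0;
        }
        return order;
    }
}
```

Note that a table like this only sees the URLs handled by one task. With
several map tasks running in parallel, URLs for the same host would probably
have to be grouped into the same task (for example by partitioning on host)
for the delay to mean anything.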

Or is there a better way?

