hadoop-common-user mailing list archives

From Tom White <...@cloudera.com>
Subject Re: avoid custom crawler getting blocked
Date Wed, 27 May 2009 10:02:37 GMT
Have you had a look at Nutch (http://lucene.apache.org/nutch/)? It has
solved this kind of problem.


On Wed, May 27, 2009 at 9:58 AM, John Clarke <clarkemjj@gmail.com> wrote:
> My current project is to gather stats from a lot of different documents.
> We are not indexing, just collecting quite specific stats for each document:
> we gather 12 different stats from each one.
> Our requirements have changed somewhat: originally it worked on documents
> from our own servers, but now it needs to fetch documents from quite a
> large variety of sources.
> My approach up to now has been to have the map function take each filepath
> (or, now, URL) in turn, fetch the document, calculate the stats, and output
> them.
> My new problem is that some of the hosts we are now visiting don't like
> their IP being hit multiple times in a row.
> Is it possible to check a URL's host against a list of recently visited
> IPs and, if it was visited recently, either wait for a certain amount of
> time or push the URL back onto the input queue so it is processed later?
> Or is there a better way?
> Thanks,
> John
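One simple way to realise John's "wait or push back" idea inside a mapper is a small per-host throttle that remembers the last fetch time for each host and reports how long to wait before that host may be hit again. The class below is a minimal sketch under those assumptions; the `HostThrottle` name and its method are invented for illustration and are not Nutch's actual mechanism (Nutch handles politeness with per-host fetch queues and a configurable server delay).

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: tracks the last fetch time per host and tells the
// caller how many milliseconds to wait before hitting that host again.
public class HostThrottle {
    private final long minIntervalMillis;            // minimum gap between fetches to one host
    private final Map<String, Long> lastFetch = new HashMap<>();

    public HostThrottle(long minIntervalMillis) {
        this.minIntervalMillis = minIntervalMillis;
    }

    /**
     * Returns 0 if the host may be fetched now (and records the fetch),
     * otherwise the number of milliseconds the caller should wait
     * (or defer the URL by re-queueing it).
     */
    public synchronized long millisUntilAllowed(String host, long nowMillis) {
        Long last = lastFetch.get(host);
        if (last == null || nowMillis - last >= minIntervalMillis) {
            lastFetch.put(host, nowMillis);          // allowed: record this fetch
            return 0;
        }
        return minIntervalMillis - (nowMillis - last);
    }
}
```

A mapper could call `millisUntilAllowed` before each fetch and either `Thread.sleep` for the returned duration or emit the URL to a later pass. Note this only throttles within one task; coordinating politeness across many map tasks hitting the same hosts is exactly the harder problem Nutch's fetcher is designed for.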
