hadoop-common-dev mailing list archives

From Ted Dunning <tdunn...@maprtech.com>
Subject Re: Web Crawler in hadoop - Unresponsive after a while
Date Fri, 14 Oct 2011 18:30:09 GMT
You would probably be happier using an industrial-strength crawler.

Check out Bixo.

http://bixolabs.com/about/focused-crawler/



On Thu, Oct 13, 2011 at 5:13 PM, Aishwarya Venkataraman <avenkata@cs.ucsd.edu> wrote:

> Hello,
>
> I am trying to make my web crawling go faster with Hadoop. My mapper is
> just a small shell loop and my reducer is an IdentityReducer:
>
> while read -r line; do
>   result="`wget -O - --timeout=500 http://$line 2>&1`"
>   echo "$result"
> done
>
> I am crawling about 50,000 sites, but my mapper always seems to time out
> after some time; I guess the crawler just becomes unresponsive. I am not
> able to see which site is causing the problem, because the mapper deletes
> its output if the job fails. I am currently running a single-node Hadoop
> cluster. Is this the problem?
>
> Did anyone else have a similar problem? I am not sure why this is
> happening. Can I prevent the mapper from deleting its intermediate
> outputs?
>
> I tried running the mapper against 10-20 sites instead of the full 50k,
> and that worked fine.
>
> Thanks,
> Aishwarya Venkataraman
> avenkata@cs.ucsd.edu
> Graduate Student | Department of Computer Science
> University of California, San Diego
>
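
The hang you describe is consistent with the streaming task being killed for
lack of progress: a task that produces no output and reports no status for
longer than mapred.task.timeout (600 seconds by default) gets killed, and with
--timeout=500 plus wget's default retries a single dead host can easily go
longer than that between output lines. Below is a minimal sketch of a hardened
mapper; the 30-second per-fetch timeout, the key/value layout, and the counter
names are illustrative choices, not something taken from your job.

#!/usr/bin/env bash
# mapper.sh - sketch of a streaming mapper with a short per-URL timeout and
# progress reporting; input lines are assumed to be bare hostnames.
count=0
while read -r line; do
  # Keep the per-fetch timeout well under mapred.task.timeout and disable
  # wget's retries so one unresponsive site cannot stall the whole task.
  body="$(wget -O - --timeout=30 --tries=1 "http://$line" 2>&1)"
  status=$?

  # Emit the host as the key so a problematic site is visible in the output.
  printf '%s\t%s\t%s\n' "$line" "$status" "${#body}"

  # Streaming picks up lines of this form on stderr as status and counter
  # updates, which counts as progress and keeps the task from being killed.
  count=$((count + 1))
  echo "reporter:status:fetched $count sites, last=$line" >&2
  echo "reporter:counter:crawl,fetched,1" >&2
done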

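On keeping intermediate output around: the output of a failed task is normally
discarded, but the job can be asked to preserve a failed task's working files
for inspection. A sketch of the submission, assuming the streaming contrib jar
and with illustrative paths and timeout value:

hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
  -D keep.failed.task.files=true \
  -D mapred.task.timeout=1800000 \
  -input crawl/sites.txt \
  -output crawl/out \
  -mapper mapper.sh \
  -file mapper.sh \
  -reducer org.apache.hadoop.mapred.lib.IdentityReducer

keep.failed.task.files=true leaves a failed task's files under mapred.local.dir
on the node that ran it, and raising mapred.task.timeout (in milliseconds)
gives legitimately slow fetches more headroom. Running on a single node mostly
limits you to that machine's few map slots, which makes the crawl slow but
should not by itself make it hang.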