hadoop-common-dev mailing list archives

From Otis Gospodnetic <otis_gospodne...@yahoo.com>
Subject Re: Web Crawler in hadoop - Unresponsive after a while
Date Fri, 14 Oct 2011 02:03:19 GMT
Aishwarya, you should probably ask on the -user list.
Moreover, you should probably just look at and use Nutch, which uses MR under the hood for
fetching and other tasks - see http://nutch.apache.org/

Sematext :: http://sematext.com/ :: Solr - Lucene - Hadoop - HBase
Hadoop ecosystem search :: http://search-hadoop.com/

>From: Aishwarya Venkataraman <avenkata@cs.ucsd.edu>
>To: common-dev@hadoop.apache.org
>Sent: Thursday, October 13, 2011 8:13 PM
>Subject: Web Crawler in hadoop - Unresponsive after a while
>I am trying to make my web crawling go faster with Hadoop. My mapper just
>consists of a single line and my reducer is an IdentityReducer:
>while read line; do
>  result="`wget -O - --timeout=500 http://$line 2>&1`"
>  echo $result
>done
>I am crawling about 50,000 sites, but my mapper always seems to time out
>after some time. I guess the crawler just becomes unresponsive.
>I am not able to see which site is causing the problem, because the mapper deletes the
>output if the job fails. I am running a single-node Hadoop cluster.
>Is this the problem?
>Did anyone else have a similar problem? I am not sure why this is
>happening. Can I prevent the mapper from deleting intermediate outputs?
>I tried running the mapper against 10-20 sites as opposed to 50k sites, and that
>worked fine.
>Aishwarya Venkataraman
>Graduate Student | Department of Computer Science
>University of California, San Diego
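
One way to find out which site stalls the mapper is to log each host to stderr before fetching it, and to bound every fetch with a short timeout and a single try. The following is only a rough sketch under a few assumptions: the function names fetch_one and crawl, the 30-second timeout, the --run switch, and the counter names are all made up for illustration and are not from the original message. It relies on Hadoop Streaming reading stderr lines of the form reporter:status:<message> and reporter:counter:<group>,<counter>,<amount>, which end up in the task logs and should also show the framework that the task is still alive.

```shell
#!/bin/sh
# Hypothetical Hadoop Streaming mapper: reads one host per line on stdin and
# writes "<host><TAB><page flattened to one line>" on stdout.

# fetch_one is an illustrative helper, not part of Hadoop or wget.
# A short timeout and a single try keep one dead host from stalling the task.
fetch_one() {
  wget -q -O - --timeout=30 --tries=1 "http://$1" 2>/dev/null
}

crawl() {
  local line body
  while IFS= read -r line; do
    # Skip blank input lines.
    [ -z "$line" ] && continue
    # Status line for Hadoop Streaming; it names the host about to be fetched,
    # so the last status seen before a hang points at the culprit.
    echo "reporter:status:fetching $line" >&2
    if body=$(fetch_one "$line" </dev/null); then
      # Replace newlines and tabs with spaces so each page stays one record.
      printf '%s\t%s\n' "$line" "$(printf '%s' "$body" | tr '\n\t' '  ')"
    else
      # Count the failure and emit a marker record instead of aborting.
      echo "reporter:counter:crawler,failed,1" >&2
      printf '%s\tFETCH_FAILED\n' "$line"
    fi
  done
}

# Start crawling only when invoked with --run (an assumed switch), so the
# functions can also be sourced and tried out by hand.
if [ "${1:-}" = "--run" ]; then
  crawl
fi
```

Saved under a name such as crawl.sh (also an assumption), it could be tried locally with `printf 'nutch.apache.org\n' | sh crawl.sh --run` before being passed to the streaming job as the mapper. Failed hosts then show up as FETCH_FAILED records in the output rather than taking the whole task down.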
