hadoop-common-dev mailing list archives

From Aishwarya Venkataraman <avenk...@cs.ucsd.edu>
Subject Re: Web Crawler in hadoop - Unresponsive after a while
Date Fri, 14 Oct 2011 02:51:10 GMT
Thanks Otis, I will check that out.

On Thu, Oct 13, 2011 at 7:03 PM, Otis Gospodnetic <
otis_gospodnetic@yahoo.com> wrote:

> Aishwarya, you should probably ask on the -user list.
> Moreover, you should probably just look at and use Nutch, which uses MR
> under the hood for fetching and other tasks - see http://nutch.apache.org/
>
> Otis
> ----
> Sematext :: http://sematext.com/ :: Solr - Lucene - Hadoop - HBase
> Hadoop ecosystem search :: http://search-hadoop.com/
>
>
>
> >________________________________
> >From: Aishwarya Venkataraman <avenkata@cs.ucsd.edu>
> >To: common-dev@hadoop.apache.org
> >Sent: Thursday, October 13, 2011 8:13 PM
> >Subject: Web Crawler in hadoop - Unresponsive after a while
> >
> >Hello,
> >
> >I am trying to make my web crawling go faster with Hadoop. My mapper is
> >just the short loop below, and my reducer is an IdentityReducer:
> >
> >while read -r line; do
> >  # Fetch one site per input line; wget's stderr is captured with the page.
> >  result="$(wget -O - --timeout=500 "http://$line" 2>&1)"
> >  echo "$result"
> >done
> >
> >I am crawling about 50,000 sites, but my mapper always seems to time out
> >after some time. I guess the crawler just becomes unresponsive.
> >I cannot see which site is causing the problem, because the mapper
> >deletes its output if the job fails. I am currently running a single-node
> >Hadoop cluster. Is this the problem?
> >
> >Has anyone else had a similar problem? I am not sure why this is
> >happening. Can I prevent the mapper from deleting intermediate outputs?
> >
> >I tried running the mapper against 10-20 sites instead of 50k sites, and
> >that worked fine.
> >
> >Thanks,
> >Aishwarya Venkataraman
> >avenkata@cs.ucsd.edu
> >Graduate Student | Department of Computer Science
> >University of California, San Diego
> >
> >
> >
>
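
For reference, here is a minimal sketch of a streaming mapper that makes a
stalled site easier to spot. It is not the mapper from the original post.
The names fetch_url, crawl_stdin, FETCH_TIMEOUT and FETCH_TRIES are
hypothetical, and the 30-second timeout and single try are assumed values
chosen only for illustration. It writes "reporter:status:" and
"reporter:counter:" lines to stderr, which Hadoop Streaming interprets as
status and counter updates, so the task status should show the last host
attempted before a timeout.

```shell
#!/usr/bin/env bash
# Sketch of a Hadoop Streaming mapper: one host name per input line.
# FETCH_TIMEOUT and FETCH_TRIES are hypothetical knobs, not Hadoop settings.
FETCH_TIMEOUT="${FETCH_TIMEOUT:-30}"
FETCH_TRIES="${FETCH_TRIES:-1}"

# Fetch one host and write the page body to stdout.
# wget retries many times by default, so the number of tries is capped here.
fetch_url() {
  wget -q -O - --timeout="$FETCH_TIMEOUT" --tries="$FETCH_TRIES" "http://$1"
}

# Read hosts from stdin and emit "host<TAB>status<TAB>length" per host.
crawl_stdin() {
  local line body
  while IFS= read -r line; do
    # Skip blank input lines.
    [ -z "$line" ] && continue
    # Streaming reads stderr lines of this form as task status updates.
    echo "reporter:status:fetching $line" >&2
    if body="$(fetch_url "$line")"; then
      # Emit the length of the captured page text instead of the page itself.
      printf '%s\tok\t%s\n' "$line" "${#body}"
    else
      # Count failures in a job counter and keep going with the next host.
      echo "reporter:counter:Crawler,FetchFailed,1" >&2
      printf '%s\tfailed\t0\n' "$line"
    fi
  done
}
```

To use it as a mapper, end the script with a call to crawl_stdin. If tasks
still get killed, the properties I would look at in this Hadoop generation
are mapred.task.timeout (how long a task may run without reporting
progress) and keep.failed.task.files (whether files of failed tasks are
kept); please check both against the documentation for your version.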



-- 
Thanks,
Aishwarya Venkataraman
avenkata@cs.ucsd.edu
Graduate Student | Department of Computer Science
University of California, San Diego
