hadoop-common-dev mailing list archives

From Aishwarya Venkataraman <avenk...@cs.ucsd.edu>
Subject Web Crawler in hadoop - Unresponsive after a while
Date Fri, 14 Oct 2011 00:13:37 GMT
Hello,

I am trying to make my web crawling go faster with Hadoop. My mapper is
essentially a single wget call per input line, and my reducer is an IdentityReducer:

while read -r line; do
  # fetch the page, capturing output and errors together
  result=$(wget -O - --timeout=500 "http://$line" 2>&1)
  echo "$result"
done
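
In case it helps, below is a rough sketch of a more defensive version of that
mapper (the script name crawl-mapper.sh and the timeout/retry values are just
placeholders). It tags each output line with the URL and wget's exit status, so
a slow or failing site is visible in the map output, and it writes
reporter:status lines to stderr, which Hadoop Streaming treats as task status
updates, so a long run of slow fetches does not look like a hung task:

#!/bin/bash
# crawl-mapper.sh -- reads one hostname per input line on stdin
while read -r line; do
  # status update to the TaskTracker; Streaming interprets stderr
  # lines of the form "reporter:status:<message>" as task status
  echo "reporter:status:fetching $line" >&2

  # single attempt with a short timeout so one bad site cannot
  # stall the task past the framework's task timeout
  result=$(wget -O - --tries=1 --timeout=60 "http://$line" 2>&1)
  status=$?

  # emit: site, wget exit code, size of what came back
  printf '%s\t%d\t%d\n' "$line" "$status" "${#result}"
done

With the URL and exit code in each output record, the site that keeps hanging
or erroring out shows up directly in the map output instead of being lost with
the failed attempt.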

I am crawling about 50,000 sites, but my mapper always seems to time out
after a while; the crawler just becomes unresponsive, I guess.
I cannot tell which site is causing the problem, because the mapper's output
is deleted when the job fails. I am currently running a single-node Hadoop
cluster.
Could that be the problem?
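
One thing that may be going on: if a task reads no input, writes no output,
and reports no status for longer than the task timeout (mapred.task.timeout,
which I believe defaults to 600,000 ms), the framework kills the attempt. You
can raise it per job when submitting the streaming job; a sketch, with the jar
path and file names as placeholders for your setup:

# raise the task timeout to 30 minutes for this job only
hadoop jar "$HADOOP_HOME"/contrib/streaming/hadoop-streaming-*.jar \
  -D mapred.task.timeout=1800000 \
  -input urls.txt \
  -output crawl-out \
  -mapper crawl-mapper.sh \
  -file crawl-mapper.sh \
  -reducer org.apache.hadoop.mapred.lib.IdentityReducer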

Has anyone else seen a similar problem? I am not sure why this is
happening. Can I prevent the mapper from deleting its intermediate outputs?
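
On keeping the failed output around: as far as I know the old-API property
keep.failed.task.files tells the framework not to clean up a failed attempt's
working files, so they can be inspected on the node's local disk afterwards
(property name as of the 0.20-era releases; worth verifying against your
version's mapred-default.xml). It can be passed the same way as above:

  # keep the working files of failed task attempts for post-mortem inspection
  -D keep.failed.task.files=true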

I tried running the mapper against 10-20 sites instead of 50,000, and that
worked fine.

Thanks,
Aishwarya Venkataraman
avenkata@cs.ucsd.edu
Graduate Student | Department of Computer Science
University of California, San Diego
