hadoop-common-user mailing list archives

From bejoy.had...@gmail.com
Subject Re: Web crawler in hadoop - unresponsive after a while
Date Fri, 14 Oct 2011 04:55:26 GMT
Hi Aishwarya
        To debug this issue you don't necessarily need the intermediate output. If there is
any error/exception, you can get it from your job logs directly. In your case the job
turns unresponsive; to troubleshoot further, you can add log statements to your
program, rerun it, and identify from the logs the records that cause the problem.
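       For example (just a sketch, assuming your mapper is a streaming shell script like
the one in your mail): writing each record to stderr before processing it makes the
record show up in the task's stderr log, so the last logged URL points to the input
that hung.

while read line; do
  echo "fetching: $line" >&2   # stderr goes to the task logs, not the job output
  result="`wget -O - --timeout=500 http://$line 2>&1`"
  echo "$result"
done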
       The most direct way to get at your logs is the JobTracker web UI at http://<host>:50030/jobtracker.jsp.
From your job, drill down to the task; on the right side you will see options to display
the task tracker logs.
       On top of this I'd like to add: since you mentioned a single node, I assume it is
either standalone or pseudo-distributed mode. These setups are basically for development
and testing of functionality. If you are looking for better performance of your jobs, you need to leverage
the parallel processing power of Hadoop. You need at least a small cluster for performance
benchmarking and for processing relatively large volumes of data.

Hope it helps!

------Original Message------
From: Aishwarya Venkataraman
Sender: avenkata@eng.ucsd.edu
To: common-user@hadoop.apache.org
ReplyTo: common-user@hadoop.apache.org
Subject: Web crawler in hadoop - unresponsive after a while
Sent: Oct 14, 2011 08:20

Hello,

I am trying to make my web crawling go faster with Hadoop. My mapper just
consists of a single shell loop and my reducer is an IdentityReducer:

while read line; do
  # fetch each host read from stdin; emit the page (or wget's error output)
  result="`wget -O - --timeout=500 http://$line 2>&1`"
  echo "$result"
done
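For reference, the job is launched through Hadoop streaming roughly like this (the
jar path, script name, and HDFS paths below are approximate, not my exact command):

hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
  -input urls.txt \
  -output crawl-out \
  -mapper mapper.sh \
  -reducer org.apache.hadoop.mapred.lib.IdentityReducer \
  -file mapper.sh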

I am crawling about 50,000 sites, but my mapper always seems to time out
after some time; the crawler just becomes unresponsive, I guess.
I am not able to see which site is causing the problem, as the mapper deletes the
output if the job fails. I am running a single-node Hadoop cluster
currently.
Is this the problem?

Did anyone else have a similar problem? I am not sure why this is
happening. Can I prevent the mapper from deleting intermediate outputs?

I tried running the mapper against 10-20 sites, as opposed to 50k, and that
worked fine.

Thanks,
Aishwarya



Regards
Bejoy K S