hadoop-common-user mailing list archives

From Hemanth Yamijala <yhema...@gmail.com>
Subject Re: Hadoop JobTracker Hanging
Date Tue, 22 Jun 2010 17:20:14 GMT
There was also https://issues.apache.org/jira/browse/MAPREDUCE-1316,
whose cause hit clusters at Yahoo! very badly last year. The problem
was particularly noticeable when there were lots of jobs with failed
tasks, combined with a specific fix that enabled out-of-band (OOB)
heartbeats. The latter (i.e. the OOB heartbeats patch) is not in 0.20
AFAIK, but the failed tasks alone could still be causing it.


On Tue, Jun 22, 2010 at 3:47 PM, Steve Loughran <stevel@apache.org> wrote:
> Bobby Dennett wrote:
>> Thanks all for your suggestions (please note that Tan is my co-worker;
>> we are both working to try and resolve this issue)... we experienced
>> another hang this weekend and increased the HADOOP_HEAPSIZE setting to
>> 6000 (MB) as we do periodically see "java.lang.OutOfMemoryError: Java
>> heap space" errors in the jobtracker log. We are now looking into the
>> resource allocation of the master node/server to ensure we aren't
>> experiencing any issues due to the heap size increase. In parallel, we
>> are also working on building "beefier" servers -- stronger CPUs, 3x more
>> memory -- for the node running the primary namenode and jobtracker
>> processes as well as for the secondary namenode.
>> Any additional suggestions you might have for troubleshooting/resolving
>> this hanging jobtracker issue would be greatly appreciated.
> Have you tried:
>  * using compressed object pointers on the Java 6 server JVM? They
>    reduce heap space.
>  * bolder: the JRockit JVM. Not officially supported in Hadoop, but I
>    liked using it right up until Oracle stopped giving away the updates
>    with security patches. It has a way better heap, and has had
>    compressed pointers for a long time (== more stable code).
> I'm surprised it's the JT that is OOM-ing; anecdotally it's the NN and
> secondary NN that use more, especially if the files are many and the
> blocksize small. The JT should not be tracking that much data over time.
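
For reference, the two JVM-side settings discussed in this thread (the
6000 MB heap and compressed object pointers) could be sketched in
conf/hadoop-env.sh roughly as below. This is a minimal sketch, assuming the
stock 0.20 start-up scripts, which read HADOOP_HEAPSIZE and
HADOOP_JOBTRACKER_OPTS, and a Sun Java 6 JVM recent enough to support
-XX:+UseCompressedOops (6u14 or later). Nobody in the thread posted their
actual configuration, so treat it as an illustration only.

```shell
# conf/hadoop-env.sh (sketch, not the configuration used in this thread)

# Maximum heap, in MB, for daemons started by the Hadoop scripts.
# 6000 is the value Bobby mentions above.
export HADOOP_HEAPSIZE=6000

# Extra JVM options applied only to the JobTracker.
# -XX:+UseCompressedOops enables compressed object pointers on a 64-bit
# HotSpot JVM; it only takes effect for heaps below roughly 32 GB.
export HADOOP_JOBTRACKER_OPTS="-XX:+UseCompressedOops ${HADOOP_JOBTRACKER_OPTS:-}"
```

Note that HADOOP_HEAPSIZE applies to every daemon launched through the
scripts on that node, so on a master running both the namenode and the
jobtracker, total memory use grows accordingly.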
