hadoop-common-user mailing list archives

From Aaron Kimball <...@cs.washington.edu>
Subject Re: TaskTrackers disengaging from JobTracker
Date Thu, 30 Oct 2008 04:49:57 GMT
Just as I wrote that, Murphy's law struck :) This did not fix the issue 
after all.

I think the problem is occurring because the jobs are consuming a huge amount 
of network bandwidth. What settings (timeouts, thread counts, etc.), if any, 
should I dial up to correct for this?

Thanks,
- Aaron
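
For reference, the knobs most often raised when heartbeats and shuffle traffic 
start timing out under heavy network load are the JobTracker's RPC handler 
count, the TaskTracker's HTTP (map-output) thread pool, and the IPC client's 
connect retries. A hadoop-site.xml sketch follows; the property names are the 
stock 0.18 ones, but the values are only illustrative guesses, not settings 
validated on this cluster:

  <!-- Illustrative values only; tune for your own cluster. -->
  <property>
    <name>mapred.job.tracker.handler.count</name>
    <value>20</value>   <!-- JobTracker RPC handler threads; default 10 -->
  </property>
  <property>
    <name>tasktracker.http.threads</name>
    <value>100</value>  <!-- Jetty threads serving map output; default 40,
                             which matches the "LOW ON THREADS" warning
                             quoted further down in this thread -->
  </property>
  <property>
    <name>ipc.client.connect.max.retries</name>
    <value>20</value>   <!-- IPC connect attempts before giving up; default 10 -->
  </property>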

Aaron Kimball wrote:
> It's a cluster being used for a university course; there are 30 students 
> all running code which (to be polite) probably tests the limits of 
> Hadoop's failure recovery logic. :)
> 
> The current assignment is PageRank over Wikipedia, a 20 GB input corpus. 
> Individual jobs run for ~5--15 minutes each, using 300 map tasks and 50 
> reduce tasks.
> 
> I wrote a patch to address the NPE in JobTracker.killJob() and compiled 
> it against TRUNK. I've put this on the cluster and it's been holding 
> steady for the last hour or so, so that plus whatever other differences 
> there are between 0.18.1 and TRUNK may have fixed things. (I'll submit the 
> patch to the JIRA as soon as it finishes cranking through the JUnit tests.)
> 
> - Aaron
> 
> 
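
The patch itself is not shown in this thread. Purely as an illustration of the 
shape such a fix usually takes -- assuming the NPE comes from killJob() looking 
up a JobID that the JobTracker no longer tracks -- a null guard along these 
lines would avoid the exception. Field and method names follow the 0.18-era 
JobTracker, but this is a hypothetical sketch, not the patch Aaron submitted:

  // Hypothetical sketch only -- not the patch submitted in this thread.
  // Assumes 'jobs' is the JobTracker's JobID -> JobInProgress map.
  public synchronized void killJob(JobID jobid) {
    JobInProgress job = jobs.get(jobid);
    if (job == null) {
      // Job already retired or never existed: log and ignore instead of NPE-ing.
      LOG.info("killJob(): JobID " + jobid + " is not known to the JobTracker");
      return;
    }
    job.kill();
  }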
> Devaraj Das wrote:
>>
>> On 10/30/08 3:13 AM, "Aaron Kimball" <aaron@cloudera.com> wrote:
>>
>>> The system load and memory consumption on the JT are both very close to
>>> "idle" -- I don't think it's overworked.
>>>
>>> I may have an idea of the problem, though. Digging back up a ways 
>>> into the
>>> JT logs, I see this:
>>>
>>> 2008-10-29 11:24:05,502 INFO org.apache.hadoop.ipc.Server: IPC Server handler 4 on 9001, call killJob(job_200810290855_0025) from 10.1.143.245:48253: error: java.io.IOException: java.lang.NullPointerException
>>> java.io.IOException: java.lang.NullPointerException
>>> at org.apache.hadoop.mapred.JobTracker.killJob(JobTracker.java:1843)
>>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:45)
>>> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:37)
>>> at java.lang.reflect.Method.invoke(Method.java:599)
>>> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:452)
>>> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:888)
>>>
>>>
>>>
>>> This exception is then repeated for all the IPC server handlers. So I 
>>> think
>>> the problem is that all the handler threads are dying one by one due 
>>> to this
>>> NPE.
>>>
>>
>> This should not happen -- the IPC handler catches Throwable and handles it.
>> Could you give more details, e.g. the kind of jobs (long/short) you are
>> running, how many tasks they have, etc.?
>>
>>> Is this something I can fix myself, or is a patch available?
>>>
>>> - Aaron
>>>
>>> On Wed, Oct 29, 2008 at 12:55 PM, Arun C Murthy <acm@yahoo-inc.com> 
>>> wrote:
>>>
>>>> It's possible that the JobTracker is under duress and unable to 
>>>> respond to
>>>> the TaskTrackers... what do the JobTracker logs say?
>>>>
>>>> Arun
>>>>
>>>>
>>>> On Oct 29, 2008, at 12:33 PM, Aaron Kimball wrote:
>>>>
>>>>> Hi all,
>>>>> I'm working with a 30-node Hadoop cluster that has just started
>>>>> demonstrating some weird behavior. It has run without incident for a few
>>>>> weeks... and now:
>>>>>
>>>>> The cluster will run smoothly for 90--120 minutes or so, handling jobs
>>>>> continually during this time. Then suddenly all 29 TaskTrackers will get
>>>>> disconnected from the JobTracker. All the tracker daemon processes are
>>>>> still running on each machine, but the JobTracker will say "0 nodes
>>>>> available" on the web status screen. Restarting MapReduce fixes this for
>>>>> another 90--120 minutes.
>>>>>
>>>>> This looks similar to https://issues.apache.org/jira/browse/HADOOP-1763,
>>>>> but we're running on 0.18.1.
>>>>>
>>>>> I found this in a TaskTracker log:
>>>>>
>>>>> 2008-10-29 09:49:03,021 ERROR org.apache.hadoop.mapred.TaskTracker: Caught exception: java.io.IOException: Call failed on local exception
>>>>>   at java.lang.Throwable.<init>(Throwable.java:67)
>>>>>   at org.apache.hadoop.ipc.Client.call(Client.java:718)
>>>>>   at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216)
>>>>>   at org.apache.hadoop.mapred.$Proxy1.heartbeat(Unknown Source)
>>>>>   at org.apache.hadoop.mapred.TaskTracker.transmitHeartBeat(TaskTracker.java:1045)
>>>>>   at org.apache.hadoop.mapred.TaskTracker.offerService(TaskTracker.java:928)
>>>>>   at org.apache.hadoop.mapred.TaskTracker.run(TaskTracker.java:1343)
>>>>>   at org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:2352)
>>>>> Caused by: java.io.IOException: Connection reset by peer
>>>>>   at sun.nio.ch.FileDispatcher.read0(Native Method)
>>>>>   at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:33)
>>>>>   at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:234)
>>>>>   at sun.nio.ch.IOUtil.read(IOUtil.java:207)
>>>>>   at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:236)
>>>>>   at org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:55)
>>>>>   at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:140)
>>>>>   at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:150)
>>>>>   at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:123)
>>>>>   at java.io.FilterInputStream.read(FilterInputStream.java:127)
>>>>>   at org.apache.hadoop.ipc.Client$Connection$PingInputStream.read(Client.java:272)
>>>>>   at java.io.BufferedInputStream.fill(BufferedInputStream.java:229)
>>>>>   at java.io.BufferedInputStream.read(BufferedInputStream.java:248)
>>>>>   at java.io.DataInputStream.readInt(DataInputStream.java:381)
>>>>>   at org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:499)
>>>>>   at org.apache.hadoop.ipc.Client$Connection.run(Client.java:441)
>>>>>
>>>>>
>>>>> As well as a few of these warnings:
>>>>> 2008-10-29 01:44:20,161 INFO org.mortbay.http.SocketListener: LOW ON THREADS ((40-40+0)<1) on SocketListener0@0.0.0.0:50060
>>>>> 2008-10-29 01:44:20,166 WARN org.mortbay.http.SocketListener: OUT OF THREADS: SocketListener0@0.0.0.0:50060
>>>>>
>>>>>
>>>>>
>>>>> The NameNode and DataNodes are completely fine. It can't be a DNS issue,
>>>>> because all DNS is served through /etc/hosts files. The NameNode and
>>>>> JobTracker are on the same machine.
>>>>>
>>>>> Any help is appreciated.
>>>>> Thanks,
>>>>> - Aaron Kimball
>>>>>
>>>>
>>
>>
