hadoop-mapreduce-user mailing list archives

From daemeon reiydelle <daeme...@gmail.com>
Subject Re: Max Connect retries
Date Mon, 09 Feb 2015 18:23:15 GMT
Are your nodes actually stuck or are you in e.g. a reduce step that is
reading so much data across the network that the node SEEMS unreachable?


Since you mention "gets stuck for a while at 25%", that suggests that
eventually the node finishes up its work ...



“Life should not be a journey to the grave with the intention of arriving
safely in a pretty and well preserved body, but rather to skid in broadside
in a cloud of smoke, thoroughly used up, totally worn out, and loudly
proclaiming ‘Wow! What a Ride!’” - Hunter Thompson

Daemeon C.M. Reiydelle
USA (+1) 415.501.0198
London (+44) (0) 20 8144 9872

On Mon, Feb 9, 2015 at 2:49 AM, Telles Nobrega <tellesnobrega@gmail.com>
wrote:

> Thanks
>
> On Mon Feb 09 2015 at 01:43:24 Xuan Gong <xgong@hortonworks.com> wrote:
>
>>  That is the client connect retry at the IPC level.
>>
>> You can lower the maximum number of retries by setting
>>
>> ipc.client.connect.max.retries.on.timeouts
>>
>> in core-site.xml.
>>
>>
>>  Thanks
>>
>>  Xuan Gong
>>
>>   From: Telles Nobrega <tellesnobrega@gmail.com>
>> Reply-To: "user@hadoop.apache.org" <user@hadoop.apache.org>
>> Date: Saturday, February 7, 2015 at 8:37 PM
>> To: "user@hadoop.apache.org" <user@hadoop.apache.org>
>> Subject: Max Connect retries
>>
>>   Hi, I changed my cluster config so that a failed NodeManager can be
>> detected in about 30 seconds. When I run a wordcount, the reduce gets
>> stuck at 25% for quite a while, and the logs show nodes trying to connect to the
>> failed node:
>>
>>  org.apache.hadoop.ipc.Client: Retrying connect to server: hadoop-telles-844fb3f0-dfd8-456d-89c3-1d7cfdbdcad2/10.3.2.99:49911. Already tried 28 time(s); maxRetries=45
>> 2015-02-08 04:26:42,088 INFO [IPC Server handler 16 on 50037] org.apache.hadoop.mapred.TaskAttemptListenerImpl: MapCompletionEvents request from attempt_1423319128424_0025_r_000000_0. startIndex 24 maxEvents 10000
>>
>> Is this the expected behaviour? Should I change max retries to a lower value? If so, which config is that?
>>
>> Thanks
>>
>>
>>
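
To put rough numbers on the setting named in the thread: the log shows maxRetries=45, and if each connect attempt is assumed to wait out a 20000 ms connect timeout (an assumption here; it is what I understand the default of ipc.client.connect.timeout to be, so check core-default.xml for your Hadoop version), the reducer could plausibly spend many minutes retrying a dead node. The Python sketch below only does that arithmetic and prints a core-site.xml property fragment. The helper names and the lowered value of 5 are hypothetical, chosen for illustration, and not a recommendation.

```python
from xml.sax.saxutils import escape


def worst_case_stall_seconds(max_retries: int, connect_timeout_ms: int) -> float:
    """Approximate upper bound on time spent retrying an unreachable server.

    Assumes every attempt runs into the full connect timeout and ignores
    any other overhead between attempts.
    """
    return max_retries * connect_timeout_ms / 1000.0


def core_site_property(name: str, value) -> str:
    """Render one <property> element as it would appear in core-site.xml."""
    return (
        "<property>\n"
        f"  <name>{escape(name)}</name>\n"
        f"  <value>{escape(str(value))}</value>\n"
        "</property>"
    )


if __name__ == "__main__":
    timeout_ms = 20000  # assumed value of ipc.client.connect.timeout
    for retries in (45, 5):  # 45 is from the log; 5 is an illustrative lower value
        minutes = worst_case_stall_seconds(retries, timeout_ms) / 60
        print(f"maxRetries={retries}: up to about {minutes:.1f} minutes")
    print(core_site_property("ipc.client.connect.max.retries.on.timeouts", 5))
```

Under those assumptions, 45 retries works out to roughly 15 minutes, which would be consistent with a reduce that appears stuck at 25% and then eventually moves on.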
