hadoop-user mailing list archives

From Naganarasimha Garla <naganarasimha...@gmail.com>
Subject Re: node remains unused after reboot
Date Wed, 23 Sep 2015 19:08:19 GMT
Sorry for the late reply; I wanted to give you some search strings for
blacklisting, hence the slight delay.
As Varun mentioned, this looks more like an app blacklisting case.
mapreduce.job.maxtaskfailures.per.tracker is 3 by default, so in the
scenario you describe the node is most likely getting blacklisted for
that job.
You can search the MR AM logs for INFO messages containing the string
"*Blacklisted host <host>*", logged by the RMContainerRequestor class.
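
If grepping by hand is inconvenient, the search above can be sketched in a few lines of Python. This is only a rough helper, not part of Hadoop: the function name blacklisted_hosts is made up for illustration, and the regular expression assumes nothing beyond the message text "Blacklisted host <host>" mentioned above, so adjust it if your AM log lines look different.

```python
import re
from collections import Counter

# Assumption: only the message text "Blacklisted host <host>" is relied upon;
# the timestamp, level, and class prefix of each log line are ignored.
BLACKLIST_RE = re.compile(r"Blacklisted host (\S+)")


def blacklisted_hosts(lines):
    """Count how many times each host appears in a 'Blacklisted host' message.

    `lines` is any iterable of log lines, for example an open file object.
    (Hypothetical helper name, not a Hadoop API.)
    """
    counts = Counter()
    for line in lines:
        match = BLACKLIST_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

Pass it an open MR AM log file, e.g. blacklisted_hosts(open(path, errors="replace")), and check whether the rebooted host shows up in the result.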

*What do these mean?*
As per the defect in *YARN-3990*, if a large number of events are clogged
in the dispatcher queue (as in the sample logs, *Size of event-queue is
14000*), events can get delayed and container assignment is delayed as a
result. From the details you shared, though, that does not seem to be the
case here. But how many finished applications were there? More nodes and
more apps (finished/running) can cause this.
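
To get a quick picture of how far the dispatcher queue backed up, the "Size of event-queue is N" messages can be pulled out of the RM log with a small script. The sketch below is just an illustration: the function names are made up, and it relies only on the message text quoted in this thread, which appears in both log layouts shown.

```python
import re

# Matches the message text seen in both log layouts quoted in this thread.
QUEUE_RE = re.compile(r"Size of event-queue is (\d+)")


def event_queue_sizes(lines):
    """Return every reported event-queue size, in log order.

    `lines` is any iterable of RM log lines, for example an open file object.
    (Hypothetical helper name, not a Hadoop API.)
    """
    sizes = []
    for line in lines:
        match = QUEUE_RE.search(line)
        if match:
            sizes.append(int(match.group(1)))
    return sizes


def peak_event_queue_size(lines):
    """Return the largest reported queue size, or 0 if none was logged."""
    return max(event_queue_sizes(lines), default=0)
```

A peak that stays around 1000 would match what you reported; values in the 14000 range are what was seen in the YARN-3990 case.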

+ Naga

On Wed, Sep 23, 2015 at 11:39 PM, Varun Vasudev <vvasudev@apache.org> wrote:

> Hi Dmitry,
>
> Did you check the MR AM logs to see if the node was blacklisted for too
> many container failures?
>
> -Varun
>
>
>
> On 9/23/15, 12:26 PM, "Dmitry Sivachenko" <trtrmitya@gmail.com> wrote:
>
> >
> >> On 23 Sep 2015, at 7:02, Naganarasimha G R (Naga) <
> garlanaganarasimha@huawei.com> wrote:
> >>
> >> Hi Dmitry,
> >> Seems to be an interesting case; I would like some more clarification in
> this regard:
> >> 1. How many NMs? Is it a heterogeneous cluster, or do all the nodes have the
> same resource capacity? With 3000 cores, if the config is the same, I would expect around
> 100 nodes; am I correct?
> >
> >
> >I have 1 NN (and 1 SNN).
> >To be precise, I have 113 32-core machines assigned to run jobs
> (113*32=3616 total VCores)
> >
> >
> >> 2. How many applications are running and how many have got finished
> (basically available in RM) ? By 35000 you mean finished and running
> applications ?
> >
> >There was 1 application running at that time (with 35000 map tasks)
> >
> >
> >> 3. Do tasks get assigned after some time? Also, is it
> only this host that is not getting assigned, or do other hosts not get any
> containers assigned either?
> >
> >
> >This machine was excluded from running tasks for that job.  It got tasks
> assigned after almost 1.5 hours, when the first job (during which the machine
> failed) had finished and the next job was started; see the timestamps:
> >
> >
> >
> >2015-09-23 01:06:24,656 INFO  [main] nodemanager.NodeStatusUpdaterImpl
> (NodeStatusUpdaterImpl.java:registerWithRM(311)) - Notifying
> ContainerManager to unblock new container-requests
> >2015-09-23 02:29:33,301 INFO  [Socket Reader #1 for port 10007]
> ipc.Server (Server.java:saslProcess(1316)) - Auth successful for
> appattempt_1441808341485_1975_000001 (auth:SIMPLE)
> >
> >
> >Previous job (during which that node rebooted) did not run more tasks on
> this host.
> >
> >
> >>
> >> I suspect this issue might be similar to YARN-3990, hence the above
> questions. Further, you can check the RM logs and let me know whether you see
> logs similar to the ones below:
> >> 2015-07-29 19:39:03,416 | INFO  | AsyncDispatcher event handler | Size
> of event-queue is 14000 | AsyncDispatcher.java:235
> >> 2015-07-29 19:39:03,417 | INFO  | AsyncDispatcher event handler | Size
> of event-queue is 15000 | AsyncDispatcher.java:235
> >
> >
> >There were 2 of these:
> >2015-09-23 00:54:39,623 INFO  [AsyncDispatcher event handler]
> event.AsyncDispatcher (AsyncDispatcher.java:handle(235)) - Size of
> event-queue is 1000
> >2015-09-23 01:06:24,623 INFO  [AsyncDispatcher event handler]
> event.AsyncDispatcher (AsyncDispatcher.java:handle(235)) - Size of
> event-queue is 1000
> >
> >
> >What do these mean?
> >
> >
> >>
> >>
> >> Regards,
> >> + Naga
> >>
> >>
> >> From: Dmitry Sivachenko [trtrmitya@gmail.com]
> >> Sent: Wednesday, September 23, 2015 03:57
> >> To: user@hadoop.apache.org
> >> Subject: node remains unused after reboot
> >>
> >> Hello!
> >>
> >> I am using hadoop-2.7.1. I have a large map job running (total cores
> available on the cluster about 3000, total tasks 35000).
> >> In the middle of this process one server reboots.
> >>
> >> After reboot, nodemanager starts successfully and registers with
> resource manager:
> >> 2015-09-23 01:06:24,656 INFO  [main] nodemanager.NodeStatusUpdaterImpl
> (NodeStatusUpdaterImpl.java:registerWithRM(311)) - Notifying
> ContainerManager to unblock new container-requests
> >>
> >> In YARN web-interface I see this host as active, but VCores used
> remains zero (see screenshot).
> >> But the map job mentioned is still running and has about 12000 pending
> tasks.
> >>
> >> Why does this host not receive tasks to run?
> >>
> >> PS: I recently upgraded from 2.4.1 and I did not notice such a problem
> with 2.4.1: new tasks were spawning immediately after reboot.
> >>
> >> Thanks!
> >>
> >>
> >>
> >>
> >> <Screen Shot 2015-09-23 at 1.22.10.png>
> >
>
>
