Mailing-List: contact user-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@hadoop.apache.org
Received-SPF: pass (athena.apache.org: domain of rocksuser@gmail.com
 designates 209.85.214.170 as permitted sender)
MIME-Version: 1.0
In-Reply-To: 
 <CAKsDg_k0sqNQvkqxqXkkeiPxZ6QemQUeaE6jqtCMNgDJp9=Jbw@mail.gmail.com>
References: 
 <CAKsDg_k0sqNQvkqxqXkkeiPxZ6QemQUeaE6jqtCMNgDJp9=Jbw@mail.gmail.com>
Date: Thu, 3 Apr 2014 11:59:58 -0500
Message-ID: 
 <CAKsDg_=M+z4mhuqhRYQwtVwpoJ9=dB2JhYYSjQp=xqcLjqDPaA@mail.gmail.com>
Subject: Re: YARN App Master logs and other qns
From: Casey K <rocksuser@gmail.com>
To: user@hadoop.apache.org
Content-Type: multipart/alternative; boundary=089e01176279cb073304f6265482

--089e01176279cb073304f6265482
Content-Type: text/plain; charset=ISO-8859-1

I was able to address item (2) below.

Looking through the logs, I noticed that the node manager initiated
shutdown but was killed before it could finish. So I increased the value
of YARN_STOP_TIMEOUT from the default of 5 seconds to 10 seconds, and in
some cases to 30 seconds. Is it normal to need timeouts longer than 10
seconds?
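
For anyone else who hits the same message, here is a minimal sketch of the stop-then-kill pattern that the "did not stop gracefully ... killing with kill -9" message describes. It is not the Hadoop stop script; the names stop_with_timeout and spawn_slow_stopper are hypothetical, and the child process only simulates a daemon that needs some time to shut down. It assumes a POSIX system, where terminate() sends SIGTERM and kill() sends SIGKILL.

```python
import subprocess
import sys

# Child program (illustrative only): on SIGTERM it spends argv[1] seconds
# "shutting down" and then exits cleanly.
CHILD_SOURCE = """
import signal, sys, time

def on_term(signum, frame):
    time.sleep(float(sys.argv[1]))  # simulated shutdown work
    sys.exit(0)

signal.signal(signal.SIGTERM, on_term)
print("ready", flush=True)
while True:
    time.sleep(0.1)
"""


def spawn_slow_stopper(shutdown_secs: float) -> subprocess.Popen:
    """Start a child that needs shutdown_secs seconds to stop after SIGTERM."""
    proc = subprocess.Popen(
        [sys.executable, "-c", CHILD_SOURCE, str(shutdown_secs)],
        stdout=subprocess.PIPE,
        text=True,
    )
    proc.stdout.readline()  # block until the child has installed its handler
    return proc


def stop_with_timeout(proc: subprocess.Popen, timeout_secs: float) -> str:
    """Ask proc to stop; force-kill it if it is still alive after the timeout."""
    proc.terminate()  # polite request (SIGTERM on POSIX)
    try:
        proc.wait(timeout=timeout_secs)
        return "graceful"
    except subprocess.TimeoutExpired:
        proc.kill()  # forced stop (SIGKILL on POSIX)
        proc.wait()
        return "killed"


if __name__ == "__main__":
    # Same child, two different timeouts: one too short, one long enough.
    for timeout in (0.2, 5.0):
        child = spawn_slow_stopper(shutdown_secs=1.0)
        print(f"timeout={timeout}s -> {stop_with_timeout(child, timeout)}")
```

With a child that needs about one second to shut down, the short timeout ends in a forced kill and the longer one lets it exit on its own, which matches what I saw after raising YARN_STOP_TIMEOUT.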

On Mon, Mar 31, 2014 at 2:32 PM, Casey K <rocksuser@gmail.com> wrote:

> Hello,
>
> I am fairly new to the Hadoop framework, so I appreciate your patience in
> case my email is not entirely correct or the terminology is wrong. I have
> a working installation. However, I am facing a few issues:
>
> 1) I have run the Pi example a number of times. The number of slave nodes
> used is 4. Most times the runtime is about 31 secs. Other times, it varies
> widely and goes up to 650 secs. What could be causing this? This is a
> dedicated cluster with no other workloads.
>
> 2) "nodemanager did not stop gracefully after 5 seconds: killing with kill
> -9" Every time during shutdown, the nodemanager is forcibly killed because
> it doesn't respond within 5 seconds. I dug through the logs and didn't find
> anything off. One thing I did find is noted in (3).
>
> 3) I see errors as follows: "2014-03-31 12:27:26,975 ERROR [RMCommunicator
> Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator:
> Container complete event for unknown container id
> container_1396286812424_0001_01_000042" My searches indicate this is
> because the connection to the appmaster is lost. I can't seem to find where
> the appmaster logs are.
>
> 4) Is the proxy server needed? I did not set "yarn.web-proxy.address", and
> so it never starts. My understanding is that it starts as a part of the RM
> in this case.
>
> 5) RDMA-based shuffle - Mellanox seems to have contributed code for RDMA
> shuffle instead of HTTP. Is this part of YARN? If yes, how do I enable it?
> Is UDA required for RDMA shuffle?
>
> 6) If I want to provide support for a new file system, is there a tutorial
> on what needs to be implemented? I found that
> org.apache.hadoop.fs.FileSystem is the class to extend. However, sample
> code or documentation would help.
>
> Appreciate the help.
>
> Regards,
> Casey
>

--089e01176279cb073304f6265482--