hadoop-user mailing list archives

From Krishna Kishore Bonagiri <write2kish...@gmail.com>
Subject Re: Container allocation fails randomly
Date Tue, 17 Sep 2013 09:47:30 GMT
Hi Omkar,

  Thanks for the quick reply, and sorry that I haven't yet been able to
collect the logs you asked for.

  In the meantime, I wanted to check whether the information I have now
gives you a clue. I am seeing the following error in AppMaster.stderr
whenever this failure happens. I don't understand why it occurs: the
getProgress() implementation in my RMCallbackHandler should never return a
negative value. Doesn't this error mean that getProgress() is returning a
negative value?

Exception in thread "AMRM Heartbeater thread" java.lang.IllegalArgumentException: Progress indicator should not be negative
        at com.google.common.base.Preconditions.checkArgument(Preconditions.java:88)
        at org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl.allocate(AMRMClientImpl.java:199)
        at org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl$HeartbeatThread.run(AMRMClientAsyncImpl.java:224)
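[Editor's note: the check that throws here is Guava's Preconditions.checkArgument inside AMRMClientImpl.allocate, which rejects any progress value below zero reported by the AM's getProgress() callback. A common defensive fix is to clamp the computed progress into [0, 1] before returning it, so that transient races or division edge cases (e.g. a zero total) can never surface a negative or NaN value to the heartbeat thread. A minimal sketch, assuming a hypothetical completed/total counter pair rather than Kishore's actual bookkeeping:]

```java
// Hypothetical defensive progress computation for an AM's
// RMCallbackHandler.getProgress(). The names completed/total are
// illustrative; the point is clamping into [0, 1] before returning.
public class ProgressUtil {

    public static float safeProgress(int completed, int total) {
        if (total <= 0) {
            return 0.0f; // avoid division by zero (and 0/0 producing NaN)
        }
        float p = (float) completed / total;
        if (Float.isNaN(p) || p < 0.0f) {
            return 0.0f; // never report a negative or undefined value
        }
        return Math.min(p, 1.0f); // never report more than 100%
    }

    public static void main(String[] args) {
        System.out.println(safeProgress(3, 10));   // normal case
        System.out.println(safeProgress(0, 0));    // clamped edge case
        System.out.println(safeProgress(-1, 10));  // clamped edge case
    }
}
```

[A getProgress() that delegates to a clamp like this cannot trip the "Progress indicator should not be negative" check, which narrows the search to where the counters themselves go wrong.]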

Thanks,
Kishore


On Fri, Sep 13, 2013 at 2:59 AM, Omkar Joshi <ojoshi@hortonworks.com> wrote:

> Can you give more information? Complete logs around this time frame will
> help a lot. Are the containers getting assigned via the scheduler? Is it
> failing when the node manager tries to start the container? I can see that
> the diagnostic message is empty, but do you see anything in the NM logs?
> Also, if there were containers running on the machine before launching new
> ones, were they killed, or are they still hanging around? Can you also try
> applying the patch from https://issues.apache.org/jira/browse/YARN-1053 and
> check whether you see any message?
>
> Thanks,
> Omkar Joshi
> *Hortonworks Inc.* <http://www.hortonworks.com>
>
>
> On Thu, Sep 12, 2013 at 6:15 AM, Krishna Kishore Bonagiri <
> write2kishore@gmail.com> wrote:
>
>> Hi,
>>   I am using 2.1.0-beta and have seen container allocation fail
>> randomly, even when running the same application in a loop. I know the
>> cluster has enough resources to give, because it allocated them for the
>> same application on every other iteration of the loop and ran it
>> successfully.
>>
>>    I see a lot of the following messages in the node manager's log
>> whenever such a failure happens. Any clues as to why this happens?
>>
>> 2013-09-12 08:54:36,204 INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Sending out status for container: container_id { app_attempt_id { application_id { id: 2 cluster_timestamp: 1378990400253 } attemptId: 1 } id: 1 } state: C_RUNNING diagnostics: "" exit_status: -1000
>> [... the same message repeats roughly once per second ...]
>> 2013-09-12 08:54:43,289 INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Sending out status for container: container_id { app_attempt_id { application_id { id: 2 cluster_timestamp: 1378990400253 } attemptId: 1 } id: 1 } state: C_RUNNING diagnostics: "" exit_status: -1000
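[Editor's note: the exit_status: -1000 in these NM status updates corresponds to ContainerExitStatus.INVALID in the YARN API, the placeholder reported while a container has not yet produced a real exit code, so these lines describe a container that is still in C_RUNNING and are routine rather than an error in themselves. A self-contained sketch of how an AM might interpret such values; the constants are copied here for illustration so the snippet compiles without YARN on the classpath, but in a real AM you would use org.apache.hadoop.yarn.api.records.ContainerExitStatus:]

```java
// Mirrors a few constants from org.apache.hadoop.yarn.api.records.ContainerExitStatus
// (values copied locally so this compiles standalone; hypothetical helper).
public class ExitStatusDecoder {
    public static final int SUCCESS = 0;     // container exited cleanly
    public static final int INVALID = -1000; // no exit code yet (still running)
    public static final int ABORTED = -100;  // released or killed by the framework

    public static String describe(int exitStatus) {
        switch (exitStatus) {
            case SUCCESS: return "container exited cleanly";
            case INVALID: return "container still running (no exit code yet)";
            case ABORTED: return "container aborted by the framework";
            default:      return "container exited with code " + exitStatus;
        }
    }

    public static void main(String[] args) {
        // The value seen in the NM log above:
        System.out.println(describe(-1000));
    }
}
```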
>>
>>
>> Thanks,
>> Kishore
>>
>
>
