hadoop-mapreduce-user mailing list archives

From Anfernee Xu <anfernee...@gmail.com>
Subject Re: In Yarn how to increase the number of concurrent applications for a queue
Date Tue, 09 Sep 2014 02:15:13 GMT
It turned out that this was not a configuration issue: a worker thread that
submits jobs to YARN was blocked. See the thread dump below:

"pool-1-thread-160" id=194 idx=0x30c tid=886 prio=5 alive, blocked,
native_blocked
    -- Blocked trying to get lock:
org/apache/hadoop/ipc/Client$Connection@0x1059d0c60[thin lock]
    at __lll_lock_wait+36(:0)@0x340260d594
    at tsSleep+399(threadsystem.c:83)@0x2b2356e5da80
    at jrockit/vm/Threads.sleep(I)V(Native Method)
    at jrockit/vm/Locks.waitForThinRelease(Locks.java:955)[optimized]
    at
jrockit/vm/Locks.monitorEnterSecondStageHard(Locks.java:1083)[optimized]
    at jrockit/vm/Locks.monitorEnterSecondStage(Locks.java:1005)[optimized]
    at
org/apache/hadoop/ipc/Client$Connection.addCall(Client.java:400)[inlined]
    at
org/apache/hadoop/ipc/Client$Connection.access$2500(Client.java:314)[inlined]
    at
org/apache/hadoop/ipc/Client.getConnection(Client.java:1393)[optimized]
    at org/apache/hadoop/ipc/Client.call(Client.java:1318)[inlined]
    at org/apache/hadoop/ipc/Client.call(Client.java:1300)[inlined]
    at
org/apache/hadoop/ipc/ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)[optimized]
    at
$Proxy21.getJobReport(Lcom/google/protobuf/RpcController;Lorg/apache/hadoop/mapreduce/v2/proto/MRServiceProtos$GetJobReportRequestProto;)Lorg/apache/hadoop/mapreduce/v2/proto/MRServiceProtos$GetJobReportResponseProto;(Unknown
Source)
    at
org/apache/hadoop/mapreduce/v2/api/impl/pb/client/MRClientProtocolPBClientImpl.getJobReport(MRClientProtocolPBClientImpl.java:133)[optimized]
    at
sun/reflect/GeneratedMethodAccessor79.invoke(Ljava/lang/Object;[Ljava/lang/Object;)Ljava/lang/Object;(Unknown
Source)[optimized]
    at
sun/reflect/DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)[optimized]
    at java/lang/reflect/Method.invoke(Method.java:597)[inlined]
    at
org/apache/hadoop/mapred/ClientServiceDelegate.invoke(ClientServiceDelegate.java:317)[inlined]
    at
org/apache/hadoop/mapred/ClientServiceDelegate.getJobStatus(ClientServiceDelegate.java:416)[optimized]
    ^-- Holding lock:
org/apache/hadoop/mapred/ClientServiceDelegate@0x10087d788[biased lock]
    at
org/apache/hadoop/mapred/YarnRunner.getJobStatus(TIEYarnRunner.java:522)[optimized]
    at org/apache/hadoop/mapreduce/Job$1.run(Job.java:314)[inlined]
    at org/apache/hadoop/mapreduce/Job$1.run(Job.java:311)[inlined]
    at
jrockit/vm/AccessController.doPrivileged(AccessController.java:254)[inlined]
    at javax/security/auth/Subject.doAs(Subject.java:396)[inlined]
    at
org/apache/hadoop/security/UserGroupInformation.doAs(UserGroupInformation.java:1491)[inlined]
    at org/apache/hadoop/mapreduce/Job.updateStatus(Job.java:311)[optimized]
    ^-- Holding lock: org/apache/hadoop/mapreduce/Job@0x100522fb8[biased
lock]
    at org/apache/hadoop/mapreduce/Job.isComplete(Job.java:599)
    at org/apache/hadoop/mapreduce/Job.waitForCompletion(Job.java:1294)

The lock was held by this thread:

"pool-1-thread-10" id=44 idx=0xb4 tid=736 prio=5 alive, sleeping,
native_waiting
    at pthread_cond_timedwait@@GLIBC_2.3.2+288(:0)@0x340260b1c0
    at eventTimedWaitNoTransitionImpl+46(event.c:93)@0x2b2356cc741f
    at
syncWaitForSignalNoTransition+133(synchronization.c:51)@0x2b2356e5a096
    at syncWaitForSignal+189(synchronization.c:85)@0x2b2356e5a1ae
    at vmtSleep+165(signaling.c:197)@0x2b2356e35ef6
    at JVM_Sleep+188(jvmthreads.c:119)@0x2b2356d6bb7d
    at java/lang/Thread.sleep(J)V(Native Method)
    at
org/apache/hadoop/ipc/Client$Connection.handleConnectionFailure(Client.java:778)[optimized]
    at
org/apache/hadoop/ipc/Client$Connection.setupConnection(Client.java:566)[optimized]
    ^-- Holding lock: org/apache/hadoop/ipc/Client$Connection@0x1059d0c60
[recursive]
    at
org/apache/hadoop/ipc/Client$Connection.setupIOstreams(Client.java:642)[optimized]
    ^-- Holding lock: org/apache/hadoop/ipc/Client$Connection@0x1059d0c60[thin
lock]
    at
org/apache/hadoop/ipc/Client$Connection.access$2600(Client.java:314)[inlined]
    at
org/apache/hadoop/ipc/Client.getConnection(Client.java:1399)[optimized]
    at org/apache/hadoop/ipc/Client.call(Client.java:1318)[inlined]
    at org/apache/hadoop/ipc/Client.call(Client.java:1300)[inlined]
    at
org/apache/hadoop/ipc/ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)[optimized]
    at
$Proxy21.getJobReport(Lcom/google/protobuf/RpcController;Lorg/apache/hadoop/mapreduce/v2/proto/MRServiceProtos$GetJobReportRequestProto;)Lorg/apache/hadoop/mapreduce/v2/proto/MRServiceProtos$GetJobReportResponseProto;(Unknown
Source)
    at
org/apache/hadoop/mapreduce/v2/api/impl/pb/client/MRClientProtocolPBClientImpl.getJobReport(MRClientProtocolPBClientImpl.java:133)[optimized]
    at
sun/reflect/GeneratedMethodAccessor79.invoke(Ljava/lang/Object;[Ljava/lang/Object;)Ljava/lang/Object;(Unknown
Source)[optimized]
    at
sun/reflect/DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)[optimized]
    at java/lang/reflect/Method.invoke(Method.java:597)[inlined]
    at
org/apache/hadoop/mapred/ClientServiceDelegate.invoke(ClientServiceDelegate.java:317)[inlined]
    at
org/apache/hadoop/mapred/ClientServiceDelegate.getJobStatus(ClientServiceDelegate.java:416)[optimized]
    ^-- Holding lock:
org/apache/hadoop/mapred/ClientServiceDelegate@0x1012c34f8[biased lock]
    at
org/apache/hadoop/mapred/YarnRunner.getJobStatus(TIEYarnRunner.java:522)[optimized]
    at org/apache/hadoop/mapreduce/Job$1.run(Job.java:314)[inlined]
    at org/apache/hadoop/mapreduce/Job$1.run(Job.java:311)[inlined]
    at
jrockit/vm/AccessController.doPrivileged(AccessController.java:254)[inlined]
    at javax/security/auth/Subject.doAs(Subject.java:396)[inlined]
    at
org/apache/hadoop/security/UserGroupInformation.doAs(UserGroupInformation.java:1491)[inlined]
    at org/apache/hadoop/mapreduce/Job.updateStatus(Job.java:311)[optimized]
    ^-- Holding lock: org/apache/hadoop/mapreduce/Job@0x1016e05a8[biased
lock]
    at org/apache/hadoop/mapreduce/Job.isComplete(Job.java:599)
    at org/apache/hadoop/mapreduce/Job.waitForCompletion(Job.java:1294)

You can see that the thread holding the lock is sleeping, and the calling
method is Connection.handleConnectionFailure(), so I checked our log file
and realized the connection failure was because the history server is not
available. In my case I did not start the history server at all, because it
is not needed (I disabled log aggregation). So my question is: why was the
job client still trying to talk to the history server even though log
aggregation is disabled?
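
As far as I can tell, this is expected behavior rather than a bug: once the
ApplicationMaster for a job exits (and uberized jobs exit quickly), the
MapReduce client falls back to the JobHistoryServer to fetch the final job
status, and this fallback is independent of log aggregation, which only
controls whether container logs are collected. If that reading is right, the
workaround on my side would be to run a history server and point clients at
it; a minimal sketch, with the host name being a placeholder for my
environment:

  <property>
    <name>mapreduce.jobhistory.address</name>
    <!-- jhs-host is a placeholder; 10020 is the default RPC port -->
    <value>jhs-host:10020</value>
  </property>

and then start the daemon with "sbin/mr-jobhistory-daemon.sh start
historyserver".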

Thanks



On Mon, Sep 8, 2014 at 3:57 AM, Arun Murthy <acm@hortonworks.com> wrote:

> How many nodes do you have in your cluster?
>
> Also, could you share the CapacityScheduler initialization logs for each
> queue, such as:
>
> 2014-08-14 15:14:23,835 INFO
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
> Initialized queue: unfunded: capacity=0.5, absoluteCapacity=0.5,
> usedResources=<memory:0, vCores:0>, usedCapacity=0.0,
> absoluteUsedCapacity=0.0, numApps=0, numContainers=0
> 2014-08-14 15:14:23,840 INFO
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue:
> Initializing default
> capacity = 0.5 [= (float) configuredCapacity / 100 ]
> asboluteCapacity = 0.5 [= parentAbsoluteCapacity * capacity ]
> maxCapacity = 1.0 [= configuredMaxCapacity ]
> absoluteMaxCapacity = 1.0 [= 1.0 maximumCapacity undefined,
> (parentAbsoluteMaxCapacity * maximumCapacity) / 100 otherwise ]
> userLimit = 100 [= configuredUserLimit ]
> userLimitFactor = 1.0 [= configuredUserLimitFactor ]
> maxApplications = 5000 [= configuredMaximumSystemApplicationsPerQueue or
> (int)(configuredMaximumSystemApplications * absoluteCapacity)]
> maxApplicationsPerUser = 5000 [= (int)(maxApplications * (userLimit /
> 100.0f) * userLimitFactor) ]
> maxActiveApplications = 1 [= max((int)ceil((clusterResourceMemory /
> minimumAllocation) * maxAMResourcePerQueuePercent * absoluteMaxCapacity),1)
> ]
> maxActiveAppsUsingAbsCap = 1 [= max((int)ceil((clusterResourceMemory /
> minimumAllocation) *maxAMResourcePercent * absoluteCapacity),1) ]
> maxActiveApplicationsPerUser = 1 [= max((int)(maxActiveApplications *
> (userLimit / 100.0f) * userLimitFactor),1) ]
> usedCapacity = 0.0 [= usedResourcesMemory / (clusterResourceMemory *
> absoluteCapacity)]
> absoluteUsedCapacity = 0.0 [= usedResourcesMemory / clusterResourceMemory]
> maxAMResourcePerQueuePercent = 0.1 [= configuredMaximumAMResourcePercent ]
> minimumAllocationFactor = 0.87506104 [= (float)(maximumAllocationMemory -
> minimumAllocationMemory) / maximumAllocationMemory ]
> numContainers = 0 [= currentNumContainers ]
> state = RUNNING [= configuredState ]
> acls = SUBMIT_APPLICATIONS: ADMINISTER_QUEUE:  [= configuredAcls ]
> nodeLocalityDelay = 0
>
>
> Then, look at values for maxActiveAppsUsingAbsCap &
> maxActiveApplicationsPerUser. That should help debugging.
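>
> For illustration (hypothetical numbers, not taken from your cluster): with
> 16384 MB of cluster memory, a 1024 MB minimum allocation,
> maxAMResourcePerQueuePercent = 0.1 and absoluteMaxCapacity = 0.12 for the
> small queue, the formula above gives
>
> maxActiveApplications = max(ceil((16384 / 1024) * 0.1 * 0.12), 1)
>                       = max(ceil(0.192), 1)
>                       = 1
>
> meaning only one AM, and therefore only one application, can be active in
> that queue at a time regardless of how many are submitted.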
>
> thanks,
> Arun
>
>
> On Sun, Sep 7, 2014 at 9:37 AM, Anfernee Xu <anfernee.xu@gmail.com> wrote:
>
>> Hi,
>>
>> I'm running my cluster on Hadoop 2.2.0 and use the CapacityScheduler. All
>> my jobs are uberized and run across 2 queues: one queue takes the majority
>> of the capacity (90%) and the other takes 10%. What I found is that in the
>> small queue only one job runs at any given time. I tried tweaking the
>> properties below, but no luck so far. Could you guys shed some light on
>> this?
>>
>>  <property>
>>     <name>yarn.scheduler.capacity.maximum-am-resource-percent</name>
>>     <value>1.0</value>
>>     <description>
>>       Maximum percent of resources in the cluster which can be used to run
>>       application masters i.e. controls number of concurrent running
>>       applications.
>>     </description>
>>   </property>
>>
>>
>>  <property>
>>     <name>yarn.scheduler.capacity.root.queues</name>
>>     <value>default,small</value>
>>     <description>
>>       The queues at this level (root is the root queue).
>>     </description>
>>   </property>
>>
>>  <property>
>>
>> <name>yarn.scheduler.capacity.root.small.maximum-am-resource-percent</name>
>>     <value>1.0</value>
>>   </property>
>>
>>
>>  <property>
>>     <name>yarn.scheduler.capacity.root.small.user-limit</name>
>>     <value>1</value>
>>   </property>
>>
>>  <property>
>>     <name>yarn.scheduler.capacity.root.default.capacity</name>
>>     <value>88</value>
>>     <description>Default queue target capacity.</description>
>>   </property>
>>
>>
>>   <property>
>>     <name>yarn.scheduler.capacity.root.small.capacity</name>
>>     <value>12</value>
>>     <description>Small queue target capacity.</description>
>>   </property>
>>
>>  <property>
>>     <name>yarn.scheduler.capacity.root.default.maximum-capacity</name>
>>     <value>88</value>
>>     <description>
>>       The maximum capacity of the default queue.
>>     </description>
>>   </property>
>>
>>   <property>
>>     <name>yarn.scheduler.capacity.root.small.maximum-capacity</name>
>>     <value>12</value>
>>     <description>Maximum queue capacity.</description>
>>   </property>
>>
>>
>> Thanks
>>
>> --
>> --Anfernee
>>
>
>
>
> --
> Arun C. Murthy
> Hortonworks Inc.
> http://hortonworks.com/
>




-- 
--Anfernee
