hadoop-yarn-issues mailing list archives

From "genericqa (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-7560) Resourcemanager hangs when resourceUsedWithWeightToResourceRatio return a overflow value
Date Mon, 27 Nov 2017 11:29:00 GMT

    [ https://issues.apache.org/jira/browse/YARN-7560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16266665#comment-16266665 ]

genericqa commented on YARN-7560:
---------------------------------

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 17m 25s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:red}-1{color} | {color:red} test4tests {color} | {color:red}  0m  0s{color} | {color:red} The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 17m 45s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 38s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 30s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 45s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 11m  3s{color} | {color:green} branch has no errors when building and testing our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m 11s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 27s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 45s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 41s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 41s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 27s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 41s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m  0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 11m 12s{color} | {color:green} patch has no errors when building and testing our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m 21s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 25s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 61m 24s{color} | {color:red} hadoop-yarn-server-resourcemanager in the patch failed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 20s{color} | {color:green} The patch does not generate ASF License warnings. {color} |
| {color:black}{color} | {color:black} {color} | {color:black}126m 45s{color} | {color:black} {color} |
\\
\\
|| Reason || Tests ||
| Failed junit tests | hadoop.yarn.server.resourcemanager.scheduler.capacity.TestNodeLabelContainerAllocation |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:5b98639 |
| JIRA Issue | YARN-7560 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12899328/YARN-7560.001.patch |
| Optional Tests |  asflicense  compile  javac  javadoc  mvninstall  mvnsite  unit  shadedclient  findbugs  checkstyle  |
| uname | Linux dd13b8a3271f 3.13.0-129-generic #178-Ubuntu SMP Fri Aug 11 12:48:20 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / 2bde3ae |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_151 |
| findbugs | v3.1.0-RC1 |
| unit | https://builds.apache.org/job/PreCommit-YARN-Build/18664/artifact/out/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt |
|  Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/18664/testReport/ |
| Max. process+thread count | 858 (vs. ulimit of 5000) |
| modules | C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager |
| Console output | https://builds.apache.org/job/PreCommit-YARN-Build/18664/console |
| Powered by | Apache Yetus 0.7.0-SNAPSHOT   http://yetus.apache.org |


This message was automatically generated.



> Resourcemanager hangs when resourceUsedWithWeightToResourceRatio return a overflow value
> ------------------------------------------------------------------------------------------
>
>                 Key: YARN-7560
>                 URL: https://issues.apache.org/jira/browse/YARN-7560
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: fairscheduler, resourcemanager
>    Affects Versions: 3.0.0
>            Reporter: zhengchenyu
>            Assignee: zhengchenyu
>             Fix For: 3.0.0
>
>         Attachments: YARN-7560.000.patch, YARN-7560.001.patch
>
>
> In our cluster, we changed the configuration and then ran refreshQueues, after which the ResourceManager hung. The ResourceManager also could not be restarted successfully. The jstack output always looks like this:
> {code}
> "main" #1 prio=5 os_prio=0 tid=0x00007f98e8017000 nid=0x2f5 runnable [0x00007f98eed9a000]
>    java.lang.Thread.State: RUNNABLE
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.resourceUsedWithWeightToResourceRatio(ComputeFairShares.java:182)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeSharesInternal(ComputeFairShares.java:140)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeSteadyShares(ComputeFairShares.java:66)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.FairSharePolicy.computeSteadyShares(FairSharePolicy.java:148)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.recomputeSteadyShares(FSParentQueue.java:102)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.QueueManager.getQueue(QueueManager.java:148)
>         - locked <0x00007f8c4a8177a0> (a java.util.HashMap)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.QueueManager.getLeafQueue(QueueManager.java:101)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.QueueManager.updateAllocationConfiguration(QueueManager.java:387)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$AllocationReloadListener.onReload(FairScheduler.java:1728)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationFileLoaderService.reloadAllocations(AllocationFileLoaderService.java:422)
>         - locked <0x00007f8c4a7eb2e0> (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationFileLoaderService)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.initScheduler(FairScheduler.java:1597)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.serviceInit(FairScheduler.java:1621)
>         at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
>         - locked <0x00007f8c4a76ac48> (a java.lang.Object)
>         at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
>         at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceInit(ResourceManager.java:569)
>         at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
>         - locked <0x00007f8c49254268> (a java.lang.Object)
>         at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.createAndInitActiveServices(ResourceManager.java:997)
>         at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceInit(ResourceManager.java:257)
>         at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
>         - locked <0x00007f8c467495e0> (a java.lang.Object)
>         at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1220)
> {code}
> When we debugged the cluster, we found that resourceUsedWithWeightToResourceRatio returns a negative value, so the loop can never exit. In our cluster the sum of all minRes exceeds Integer.MAX_VALUE, which is why resourceUsedWithWeightToResourceRatio returns a negative value.
> Below is the loop. totalResource is a long, so it is always positive, but resourceUsedWithWeightToResourceRatio returns an int. Our cluster is large enough that the int sum overflows and comes back negative, so the loop never breaks.
> {code}
>     while (resourceUsedWithWeightToResourceRatio(rMax, schedulables, type)
>         < totalResource) {
>       rMax *= 2.0;
>     }
> {code}
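> The following is a minimal, self-contained sketch (not the actual Hadoop code; OverflowDemo and resourceUsedWithIntSum are made-up names) showing how an int accumulator can wrap to a negative value and keep the doubling loop from ever terminating:
> {code}
> public class OverflowDemo {
>   // Hypothetical stand-in for ComputeFairShares#resourceUsedWithWeightToResourceRatio:
>   // it accumulates per-queue demands into an int, so the sum can wrap around.
>   static int resourceUsedWithIntSum(int[] minShares) {
>     int total = 0;
>     for (int share : minShares) {
>       total += share;      // silently overflows once the sum passes Integer.MAX_VALUE
>     }
>     return total;          // may come back negative after overflow
>   }
> 
>   public static void main(String[] args) {
>     // Three large queues whose combined minimum share exceeds Integer.MAX_VALUE.
>     int[] minShares = {Integer.MAX_VALUE / 2, Integer.MAX_VALUE / 2, 1024};
>     long totalResource = 4L * 1024 * 1024;  // a long, so always positive
> 
>     int used = resourceUsedWithIntSum(minShares);
>     System.out.println("used = " + used);   // prints a negative number
> 
>     // The doubling loop compares the (negative) int against the positive long,
>     // so the condition stays true forever and the loop never breaks.
>     System.out.println("loop would terminate: " + (used >= totalResource));
>   }
> }
> {code}
> Widening the accumulator and the return type to long is one way to avoid the wrap-around.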



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


