hadoop-yarn-issues mailing list archives

From "Hadoop QA (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-9195) RM Queue's pending container number might get decreased unexpectedly or even become negative once RM failover
Date Fri, 25 Jan 2019 10:00:00 GMT

    [ https://issues.apache.org/jira/browse/YARN-9195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16752106#comment-16752106 ]

Hadoop QA commented on YARN-9195:
---------------------------------

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 16s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m  0s{color} | {color:green} The patch appears to include 3 new or modified test files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 38s{color} | {color:blue} Maven dependency ordering for branch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 19m 30s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  7m 55s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  1m 23s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  2m 32s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 15m 17s{color} | {color:green} branch has no errors when building and testing our client artifacts. {color} |
| {color:blue}0{color} | {color:blue} findbugs {color} | {color:blue}  0m  0s{color} | {color:blue} Skipped patched modules with no Java source: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-tests {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  3m 34s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  2m 19s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 13s{color} | {color:blue} Maven dependency ordering for patch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  2m  8s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  7m 18s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  7m 18s{color} | {color:green} the patch passed {color} |
| {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange}  1m 25s{color} | {color:orange} hadoop-yarn-project/hadoop-yarn: The patch generated 16 new + 296 unchanged - 0 fixed = 312 total (was 296) {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  2m 25s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m  0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 12m  5s{color} | {color:green} patch has no errors when building and testing our client artifacts. {color} |
| {color:blue}0{color} | {color:blue} findbugs {color} | {color:blue}  0m  0s{color} | {color:blue} Skipped patched modules with no Java source: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-tests {color} |
| {color:red}-1{color} | {color:red} findbugs {color} | {color:red}  0m 51s{color} | {color:red} hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client generated 1 new + 0 unchanged - 0 fixed = 1 total (was 0) {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  2m  2s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:red}-1{color} | {color:red} unit {color} | {color:red}  0m 44s{color} | {color:red} hadoop-yarn-api in the patch failed. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green}  3m 31s{color} | {color:green} hadoop-yarn-common in the patch passed. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green}  3m  4s{color} | {color:green} hadoop-yarn-server-tests in the patch passed. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 27m 10s{color} | {color:green} hadoop-yarn-client in the patch passed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 34s{color} | {color:green} The patch does not generate ASF License warnings. {color} |
| {color:black}{color} | {color:black} {color} | {color:black}118m 17s{color} | {color:black} {color} |
\\
\\
|| Reason || Tests ||
| FindBugs | module:hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client |
|  |  Inconsistent synchronization of org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl.appAttemptId; locked 75% of time. Unsynchronized access at AMRMClientImpl.java:[line 204] |
| Failed junit tests | hadoop.yarn.conf.TestYarnConfigurationFields |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:8f97d6f |
| JIRA Issue | YARN-9195 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12956285/YARN-9195.001.patch |
| Optional Tests |  dupname  asflicense  compile  javac  javadoc  mvninstall  mvnsite  unit  shadedclient  findbugs  checkstyle  |
| uname | Linux ab0c1347baa4 4.4.0-139-generic #165-Ubuntu SMP Wed Oct 24 10:58:50 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / 9fc7df8 |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_191 |
| findbugs | v3.1.0-RC1 |
| checkstyle | https://builds.apache.org/job/PreCommit-YARN-Build/23181/artifact/out/diff-checkstyle-hadoop-yarn-project_hadoop-yarn.txt |
| findbugs | https://builds.apache.org/job/PreCommit-YARN-Build/23181/artifact/out/new-findbugs-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-client.html |
| unit | https://builds.apache.org/job/PreCommit-YARN-Build/23181/artifact/out/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-api.txt |
|  Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/23181/testReport/ |
| Max. process+thread count | 692 (vs. ulimit of 10000) |
| modules | C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-tests hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client U: hadoop-yarn-project/hadoop-yarn |
| Console output | https://builds.apache.org/job/PreCommit-YARN-Build/23181/console |
| Powered by | Apache Yetus 0.8.0   http://yetus.apache.org |


This message was automatically generated.



> RM Queue's pending container number might get decreased unexpectedly or even become negative once RM failover
> -------------------------------------------------------------------------------------------------------------
>
>                 Key: YARN-9195
>                 URL: https://issues.apache.org/jira/browse/YARN-9195
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: client
>    Affects Versions: 3.1.0
>            Reporter: Shengyang Sha
>            Assignee: Shengyang Sha
>            Priority: Critical
>         Attachments: YARN-9195.001.patch, cases_to_recreate_negative_pending_requests_scenario.diff
>
>
> Hi all,
> We previously encountered a serious problem in the ResourceManager: the pending container count of one RM queue became negative after an RM failover. Because RM queues are organized hierarchically, the root queue's pending container count eventually became negative as well, which affected scheduling for the whole cluster.
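The propagation described above can be pictured with a minimal Java sketch. It assumes a simplified model in which every change to a queue's pending-container count is also applied to each of its ancestors; `HierarchyDemo`, `QueueNode` and `changePending` are hypothetical names, not the ResourceManager's actual metrics code.

```java
// Hypothetical, simplified model: each change to a queue's pending-container
// count is applied to that queue and to every ancestor up to the root.
// This is an illustration, not the ResourceManager's real metrics code.
class HierarchyDemo {
    public static void main(String[] args) {
        QueueNode root = new QueueNode("root", null);
        QueueNode batch = new QueueNode("root.batch", root);
        QueueNode adhoc = new QueueNode("root.adhoc", root);

        adhoc.changePending(3);  // a healthy queue asks for 3 containers
        batch.changePending(2);  // the batch queue asks for 2
        batch.changePending(-6); // a bogus decrement larger than what is pending

        System.out.println(batch.name + " pending = " + batch.pendingContainers); // -4
        System.out.println(root.name + " pending = " + root.pendingContainers);   // -1
    }
}

class QueueNode {
    final String name;
    final QueueNode parent; // null for the root queue
    long pendingContainers;

    QueueNode(String name, QueueNode parent) {
        this.name = name;
        this.parent = parent;
    }

    // Apply the delta to this queue and propagate it to all ancestors.
    void changePending(long delta) {
        for (QueueNode q = this; q != null; q = q.parent) {
            q.pendingContainers += delta;
        }
    }
}
```

In this model a single bogus decrement on `root.batch` is enough to pull the root total below zero, even though `root.adhoc` still has a legitimate demand of 3 containers.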
> Both our RM server and the AMRM client used by our application are based on YARN 3.1, and our application uses the AMRMClientAsync#addSchedulingRequests() method to request resources from the RM.
> After investigating, we found that the direct cause was that numAllocations in some AMs' requests became negative after the RM failed over. There are at least three necessary conditions:
> (1) The application uses scheduling requests in the AMRM client and sets numAllocations of a SchedulingRequest to zero. In our batch job scenario, numAllocations of a SchedulingRequest can drop to zero because, in theory, a full batch job can run in a single container.
> (2) The RM fails over.
> (3) After the RM restarts, and before the AM re-registers with it, the RM has already recovered some of the containers previously allocated to the application.
> Here are some more details about the implementation:
> (1) After recovery, when the AM re-registers, the RM returns all live containers to it through RegisterApplicationMasterResponse#getContainersFromPreviousAttempts.
> (2) During registerApplicationMaster, AMRMClientImpl calls removeFromOutstandingSchedulingRequests for the containers returned by getContainersFromPreviousAttempts, without checking whether those containers had already been allocated to the AM before the failover. As a consequence, the outstanding requests can be decreased unexpectedly, even when they do not become negative.
> (3) The RM performs no sanity check to validate requests from AMs.
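The double deduction described in (2) can be illustrated with a hypothetical Java model of the client-side bookkeeping. `ClientBookkeeping`, `SimContainer` and `removeFromOutstanding` are illustrative names, not AMRMClientImpl's real fields or methods; the sketch only assumes, as the description states, that each container returned from a previous attempt is deducted from the outstanding numAllocations without checking whether it was already deducted.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of the client-side bookkeeping described in (2).
// Names are illustrative; this is not AMRMClientImpl's actual code.
class DoubleDeductionDemo {
    public static void main(String[] args) {
        ClientBookkeeping client = new ClientBookkeeping();
        client.outstanding.put(1L, 2);                    // ask for 2 containers
        SimContainer c1 = new SimContainer("c1", 1L);

        client.removeFromOutstanding(Arrays.asList(c1)); // c1 allocated: 2 -> 1
        client.outstanding.put(1L, 0);                   // app lowers its ask to 0

        // The RM fails over, recovers c1, and returns it again as a container
        // from a previous attempt when the AM re-registers.
        client.removeFromOutstanding(Arrays.asList(c1)); // deducted again: 0 -> -1

        System.out.println("outstanding = " + client.outstanding.get(1L)); // -1
    }
}

class SimContainer {
    final String id;
    final long allocationRequestId;

    SimContainer(String id, long allocationRequestId) {
        this.id = id;
        this.allocationRequestId = allocationRequestId;
    }
}

class ClientBookkeeping {
    // allocationRequestId -> numAllocations the client still considers outstanding
    final Map<Long, Integer> outstanding = new HashMap<>();

    // Deducts one allocation per container, with no check on whether the
    // container was already deducted when it was first allocated.
    void removeFromOutstanding(List<SimContainer> containers) {
        for (SimContainer c : containers) {
            outstanding.computeIfPresent(c.allocationRequestId, (k, v) -> v - 1);
        }
    }
}
```

If the application had not lowered its ask to zero, the second deduction would take the count from 1 to 0 instead: the request is decreased unexpectedly without going negative, which is the other scenario described above.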
> To illustrate this case, I have written test cases based on the latest Hadoop trunk and posted them in the attachment. See testAMRMClientWithNegativePendingRequestsOnRMRestart and testAMRMClientOnUnexpectedlyDecreasedPendingRequestsOnRMRestart.
> To solve this issue, I propose filtering out already-allocated containers before calling removeFromOutstandingSchedulingRequests in AMRMClientImpl during registerApplicationMaster. Some sanity checks are also needed to keep things from getting worse.
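A minimal sketch of the proposed direction, again with hypothetical names rather than the actual patch: the client remembers which container IDs it has already deducted and skips them on re-registration, clamps the outstanding count at zero, and a separate check rejects a negative numAllocations outright, standing in for the kind of RM-side sanity check mentioned above. How YARN-9195.001.patch actually implements the filter is not shown here.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of the proposed fix; names are illustrative and do not
// correspond to AMRMClientImpl's real fields or methods, nor to the patch.
class FilteredBookkeeping {
    // allocationRequestId -> numAllocations still outstanding on the client
    final Map<Long, Integer> outstanding = new HashMap<>();
    // IDs of containers that have already been deducted once
    final Set<String> deducted = new HashSet<>();

    // Used for both normal allocations and containers from previous attempts.
    void removeFromOutstanding(List<KnownContainer> containers) {
        for (KnownContainer c : containers) {
            if (!deducted.add(c.id)) {
                continue; // already accounted for before the failover
            }
            // Client-side sanity check: never let the count drop below zero.
            outstanding.computeIfPresent(c.allocationRequestId,
                    (k, v) -> Math.max(0, v - 1));
        }
    }

    // Stand-in for a server-side check: reject a negative numAllocations.
    static void validateNumAllocations(int numAllocations) {
        if (numAllocations < 0) {
            throw new IllegalArgumentException(
                    "numAllocations must be non-negative, got " + numAllocations);
        }
    }

    public static void main(String[] args) {
        FilteredBookkeeping client = new FilteredBookkeeping();
        client.outstanding.put(1L, 2);
        KnownContainer c1 = new KnownContainer("c1", 1L);

        client.removeFromOutstanding(Arrays.asList(c1)); // c1 allocated: 2 -> 1
        client.outstanding.put(1L, 0);                   // app lowers its ask to 0
        client.removeFromOutstanding(Arrays.asList(c1)); // re-registration: filtered, stays 0

        validateNumAllocations(client.outstanding.get(1L)); // passes
        System.out.println("outstanding = " + client.outstanding.get(1L)); // 0
    }
}

class KnownContainer {
    final String id;
    final long allocationRequestId;

    KnownContainer(String id, long allocationRequestId) {
        this.id = id;
        this.allocationRequestId = allocationRequestId;
    }
}
```

Filtering by container ID makes re-registration idempotent in this model; the clamp and the validation act as a second line of defence in case some other path still produces a negative value.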
> Comments and suggestions are welcome.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: yarn-issues-help@hadoop.apache.org

