hadoop-common-issues mailing list archives

From "genericqa (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-13837) Always get unable to kill error message even the hadoop process was successfully killed
Date Fri, 16 Feb 2018 04:46:00 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-13837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16366608#comment-16366608 ]

genericqa commented on HADOOP-13837:
------------------------------------

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 29s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:red}-1{color} | {color:red} test4tests {color} | {color:red}  0m  0s{color} | {color:red} The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 21m 27s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m 41s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 11m 30s{color} | {color:green} branch has no errors when building and testing our client artifacts. {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m 14s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} shellcheck {color} | {color:green}  0m  4s{color} | {color:green} There were no new shellcheck issues. {color} |
| {color:green}+1{color} | {color:green} shelldocs {color} | {color:green}  0m 11s{color} | {color:green} There were no new shelldocs issues. {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m  0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 13m 29s{color} | {color:green} patch has no errors when building and testing our client artifacts. {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green}  2m 15s{color} | {color:green} hadoop-common in the patch passed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 25s{color} | {color:green} The patch does not generate ASF License warnings. {color} |
| {color:black}{color} | {color:black} {color} | {color:black} 53m 30s{color} | {color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:5b98639 |
| JIRA Issue | HADOOP-13837 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12900444/HADOOP-13837.05.patch |
| Optional Tests |  asflicense  mvnsite  unit  shellcheck  shelldocs  |
| uname | Linux 69217260963f 3.13.0-135-generic #184-Ubuntu SMP Wed Oct 18 11:55:51 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / 8013475 |
| maven | version: Apache Maven 3.3.9 |
| shellcheck | v0.4.6 |
|  Test Results | https://builds.apache.org/job/PreCommit-HADOOP-Build/14145/testReport/ |
| Max. process+thread count | 341 (vs. ulimit of 5500) |
| modules | C: hadoop-common-project/hadoop-common U: hadoop-common-project/hadoop-common |
| Console output | https://builds.apache.org/job/PreCommit-HADOOP-Build/14145/console |
| Powered by | Apache Yetus 0.8.0-SNAPSHOT   http://yetus.apache.org |


This message was automatically generated.



> Always get unable to kill error message even the hadoop process was successfully killed
> ---------------------------------------------------------------------------------------
>
>                 Key: HADOOP-13837
>                 URL: https://issues.apache.org/jira/browse/HADOOP-13837
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: scripts
>            Reporter: Weiwei Yang
>            Assignee: Weiwei Yang
>            Priority: Critical
>         Attachments: HADOOP-13837.01.patch, HADOOP-13837.02.patch, HADOOP-13837.03.patch, HADOOP-13837.04.patch, HADOOP-13837.05.patch, check_proc.sh
>
>
> *Reproduce steps*
> # Set up a Hadoop cluster
> # Stop the resource manager: yarn --daemon stop resourcemanager
> # Stop the node manager: yarn --daemon stop nodemanager
> WARNING: nodemanager did not stop gracefully after 5 seconds: Trying to kill with kill -9
> ERROR: Unable to kill 20325
> The command always prints an "Unable to kill <nm_pid>" error message, which gives the user the impression that something is wrong with the node manager process because it could not be forcibly killed. In fact, the kill command works as expected.
> This happens because hadoop-functions.sh does not properly check whether the process still exists after the kill. Currently it checks process liveness immediately after the kill command:
> {code}
> ...
> kill -9 "${pid}" >/dev/null 2>&1
> if ps -p "${pid}" > /dev/null 2>&1; then
>       hadoop_error "ERROR: Unable to kill ${pid}"
> ...
> {code}
> When the resource manager is stopped before the node managers, the node manager process always takes some additional time to terminate completely. I printed the output of {{ps -p <nm_pid>}} in a while loop after kill -9 and found the following:
> {noformat}
> 16212 ?        00:00:11 java <defunct>
> 0
>   PID TTY          TIME CMD
> 16212 ?        00:00:11 java <defunct>
> 0
>   PID TTY          TIME CMD
> 16212 ?        00:00:11 java <defunct>
> 0
>   PID TTY          TIME CMD
> 1
>   PID TTY          TIME CMD
> 1
>   PID TTY          TIME CMD
> 1
>   PID TTY          TIME CMD
> ...
> {noformat}
> In the first 3 iterations of the loop the process had not yet terminated (it was still listed as <defunct>), so the exit code of {{ps -p}} was still {{0}}.
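The observation above can be reproduced with a small loop like the following. This is only a sketch of the kind of loop the description mentions, not the attached check_proc.sh; the function name {{watch_pid}} and its iteration and interval parameters are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper (not part of hadoop-functions.sh): print the output of
# `ps -p` for a pid, followed by its exit status, a fixed number of times.
watch_pid() {
  local pid=$1
  local count=${2:-6}      # number of iterations (assumed default)
  local interval=${3:-1}   # seconds to sleep between checks (assumed default)
  local i=0
  while [ "${i}" -lt "${count}" ]; do
    ps -p "${pid}"         # still lists the process while it is <defunct>
    echo "$?"              # 0 while ps finds the pid, 1 once it is gone
    sleep "${interval}"
    i=$((i + 1))
  done
}
```

Calling {{watch_pid <nm_pid>}} right after the {{kill -9}} should show the exit status switching from 0 to 1 once the process has fully gone away.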
> *Proposal of a fix*
> My first idea was to add a more comprehensive pid check that polls the pid liveness until HADOOP_STOP_TIMEOUT is reached, but this seems to add too much complexity. The second option is simply to add a {{sleep 3}} after {{kill -9}}; it should fix the error in most cases with a relatively small change to the script.
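For comparison, the first option in the proposal (polling until a timeout) could look roughly like the sketch below. It is an illustration only, not the content of any attached patch: the function names {{wait_for_pid_exit}} and {{force_kill}} are hypothetical, and falling back to 5 seconds when HADOOP_STOP_TIMEOUT is unset is an assumption made for this example.

```shell
#!/bin/sh
# Hypothetical helper, not the committed fix: poll until the pid is no longer
# visible to `ps -p`, or until the timeout (in seconds) has elapsed.
wait_for_pid_exit() {
  local pid=$1
  local timeout=${2:-${HADOOP_STOP_TIMEOUT:-5}}  # 5s fallback is an assumption
  local waited=0
  while ps -p "${pid}" > /dev/null 2>&1; do
    if [ "${waited}" -ge "${timeout}" ]; then
      return 1               # still listed after the timeout
    fi
    sleep 1
    waited=$((waited + 1))
  done
  return 0
}

# Hypothetical wrapper: report an error only if the process is still listed
# after the polling period, instead of checking immediately after the kill.
force_kill() {
  local pid=$1
  kill -9 "${pid}" > /dev/null 2>&1
  if ! wait_for_pid_exit "${pid}" "${2:-}"; then
    echo "ERROR: Unable to kill ${pid}" >&2
    return 1
  fi
}
```

Compared with a fixed {{sleep 3}}, this returns as soon as the process disappears and only reports a failure if the pid is still listed after the full timeout, at the cost of the extra loop the description considers too complex.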



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: common-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: common-issues-help@hadoop.apache.org

