hadoop-common-issues mailing list archives

From "Allen Wittenauer (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-14855) Hadoop scripts may errantly believe a daemon is still running, preventing it from starting
Date Fri, 08 Sep 2017 21:59:00 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-14855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16159435#comment-16159435 ]

Allen Wittenauer commented on HADOOP-14855:

(I'm having a total deja vu moment right now.  I wish I could remember who else I discussed
this issue with a few years ago. haha.)

It reduces the size of the edge case from 0.5% to 0.1% (or whatever). It'll still match things
like 'cat datanode.txt'.  Execution-speed-wise, though, it's pretty expensive when one considers
that we've doubled the number of forks for every start/status/stop request.  That'll have an
impact, especially in places like QA.

But giving some further thought to it... I think you're on to something that might work pretty
well... hmm...

off the top:
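pspid=$(ps -fp "${pid}" 2>/dev/null)

# ps exits 0 only when a process with that pid exists
if [[ $? -eq 0 ]]; then
  # only treat it as ours if the command line carries the proc_ marker
  if [[ ${pspid} =~ Dproc_${daemonname} ]]; then

or whatever.  [e.g., that $? construction is fragile: any command added between the ps call
and the test would clobber $?.]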

I think that'd be nearly the same cost as we have now, and it doesn't make the edge-case situation
more expensive.  It also avoids the I/O of the very tempting alternative, writing the ps output
to a temp file. The 'grep' is replaced by an internal regex check, and since 3.x consistently
defines proc_ for jps usage, we can bounce off of that to reduce the search space even more.
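
A minimal sketch of that idea as a self-contained bash function. The helper name
pid_is_daemon and its two arguments are hypothetical, chosen for illustration; this is
not code from the Hadoop scripts. It assumes the daemon's JVM was started with a
-Dproc_<daemonname> argument, as described above.

```shell
#!/usr/bin/env bash

# Hypothetical helper (not from the Hadoop scripts): succeeds only if a
# process with the given pid exists AND its ps output carries the
# Dproc_<daemonname> marker.
pid_is_daemon()
{
  local pid=$1
  local daemonname=$2
  local pspid

  # One fork for ps; it exits non-zero when no process has that pid.
  # Testing the assignment directly avoids a separate $? check.
  if ! pspid=$(ps -fp "${pid}" 2>/dev/null); then
    return 1
  fi

  # No extra grep fork: bash's built-in regex match on the captured output.
  [[ ${pspid} =~ Dproc_${daemonname} ]]
}
```

Usage would be along the lines of: pid_is_daemon "${pid}" namenode && echo "still running".
As noted, this narrows the false positives rather than eliminating them: any process
whose command line happens to contain the same marker would still match.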

It's still not foolproof, but it does cut down the chances of false positives.  It's just
a matter of whether it's worth it or not.

BTW, there are some other patches out there regarding this code but I haven't had a chance
to really play with the edge cases. (and there are a lot.)

> Hadoop scripts may errantly believe a daemon is still running, preventing it from starting
> ------------------------------------------------------------------------------------------
>                 Key: HADOOP-14855
>                 URL: https://issues.apache.org/jira/browse/HADOOP-14855
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: scripts
>    Affects Versions: 3.0.0-alpha4
>            Reporter: Aaron T. Myers
>
> I encountered a case recently where the NN wouldn't start, with the error message "namenode
> is running as process 16769.  Stop it first." In fact the NN was not running at all, but rather
> another long-running process was running with this pid.
> It looks to me like our scripts just check to see if _any_ process is running with the
> pid that the NN (or any Hadoop daemon) most recently ran with. This is clearly not a fool-proof
> way of checking to see if a particular type of daemon is now running, as some other process
> could start running with the same pid since the daemon in question was previously shut down.

