ambari-issues mailing list archives

From "Jonathan Hurley (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (AMBARI-18240) During a Rolling Downgrade Oozie Long Running Jobs Can Fail
Date Tue, 23 Aug 2016 21:07:22 GMT

     [ https://issues.apache.org/jira/browse/AMBARI-18240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Hurley updated AMBARI-18240:
-------------------------------------
    Resolution: Fixed
        Status: Resolved  (was: Patch Available)

{code}
commit 04a534ceacb1887c4666c97ea0d1a2670fe4a1cd (HEAD -> trunk, origin/trunk, origin/HEAD)
Author: Jonathan Hurley <jhurley@hortonworks.com>
Date:   Tue Aug 23 12:03:19 2016 -0400

    AMBARI-18240 - During a Rolling Downgrade Oozie Long Running Jobs Can Fail (jonathanhurley)
{code}

> During a Rolling Downgrade Oozie Long Running Jobs Can Fail
> -----------------------------------------------------------
>
>                 Key: AMBARI-18240
>                 URL: https://issues.apache.org/jira/browse/AMBARI-18240
>             Project: Ambari
>          Issue Type: Bug
>          Components: ambari-server
>    Affects Versions: 2.4.0
>            Reporter: Jonathan Hurley
>            Assignee: Jonathan Hurley
>            Priority: Blocker
>             Fix For: trunk
>
>         Attachments: AMBARI-18240.patch
>
>
> - Install HDP-2.3.2.0-2950 with Ambari 2.4.0
> - Begin a long-running job (LRJ) in Oozie
> - Start upgrading to HDP-2.5.0.0-1235
> - Before the finalize step, start downgrading to HDP-2.3.2.0-2950.
> Sometimes, the LRJ will fail:
> {code}
> /usr/hdp/current/oozie-client/bin/oozie job -oozie http://natr66-grls-dlm10toeriedwngdsec-r6-10.openstacklocal:11000/oozie -info 0000001-160821214718970-oozie-oozi-C@248
> ID : 0000001-160821214718970-oozie-oozi-C@248
> ------------------------------------------------------------------------------------------------------------------------------------
> Action Number        : 248
> Console URL          : -
> Error Code           : -
> Error Message        : -
> External ID          : 0000030-160822042035608-oozie-oozi-W
> External Status      : -
> Job ID               : 0000001-160821214718970-oozie-oozi-C
> Tracker URI          : -
> Created              : 2016-08-22 00:37 GMT
> Nominal Time         : 2009-01-01 21:35 GMT
> Status               : FAILED
> Last Modified        : 2016-08-22 05:15 GMT
> First Missing Dependency : -
> ------------------------------------------------------------------------------------------------------------------------------------
> [hrt_qa@natr66-grls-dlm10toeriedwngdsec-r6-21 ~]$ /usr/hdp/current/oozie-client/bin/oozie job -oozie http://natr66-grls-dlm10toeriedwngdsec-r6-10.openstacklocal:11000/oozie -info 0000030-160822042035608-oozie-oozi-W
> Job ID : 0000030-160822042035608-oozie-oozi-W
> ------------------------------------------------------------------------------------------------------------------------------------
> Workflow Name : wordcount
> App Path      : hdfs://nameservice/user/hrt_qa/test_oozie_long_running
> Status        : FAILED
> Run           : 0
> User          : hrt_qa
> Group         : -
> Created       : 2016-08-22 05:08 GMT
> Started       : 2016-08-22 05:08 GMT
> Last Modified : 2016-08-22 05:15 GMT
> Ended         : 2016-08-22 05:15 GMT
> CoordAction ID: 0000001-160821214718970-oozie-oozi-C@248
> Actions
> ------------------------------------------------------------------------------------------------------------------------------------
> ID                                                                            Status Ext ID                 Ext Status Err Code
> ------------------------------------------------------------------------------------------------------------------------------------
> 0000030-160822042035608-oozie-oozi-W@wc                                       FAILED job_1471842441396_0002 FAILED     JA017
> ------------------------------------------------------------------------------------------------------------------------------------
> 0000030-160822042035608-oozie-oozi-W@:start:                                  OK     -                      OK         -
> ------------------------------------------------------------------------------------------------------------------------------------
> {code}
> This is caused by an outage of both NameNodes during the downgrade. 
> - We have two NNs at the "Finalize Upgrade" state; 
> -- nn1 is standby (out of safemode)
> -- nn2 is active (out of safemode)
> - A downgrade begins and we restart nn1
> -- After the restart of nn1, it hasn't come online yet. Our code tries to contact it and can't, so we move on to nn2.
> -- nn2 is online, active, and out of safemode (because it hasn't been downgraded yet), so we let the downgrade continue
> - The downgrade continues and we restart nn2
> -- However, nn1 is still coming online and isn't even standby yet
> Now we have an nn1 which isn't fully loaded and an nn2 which is restarting and trying to figure out whether to be active or standby. It's during this gap that the tests must be failing.
> So, it seems like we need to be a little bit smarter about waiting for the namenode to restart; we can't just look at the "active" one and say things are OK because it might be the next one to restart.
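
The waiting behavior described above can be sketched as follows. This is a minimal illustration in Python, not the content of AMBARI-18240.patch: the names `wait_for_namenode`, `rolling_restart`, `get_state`, and `restart` are hypothetical, and the timeout and polling values are arbitrary. `get_state` is assumed to return the HA state of a NameNode as a string, or `None` while it is unreachable; one possible implementation would wrap `hdfs haadmin -getServiceState <serviceId>`.

```python
import time

# States in which a NameNode has finished starting and rejoined the HA pair.
READY_STATES = {"active", "standby"}


def wait_for_namenode(nn_id, get_state, timeout_s=300.0, interval_s=5.0,
                      sleep=time.sleep, clock=time.monotonic):
    """Block until nn_id reports active or standby; raise TimeoutError otherwise.

    get_state is a hypothetical callable returning the HA state as a string,
    or None while the NameNode is still unreachable.
    """
    deadline = clock() + timeout_s
    while True:
        state = get_state(nn_id)
        if state in READY_STATES:
            return state
        if clock() >= deadline:
            raise TimeoutError(
                f"{nn_id} not ready after {timeout_s}s (last state: {state})")
        sleep(interval_s)


def rolling_restart(nn_ids, restart, get_state, **wait_kwargs):
    """Restart NameNodes one at a time, waiting on each before the next."""
    final_states = {}
    for nn_id in nn_ids:
        restart(nn_id)
        # Wait on the node that was just restarted, not on whichever peer is
        # currently active: that peer may be the next one to go down.
        final_states[nn_id] = wait_for_namenode(nn_id, get_state, **wait_kwargs)
    return final_states
```

With this shape, nn2 is not restarted until nn1 itself has reached active or standby, so the two restarts cannot overlap the way they do in the sequence described in the issue.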



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
