hive-dev mailing list archives

From "Ivan Mitic (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HIVE-7190) WebHCat launcher task failure can cause two concurrent user jobs to run
Date Thu, 12 Jun 2014 00:03:02 GMT

     [ https://issues.apache.org/jira/browse/HIVE-7190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ivan Mitic updated HIVE-7190:
-----------------------------

    Attachment: HIVE-7190.3.patch

Addressing Eugene's feedback. 

> WebHCat launcher task failure can cause two concurrent user jobs to run
> -----------------------------------------------------------------------
>
>                 Key: HIVE-7190
>                 URL: https://issues.apache.org/jira/browse/HIVE-7190
>             Project: Hive
>          Issue Type: Bug
>          Components: WebHCat
>    Affects Versions: 0.13.0
>            Reporter: Ivan Mitic
>         Attachments: HIVE-7190.2.patch, HIVE-7190.3.patch, HIVE-7190.patch
>
>
> Templeton uses launcher jobs to launch the actual user jobs. A launcher job is a 1-map
> job (a single-task job) that kicks off the actual user job and monitors it until it
> finishes. Because the launcher runs as a task like any other MR task, it has a retry
> policy in case it fails (due to a task crash, a tasktracker/nodemanager crash, a
> machine-level outage, etc.). When the launcher task is retried, it launches the same
> user job again, *however* the user job from the previous attempt is already running.
> This means that two identical user jobs can end up running in parallel.
> In the case of MRv2, both the MRAppMaster and the launcher task are subject to failure.
> If either of them fails, another instance of the user job will be launched in parallel.
> The situation described above is already a bug.
> Going further to RM HA: on failover/restart, the RM kills all containers and restarts
> all applications. This means that if our customer had 10 jobs on the cluster (that is,
> 10 launcher jobs and 10 user jobs), on RM failover all 20 jobs will be restarted, and
> the launcher jobs will queue the user jobs again. There are two issues with this design:
> 1. There is a *possible* chance of corruption of the job outputs (it would be useful to
> analyze this scenario further and confirm this statement).
> 2. Cluster resources are spent redundantly on duplicate jobs.
> To address the issue, at least on YARN (Hadoop 2.0) clusters, WebHCat should do the same
> thing Oozie does in this scenario: tag all of its child jobs with an id, and on launcher
> task restart kill any jobs carrying that tag before they are kicked off again.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
