hive-dev mailing list archives

From "Chun Chen (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HIVE-6309) Hive incorrectly removes TaskAttempt output files if MRAppMaster fails once
Date Sun, 26 Jan 2014 17:18:37 GMT

     [ https://issues.apache.org/jira/browse/HIVE-6309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chun Chen updated HIVE-6309:
----------------------------

    Description: 
We recently upgraded to hadoop 2.2 and sometimes found that, after the midnight ETL process, some tables had fewer data files than the day before. The MapReduce jobs that generated these partial tables had one thing in common: their MRAppMaster had failed once, and each affected table was left with only a single data file, 000000_1000.

The following entries in hive.log give some clues about what is going on with the incorrectly deleted data files.
{code}
$ grep 'hive_2014-01-24_12-33-18_507_6790415670781610350' hive.log
2014-01-24 12:52:43,140 WARN  exec.Utilities (Utilities.java:removeTempOrDuplicateFiles(1535))
- Duplicate taskid file removed: hdfs://hadoop00.lf.sankuai.com:9000/tmp/hive-scratch/hive-sankuai/hive_2014-01-24_12-33-18_507_6790415670781610350/_tmp.-ext-10000.intermediate/000001_1000
with length 824627293. Existing file: hdfs://hadoop00.lf.sankuai.com:9000/tmp/hive-scratch/hive-sankuai/hive_2014-01-24_12-33-18_507_6790415670781610350/_tmp.-ext-10000.intermediate/000000_1000
with length 824860643
2014-01-24 12:52:43,142 WARN  exec.Utilities (Utilities.java:removeTempOrDuplicateFiles(1535))
- Duplicate taskid file removed: hdfs://hadoop00.lf.sankuai.com:9000/tmp/hive-scratch/hive-sankuai/hive_2014-01-24_12-33-18_507_6790415670781610350/_tmp.-ext-10000.intermediate/000002_1000
with length 824681826. Existing file: hdfs://hadoop00.lf.sankuai.com:9000/tmp/hive-scratch/hive-sankuai/hive_2014-01-24_12-33-18_507_6790415670781610350/_tmp.-ext-10000.intermediate/000000_1000
with length 824860643
2014-01-24 12:52:43,149 WARN  exec.Utilities (Utilities.java:removeTempOrDuplicateFiles(1535))
- Duplicate taskid file removed: hdfs://hadoop00.lf.sankuai.com:9000/tmp/hive-scratch/hive-sankuai/hive_2014-01-24_12-33-18_507_6790415670781610350/_tmp.-ext-10000.intermediate/000003_1000
with length 824830450. Existing file: hdfs://hadoop00.lf.sankuai.com:9000/tmp/hive-scratch/hive-sankuai/hive_2014-01-24_12-33-18_507_6790415670781610350/_tmp.-ext-10000.intermediate/000000_1000
with length 824860643
2014-01-24 12:52:43,151 WARN  exec.Utilities (Utilities.java:removeTempOrDuplicateFiles(1535))
- Duplicate taskid file removed: hdfs://hadoop00.lf.sankuai.com:9000/tmp/hive-scratch/hive-sankuai/hive_2014-01-24_12-33-18_507_6790415670781610350/_tmp.-ext-10000.intermediate/000004_1000
with length 824753882. Existing file: hdfs://hadoop00.lf.sankuai.com:9000/tmp/hive-scratch/hive-sankuai/hive_2014-01-24_12-33-18_507_6790415670781610350/_tmp.-ext-10000.intermediate/000000_1000
with length 824860643
{code}

We found that this happens because in hadoop 2.2 nextAttemptNumber is 1000 or greater once the MRAppMaster has failed, and Hive does not correctly extract the task id from such filenames. See the following code in org.apache.hadoop.mapreduce.v2.app.job.impl.TaskImpl.java and ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java:
{code}
// org.apache.hadoop.mapreduce.v2.app.job.impl.TaskImpl.java
    // All the new TaskAttemptIDs are generated based on MR
    // ApplicationAttemptID so that attempts from previous lives don't
    // over-step the current one. This assumes that a task won't have more
    // than 1000 attempts in its single generation, which is very reasonable.
    nextAttemptNumber = (appAttemptId - 1) * 1000;

// ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java
   /**
    * The first group will contain the task id. The second group is the optional extension. The file
    * name looks like: "0_0" or "0_0.gz". There may be a leading prefix (tmp_). Since getTaskId() can
    * return an integer only - this should match a pure integer as well. {1,3} is used to limit
    * matching for attempts #'s 0-999.
    */
   private static final Pattern FILE_NAME_TO_TASK_ID_REGEX =
       Pattern.compile("^.*?([0-9]+)(_[0-9]{1,3})?(\\..*)?$");
{code}
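To make the effect concrete, here is a standalone check (not Hive code; the class name below is purely illustrative) that applies the regex quoted above to the file names seen in the log. Every *_1000 file yields "1000" as its task id, so removeTempOrDuplicateFiles treats them all as duplicate outputs of a single task and keeps only one of them:
{code}
// Standalone illustration only: apply FILE_NAME_TO_TASK_ID_REGEX to the
// file names from the log above. All of the *_1000 files map to task id
// "1000", so they look like duplicate outputs of one task.
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TaskIdRegexCheck {
  private static final Pattern FILE_NAME_TO_TASK_ID_REGEX =
      Pattern.compile("^.*?([0-9]+)(_[0-9]{1,3})?(\\..*)?$");

  public static void main(String[] args) {
    String[] names = {"000000_0", "000000_1000", "000001_1000", "000002_1000"};
    for (String name : names) {
      Matcher m = FILE_NAME_TO_TASK_ID_REGEX.matcher(name);
      if (m.matches()) {
        System.out.println(name + " -> task id " + m.group(1));
      }
    }
    // Output:
    // 000000_0 -> task id 000000
    // 000000_1000 -> task id 1000
    // 000001_1000 -> task id 1000
    // 000002_1000 -> task id 1000
  }
}
{code}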

Because of the above, the regex extracts the attempt number instead of the task id once the attempt number is >= 1000:
{code}
>>> re.match("^.*?([0-9]+)(_[0-9])?(\\..*)?$", 'part-r-000000_2').group(1)
'000000'
>>> re.match("^.*?([0-9]+)(_[0-9])?(\\..*)?$", 'part-r-000000_1001').group(1)
'1001'
{code}
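One possible adjustment (a sketch only; not necessarily what the attached HIVE-6309.patch does) is to let the attempt suffix match any number of digits, so the leading task id group still wins when the attempt number is >= 1000:
{code}
// Sketch only, not the attached patch: relax the attempt-number suffix so
// task ids are still extracted correctly when the attempt number >= 1000.
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RelaxedTaskIdRegex {
  private static final Pattern RELAXED_REGEX =
      Pattern.compile("^.*?([0-9]+)(_[0-9]+)?(\\..*)?$");

  public static void main(String[] args) {
    String[] names = {"000000_1000", "part-r-000000_1001", "0_0.gz", "123"};
    for (String name : names) {
      Matcher m = RELAXED_REGEX.matcher(name);
      System.out.println(name + " -> " + (m.matches() ? m.group(1) : "no match"));
    }
    // Output:
    // 000000_1000 -> 000000
    // part-r-000000_1001 -> 000000
    // 0_0.gz -> 0
    // 123 -> 123
  }
}
{code}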

> Hive incorrectly removes TaskAttempt output files if MRAppMaster fails once
> ---------------------------------------------------------------------------
>
>                 Key: HIVE-6309
>                 URL: https://issues.apache.org/jira/browse/HIVE-6309
>             Project: Hive
>          Issue Type: Bug
>         Environment: hadoop 2.2
>            Reporter: Chun Chen
>            Assignee: Chun Chen
>            Priority: Critical
>             Fix For: 0.13.0
>
>         Attachments: HIVE-6309.patch
>
>



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
