hadoop-hive-dev mailing list archives

From Siying Dong <siyin...@facebook.com>
Subject RE: [jira] Commented: (HIVE-1492) FileSinkOperator should remove duplicated files from the same task based on file sizes
Date Thu, 29 Jul 2010 19:21:03 GMT
Larger files are not guaranteed to be the right ones. (For example, user-defined transform
scripts can freely access external resources and generate output over which we have no
control.) But the largest file, rather than the first one, is much more likely to be the
correct one. Until we use the new MapReduce API to fix the issue of wrong results being
generated in MapReduce, this patch will help us fix the problem in most scenarios.

-----Original Message-----
From: He Yongqiang (JIRA) [mailto:jira@apache.org] 
Sent: Thursday, July 29, 2010 12:12 PM
To: hive-dev@hadoop.apache.org
Subject: [jira] Commented: (HIVE-1492) FileSinkOperator should remove duplicated files from
the same task based on file sizes


    [ https://issues.apache.org/jira/browse/HIVE-1492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12893782#action_12893782 ]

He Yongqiang commented on HIVE-1492:
------------------------------------

The assumption of map-reduce is that,
given the same input and the same m/r function, the output is always the same.

Otherwise the map-reduce fault-tolerance mechanism would be incorrect.

> FileSinkOperator should remove duplicated files from the same task based on file sizes
> --------------------------------------------------------------------------------------
>
>                 Key: HIVE-1492
>                 URL: https://issues.apache.org/jira/browse/HIVE-1492
>             Project: Hadoop Hive
>          Issue Type: Bug
>    Affects Versions: 0.7.0
>            Reporter: Ning Zhang
>            Assignee: Ning Zhang
>             Fix For: 0.7.0
>
>         Attachments: HIVE-1492.patch, HIVE-1492_branch-0.6.patch
>
>
> FileSinkOperator.jobClose() calls Utilities.removeTempOrDuplicateFiles() to retain only
> one file for each task. A task could produce multiple files due to failed attempts or speculative
> runs. The largest file should be retained rather than the first file for each task.
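
The "keep the largest file per task" rule described above can be sketched roughly as follows.
This is only an illustration, not the code in the attached patch or the actual body of
Utilities.removeTempOrDuplicateFiles(): the class LargestAttemptFilter, the OutputFile type,
and the "<taskId>_<attemptId>" file naming scheme are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative sketch only; all names here are hypothetical, not Hive APIs. */
public class LargestAttemptFilter {

    /** Hypothetical stand-in for a file status: a file name and its length in bytes. */
    public static final class OutputFile {
        public final String name;
        public final long length;

        public OutputFile(String name, long length) {
            this.name = name;
            this.length = length;
        }
    }

    /** Assumed naming scheme "<taskId>_<attemptId>", e.g. "000003_1" -> "000003". */
    public static String taskIdOf(String fileName) {
        int idx = fileName.lastIndexOf('_');
        return idx < 0 ? fileName : fileName.substring(0, idx);
    }

    /**
     * Groups files by task id and keeps the largest file of each task.
     * Every other file is appended to toDelete. On a size tie, the file
     * seen first is kept.
     */
    public static Map<String, OutputFile> retainLargestPerTask(
            List<OutputFile> files, List<OutputFile> toDelete) {
        Map<String, OutputFile> kept = new HashMap<>();
        for (OutputFile f : files) {
            String taskId = taskIdOf(f.name);
            OutputFile current = kept.get(taskId);
            if (current == null) {
                kept.put(taskId, f);          // first file seen for this task
            } else if (f.length > current.length) {
                toDelete.add(current);        // a larger attempt replaces the kept one
                kept.put(taskId, f);
            } else {
                toDelete.add(f);              // smaller or equal: discard the newcomer
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<OutputFile> files = new ArrayList<>();
        files.add(new OutputFile("000000_0", 10L)); // e.g. a failed, truncated attempt
        files.add(new OutputFile("000000_1", 25L)); // e.g. the attempt that completed
        files.add(new OutputFile("000001_0", 7L));

        List<OutputFile> toDelete = new ArrayList<>();
        Map<String, OutputFile> kept = retainLargestPerTask(files, toDelete);
        for (OutputFile f : kept.values()) {
            System.out.println("keep " + f.name + " (" + f.length + " bytes)");
        }
    }
}
```

As noted elsewhere in this thread, size is only a heuristic: a larger file is more likely, but
not guaranteed, to be the complete one.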

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
