hadoop-hive-dev mailing list archives

From "Ning Zhang (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HIVE-1492) FileSinkOperator should remove duplicated files from the same task based on file sizes
Date Mon, 16 Aug 2010 22:43:17 GMT

    [ https://issues.apache.org/jira/browse/HIVE-1492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12899152#action_12899152 ]

Ning Zhang commented on HIVE-1492:
----------------------------------

Agree that we should catch the exception in (Combine)HiveRecordReader, but those are only used
on the map side. On the reduce side the RecordReader is not invoked, and exceptions can also be
thrown outside of reduce(). This fix catches that case as well.

I've filed another JIRA, HIVE-1543, for catching exceptions in RecordReaders.
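
To make the reduce-side case concrete, here is a minimal sketch of the pattern the comment
describes, using the old org.apache.hadoop.mapred API current at the time. The class and its
body are illustrative only, not Hive's actual ExecReducer code: exceptions thrown outside
reduce(), e.g. in close(), must be surfaced so the attempt is marked failed rather than
committing a partial output file.

    import java.io.IOException;
    import java.util.Iterator;

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    // Hypothetical sketch, not Hive code: shows where reduce-side exceptions
    // can arise outside of reduce() itself.
    public class SketchReducer extends MapReduceBase
        implements Reducer<Text, Text, Text, Text> {

      public void reduce(Text key, Iterator<Text> values,
          OutputCollector<Text, Text> out, Reporter reporter) throws IOException {
        while (values.hasNext()) {
          out.collect(key, values.next());
        }
      }

      // This runs outside reduce(); rethrowing as IOException lets the
      // framework fail the attempt instead of leaving a partial file that a
      // later dedup step would have to reconcile against a duplicate.
      public void close() throws IOException {
        try {
          // flush the operator tree, finalize per-task output, etc.
        } catch (Exception e) {
          throw new IOException("error while closing reducer: " + e.getMessage());
        }
      }
    }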



> FileSinkOperator should remove duplicated files from the same task based on file sizes
> --------------------------------------------------------------------------------------
>
>                 Key: HIVE-1492
>                 URL: https://issues.apache.org/jira/browse/HIVE-1492
>             Project: Hadoop Hive
>          Issue Type: Bug
>    Affects Versions: 0.7.0
>            Reporter: Ning Zhang
>            Assignee: Ning Zhang
>             Fix For: 0.6.0, 0.7.0
>
>         Attachments: HIVE-1492.patch, HIVE-1492_branch-0.6.patch
>
>
> FileSinkOperator.jobClose() calls Utilities.removeTempOrDuplicateFiles() to retain only
> one file for each task. A task could produce multiple files due to failed attempts or
> speculative execution. The largest file should be retained for each task rather than the first.
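
For illustration, a minimal sketch of the retention logic the description calls for, keeping
the largest file per task. The task-id file naming below is a simplifying assumption for this
sketch, not the actual scheme used by Utilities.removeTempOrDuplicateFiles():

    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;

    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class DedupSketch {
      // Keep only the largest file per task id in dir; delete the others.
      // Assumes attempt files are named "<taskid>_<attempt>", e.g. "000000_0"
      // and "000000_1" -- a hypothetical simplification for this sketch.
      static void removeDuplicates(FileSystem fs, Path dir) throws IOException {
        Map<String, FileStatus> best = new HashMap<String, FileStatus>();
        for (FileStatus file : fs.listStatus(dir)) {
          String name = file.getPath().getName();
          int sep = name.indexOf('_');
          String taskId = (sep < 0) ? name : name.substring(0, sep);
          FileStatus prev = best.get(taskId);
          if (prev == null) {
            best.put(taskId, file);
          } else if (file.getLen() > prev.getLen()) {
            fs.delete(prev.getPath(), true); // drop the smaller file seen earlier
            best.put(taskId, file);
          } else {
            fs.delete(file.getPath(), true); // drop the smaller file seen later
          }
        }
      }
    }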

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

