hive-issues mailing list archives

From "Jason Dere (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HIVE-21214) MoveTask : Use attemptId instead of file size for deduplication of files compareTempOrDuplicateFiles()
Date Tue, 05 Feb 2019 19:51:00 GMT

    [ https://issues.apache.org/jira/browse/HIVE-21214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16761153#comment-16761153 ]

Jason Dere commented on HIVE-21214:
-----------------------------------

I'm not totally sure about the decision to change duplicate filename resolution from file
size to task attempt number. If you just fixed the file size logic to take directories into
account, the existing logic would also work in the directory case. With task attempts
we have to worry about whether this breaks any existing cases. If we are convinced that we
only need to worry about Tez execution then I guess this could work, but it does not work
on M/R with speculative execution.

In terms of code comments, might be better with RB, but I'll add comments here:
 * For the comments at the top of compareTempOrDuplicateFiles(), add a comment that this
breaks speculative execution.
 * getDirSize() may not be the best name - this is really getting the file size, and doing
so recursively in the case that the file turns out to be a directory. So maybe getFileSizeRecursively()
or something.
 * Log at debug level in getDirSize()
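
To illustrate the naming point, here is a rough sketch of what a recursive size helper does. It is not the code from the patch: it uses java.nio.file rather than Hadoop's FileSystem API so that it runs standalone, and the class name FileSizeSketch is made up for the example.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

public class FileSizeSketch {
    /**
     * Returns the size of a plain file, or the summed size of every file
     * beneath it when the path turns out to be a directory.
     * Hypothetical stand-in for the helper discussed above.
     */
    static long getFileSizeRecursively(Path path) throws IOException {
        if (!Files.isDirectory(path)) {
            return Files.size(path);
        }
        long total = 0;
        // Files.list holds an open directory handle, so close it when done.
        try (Stream<Path> children = Files.list(path)) {
            for (Path child : (Iterable<Path>) children::iterator) {
                total += getFileSizeRecursively(child);
            }
        }
        return total;
    }
}
```

With something like this, the existing "largest size wins" comparison could be applied to directories as well as plain files.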

I still need to make sense of the parsing changes.

> MoveTask : Use attemptId instead of file size for deduplication of files compareTempOrDuplicateFiles()
> ------------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-21214
>                 URL: https://issues.apache.org/jira/browse/HIVE-21214
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Deepak Jaiswal
>            Assignee: Deepak Jaiswal
>            Priority: Major
>         Attachments: HIVE-21214.1.patch
>
>
> For a given task, if there is more than one attempt then deduplication logic kicks in.
> {noformat}
> Utilities.compareTempOrDuplicateFiles(){noformat}
> The logic uses file size and picks the one with the largest size. This logic is very fragile.
> Ideally, it should pick the successful attempt's file.
> However, a simpler solution is to pick the newest attempt and also check that the file
> size for the newest attempt is the largest.
> If not, throw an exception.
>  
> cc [~gopalv] [~thejas] [~jdere] [~ekoifman]
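
For reference, the rule proposed in the description (pick the newest attempt, verify it is also the largest, otherwise throw) could look roughly like the sketch below. This is not the patch itself. The file-name shape <taskId>_<attemptId> (for example "000000_1"), the class name AttemptDedupSketch, and both method names are assumptions made for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AttemptDedupSketch {
    // Assumed file-name shape: <taskId>_<attemptId>, e.g. "000000_1".
    private static final Pattern NAME = Pattern.compile("^(\\d+)_(\\d+)$");

    /** Extracts the attempt number from a file name of the assumed shape. */
    static int attemptId(String fileName) {
        Matcher m = NAME.matcher(fileName);
        if (!m.matches()) {
            throw new IllegalArgumentException("Unexpected file name: " + fileName);
        }
        return Integer.parseInt(m.group(2));
    }

    /**
     * Picks the file written by the newer attempt, and throws if that file
     * is smaller than the one from the older attempt.
     */
    static String pickNewestAttempt(String nameA, long sizeA, String nameB, long sizeB) {
        boolean aIsNewer = attemptId(nameA) > attemptId(nameB);
        String newest = aIsNewer ? nameA : nameB;
        long newestSize = aIsNewer ? sizeA : sizeB;
        long otherSize = aIsNewer ? sizeB : sizeA;
        if (newestSize < otherSize) {
            throw new IllegalStateException(
                "Newest attempt " + newest + " is smaller than the older attempt");
        }
        return newest;
    }
}
```

Note that under speculative execution the newest attempt is not necessarily the one that succeeded, which is the concern raised in the comment above.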



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
