hive-issues mailing list archives

From "Sergey Shelukhin (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HIVE-14886) File deduplication in FSOP is not used correctly for list bucketing
Date Tue, 04 Oct 2016 18:00:23 GMT

    [ https://issues.apache.org/jira/browse/HIVE-14886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15546146#comment-15546146 ]

Sergey Shelukhin commented on HIVE-14886:
-----------------------------------------

[~brocknoland] [~mohitsabharwal] you guys seem to have touched list bucketing last... are
you familiar with that feature?

> File deduplication in FSOP is not used correctly for list bucketing
> -------------------------------------------------------------------
>
>                 Key: HIVE-14886
>                 URL: https://issues.apache.org/jira/browse/HIVE-14886
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Sergey Shelukhin
>
> I am making things work for MM tables, and I noticed this after adding logging to the removeTempOrDuplicateFiles/2 method, which is called from FSOP:
> {noformat}
>       } else /* sershe: means "if !isTempPath(one)" */ {
>         String taskId = getPrefixedTaskIdFromFilename(one.getPath().getName());
>         Utilities.LOG14535.info("removeTempOrDuplicateFiles pondering " + one.getPath()
>             + ", taskId " + taskId);
> {noformat}
> This is called from FSOP jobCloseOp, via Utilities.mvFileToFinalPath, and then via the non-dynpart path in removeTempOrDuplicateFiles/4.
> The taskId line is from the original code; the value is used later to decide the fate of the file.
> The files passed in are from the root of the table, disregarding list bucketing, so what happens is this:
> {noformat}
> 2016-10-03T19:01:38,615  INFO [912dde0f-91af-4a27-b358-5d782897ed1d main] Log14535: removeTempOrDuplicateFiles pondering hdfs://localhost:63026/build/ql/test/data/warehouse/skew_mm/.hive-staging_hive_2016-10-03_19-01-38_324_9113577068018508885-1/_tmp.-ext-10000/HIVE_DEFAULT_LIST_BUCKETING_DIR_NAME, taskId HIVE_DEFAULT_LIST_BUCKETING_DIR_NAME
> 2016-10-03T19:01:38,616  INFO [912dde0f-91af-4a27-b358-5d782897ed1d main] Log14535: removeTempOrDuplicateFiles pondering hdfs://localhost:63026/build/ql/test/data/warehouse/skew_mm/.hive-staging_hive_2016-10-03_19-01-38_324_9113577068018508885-1/_tmp.-ext-10000/k1=0, taskId 0 [sershe: this is only true by coincidence, the task ID comes from the k1 value]
> {noformat}
> When I started calling the method correctly on the MM path, it started deleting files from different LB directories, treating them as duplicates of each other. So some special logic, similar to what exists for dpCtx, may be needed here.
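
To illustrate the failure mode described above, here is a minimal, self-contained sketch. It is not Hive's code: the class name, the `taskIdOf` regex, and the keep-first duplicate rule are simplified assumptions made for illustration, and the file names under each list bucketing directory are hypothetical. It shows how keying duplicates on the task ID alone conflates files from different list bucketing directories, while keying on the directory plus the task ID (one reading of "special logic similar to dpCtx") does not.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ListBucketingDedupSketch {
    // Assumed, simplified task ID pattern: the first run of digits that can end
    // the name, optionally followed by "_<attempt>". Not Hive's actual regex.
    private static final Pattern TASK_ID = Pattern.compile("^.*?([0-9]+)(_[0-9]+)?$");

    // Returns the task ID, or the whole name when the pattern does not match.
    static String taskIdOf(String fileName) {
        Matcher m = TASK_ID.matcher(fileName);
        return m.matches() ? m.group(1) : fileName;
    }

    // Returns the paths that would be treated as duplicates. The first file seen
    // for a key is kept (a simplification). With perDirectory == false the key is
    // the task ID alone; with perDirectory == true it is (parent dir, task ID).
    static List<String> duplicates(List<String> paths, boolean perDirectory) {
        Set<String> seen = new HashSet<>();
        List<String> dups = new ArrayList<>();
        for (String path : paths) {
            int slash = path.lastIndexOf('/');
            String dir = slash < 0 ? "" : path.substring(0, slash);
            String name = path.substring(slash + 1);
            String key = perDirectory ? dir + "/" + taskIdOf(name) : taskIdOf(name);
            if (!seen.add(key)) {
                dups.add(path);
            }
        }
        return dups;
    }

    public static void main(String[] args) {
        // Hypothetical layout: two LB directories, plus one retried task attempt.
        List<String> paths = Arrays.asList(
            "k1=0/000000_0",
            "HIVE_DEFAULT_LIST_BUCKETING_DIR_NAME/000000_0",
            "k1=0/000000_1");

        // Directory names fed to the task ID parser, as in the log above.
        System.out.println(taskIdOf("k1=0"));
        System.out.println(taskIdOf("HIVE_DEFAULT_LIST_BUCKETING_DIR_NAME"));

        // Task ID only: the file in the default LB directory is also flagged.
        System.out.println(duplicates(paths, false));
        // Directory-aware: only the retried attempt inside k1=0 is flagged.
        System.out.println(duplicates(paths, true));
    }
}
```

Under these assumptions, the directory-unaware variant flags a file that belongs to a different list bucketing directory, which matches the deletion behaviour reported in the description.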



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
