hadoop-hive-dev mailing list archives

From "Ning Zhang (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HIVE-1463) hive output file names are unnecessarily large
Date Fri, 16 Jul 2010 19:53:50 GMT

    [ https://issues.apache.org/jira/browse/HIVE-1463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12889290#action_12889290 ]

Ning Zhang commented on HIVE-1463:

A couple of questions:
 1) In Utilities.getTaskId(), the pattern looks like '.*_[mr]_0{0,5}', which works for Hadoop
0.20. But Hadoop uses different file name patterns depending on the Hadoop version and the
execution mode (local/MR). HIVE-1416 is trying to move this function to shims, but it has some
diffs and is not complete yet.
 2) Previously the task ID in the file names was padded with 0s, so task_00004 sorted
before task_00033; now task_4 sorts after task_33. This may introduce a problem in bucketed
joins, where the files corresponding to the same bucket are joined together. If the two tables
use different naming conventions, there may be a mismatch. I think Namit and Yongqiang are
more familiar with that. Can you two comment?
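
To make the two concerns concrete, here is a minimal sketch in Python. It is not taken from the patch: TASK_RE, task_number, padded_name and short_name are hypothetical names, the regex is only an illustration of matching the Hadoop 0.20 style name quoted in this issue, and the five-digit padding width is assumed from the task_00004 example above.

```python
import re

# Hypothetical pattern, for illustration only (not the one in the patch).
# It matches names shaped like attempt_<job>_<m|r>_<task>_<attempt> and
# captures the task number. Other Hadoop versions or local mode may use
# different shapes, which is the concern in point 1.
TASK_RE = re.compile(r"^.*_[mr]_(\d+)_\d+$")


def task_number(attempt_name):
    """Return the numeric task ID embedded in an attempt-style file name."""
    m = TASK_RE.match(attempt_name)
    if m is None:
        raise ValueError("unrecognized file name: %r" % attempt_name)
    return int(m.group(1))


def padded_name(n):
    # Old convention (assumed width of 5): zero-padded task ID.
    return "task_%05d" % n


def short_name(n):
    # New convention: no padding.
    return "task_%d" % n


task_ids = [4, 33]

# Plain string sorting, as a directory listing sorted by name would give.
padded = sorted(padded_name(n) for n in task_ids)
short = sorted(short_name(n) for n in task_ids)

print(padded)  # ['task_00004', 'task_00033']
print(short)   # ['task_33', 'task_4']

# Sorting by the parsed number restores the same order for both conventions.
by_number = sorted(short, key=lambda s: int(s.rsplit("_", 1)[1]))
print(by_number)  # ['task_4', 'task_33']
```

If bucket files are paired by their position in a name-sorted listing, the second output shows how a table written with unpadded names could line up differently from one written with padded names. Whether the bucketed join code actually relies on that ordering is the open question above.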

> hive output file names are unnecessarily large
> ----------------------------------------------
>                 Key: HIVE-1463
>                 URL: https://issues.apache.org/jira/browse/HIVE-1463
>             Project: Hadoop Hive
>          Issue Type: Improvement
>            Reporter: Joydeep Sen Sarma
>         Attachments: hive-1463.1.patch
> Hive's output files are named like this:
> attempt_201006221843_431854_r_000000_0
> out of all of this goop - only one character '0' would have sufficed. we should fix this.
> This would help environments with namenode memory constraints.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
