hadoop-hive-dev mailing list archives

From "Ning Zhang (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HIVE-1071) Making RCFile "concatenatable" to reduce the number of files of the output
Date Thu, 21 Jan 2010 21:29:54 GMT

    [ https://issues.apache.org/jira/browse/HIVE-1071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12803492#action_12803492 ]
Ning Zhang commented on HIVE-1071:

@Zheng and Dhruba, if a lot of them are small files (say, smaller than the block size), would
it be more efficient to merge them in a compact way rather than filling them with "zeros"?
Say we have 1000 files, each of them 10MB. If we take this approach, we will have
1000 blocks, whereas the data could fit into ~40 blocks if the block size is 256MB.
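A quick back-of-the-envelope check of the numbers in the comment (a sketch, not part of the issue; the file and block sizes are taken from the scenario above):

```python
import math

# Scenario from the comment: 1000 small files of 10 MB each,
# with an HDFS block size of 256 MB.
num_files = 1000
file_size_mb = 10
block_size_mb = 256

# One block per file if every small file occupies its own
# (mostly empty) HDFS block.
blocks_per_file_layout = num_files

# Compact merge: pack the total data into full blocks.
total_mb = num_files * file_size_mb
blocks_compact = math.ceil(total_mb / block_size_mb)

print(blocks_per_file_layout, "blocks vs", blocks_compact, "blocks")
# 1000 blocks vs 40 blocks
```

This is where the "~40 blocks" figure comes from: 10,000 MB of data rounded up to 256 MB block boundaries.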

> Making RCFile "concatenatable" to reduce the number of files of the output
> --------------------------------------------------------------------------
>                 Key: HIVE-1071
>                 URL: https://issues.apache.org/jira/browse/HIVE-1071
>             Project: Hadoop Hive
>          Issue Type: Improvement
>            Reporter: Zheng Shao
> Hive automatically determines the number of reducers most of the time.
> Sometimes, we create a lot of small files.
> Hive has an option to "merge" those small files through a map-reduce job.
> Dhruba has an idea that can fix this even faster:
> if we can make RCFile concatenatable, then we can simply tell the namenode to "merge"
> these files.
> Pros: This approach does not do any I/O, so it's faster.
> Cons: We have to zero-fill the files to make sure they can be concatenated (all blocks
> except the last have to be full HDFS blocks).
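A hypothetical sketch of the zero-fill cost mentioned in the cons (my own illustration, assuming the block-boundary rule stated above: every file except the final one must be padded up to a full-block boundary before a namenode-level concatenation):

```python
import math

block_size_mb = 256
# The scenario from the comment thread: 1000 files of 10 MB each.
file_sizes_mb = [10] * 1000

def padded_size(size_mb, block_mb):
    """Round a file up to the next full HDFS block boundary (zero-fill)."""
    return math.ceil(size_mb / block_mb) * block_mb

# All files except the last must end exactly on a block boundary,
# so each is zero-filled up to a multiple of the block size.
total_with_zero_fill = (
    sum(padded_size(s, block_size_mb) for s in file_sizes_mb[:-1])
    + file_sizes_mb[-1]
)
# A compact merge stores only the actual data.
total_compact = sum(file_sizes_mb)

print(total_with_zero_fill, "MB zero-filled vs", total_compact, "MB compact")
# 255754 MB zero-filled vs 10000 MB compact
```

This illustrates the trade-off being debated: the concatenation itself is I/O-free, but for many small files the zero-fill can inflate storage by an order of magnitude, which is what Ning's comment pushes back on.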

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
