hadoop-hive-dev mailing list archives

From "He Yongqiang (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HIVE-1071) Making RCFile "concatenatable" to reduce the number of files of the output
Date Thu, 21 Jan 2010 20:40:54 GMT

    [ https://issues.apache.org/jira/browse/HIVE-1071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12803479#action_12803479 ]

He Yongqiang commented on HIVE-1071:

Concatenating files into a single file leaves 2 problems to solve:
1) the partial last block of each middle file needs to be zero-filled (why does HDFS assume
that all blocks in a single file have the same size, and will the DFSClient check that?).
2) the file header of every middle file has to be removed.
1) is easy to do, but how do we do 2)?
Another option to consider is something like HAR. We can pack the files into a single file
and let HDFS/the namenode know only about the packed file. That way we can even pack files
with different file formats together.
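
To make problem 2) concrete, here is a minimal Python sketch (not taken from the issue) of
where each file's header would land after zero-filling. The name `header_offsets` and the
example sizes are hypothetical; the only assumption is that every file except the last is
padded up to a multiple of the block size before concatenation.

```python
def header_offsets(file_lengths, block_size):
    """Return the byte offset at which each input file would start
    inside the concatenated file, assuming every file except the last
    is zero-filled up to the next multiple of block_size."""
    offsets = []
    position = 0
    for length in file_lengths:
        offsets.append(position)
        # Round this file's length up to a whole number of blocks.
        blocks = -(-length // block_size)  # ceiling division
        position += blocks * block_size
    return offsets


# Illustrative sizes only: three files and a 64-byte "block".
print(header_offsets([100, 64, 10], block_size=64))  # [0, 128, 192]
```

Every offset after the first marks a file header sitting in the middle of the merged data,
which is the header that problem 2) is about.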

> Making RCFile "concatenatable" to reduce the number of files of the output
> --------------------------------------------------------------------------
>                 Key: HIVE-1071
>                 URL: https://issues.apache.org/jira/browse/HIVE-1071
>             Project: Hadoop Hive
>          Issue Type: Improvement
>            Reporter: Zheng Shao
> Hive automatically determines the number of reducers most of the time.
> Sometimes, we create a lot of small files.
> Hive has an option to "merge" those small files through a map-reduce job.
> Dhruba has an idea that can fix it even faster:
> if we can make RCFile concatenatable, then we can simply tell the namenode to "merge" these files.
> Pros: This approach does not do any I/O so it's faster.
> Cons: We have to zero-fill the files to make sure they can be concatenated (all blocks except the last have to be full HDFS blocks).
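
The zero-fill cost named in the Cons can be estimated with a small sketch. This is Python
written for illustration, not Hive or HDFS code; `padding_bytes`, `total_padding`, and the
sizes used are hypothetical.

```python
def padding_bytes(length, block_size):
    """Zero bytes needed to extend a file to a full-block boundary."""
    return (-length) % block_size


def total_padding(file_lengths, block_size):
    """Total zero-fill for a concatenation: every file except the
    last must end on a block boundary, the last may stay partial."""
    return sum(padding_bytes(n, block_size) for n in file_lengths[:-1])


# Illustrative numbers: 64 MB blocks and three small output files.
MB = 1024 * 1024
sizes = [5 * MB, 70 * MB, 3 * MB]
print(total_padding(sizes, 64 * MB) // MB)  # 117 (MB of zeros added)
```

In this example 117 MB of zeros are added to 78 MB of data.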

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
