hadoop-hive-dev mailing list archives

From "Zheng Shao (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HIVE-439) merge small files whenever possible
Date Fri, 19 Jun 2009 19:42:07 GMT

    [ https://issues.apache.org/jira/browse/HIVE-439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12721958#action_12721958 ]

Zheng Shao commented on HIVE-439:
---------------------------------

Got it. Do we want to make the whole plan (containing all tasks) serializable?  With that
we will be able to compile the job once and rerun it many times in the future.


> merge small files whenever possible
> -----------------------------------
>
>                 Key: HIVE-439
>                 URL: https://issues.apache.org/jira/browse/HIVE-439
>             Project: Hadoop Hive
>          Issue Type: Improvement
>          Components: Query Processor
>            Reporter: Namit Jain
>            Assignee: Namit Jain
>         Attachments: hive.439.1.patch, hive.439.2.patch, hive.439.3.patch
>
>
> There are cases when the input to a Hive job consists of thousands of small files. In this case,
> there is a mapper for each file. Most of the overhead of spawning all these mappers can be
> avoided if these small files are combined into fewer, larger files.
> The problem can also be addressed by having a mapper span multiple blocks as in:
> https://issues.apache.org/jira/browse/HIVE-74
> But it also makes sense in Hive to merge files whenever possible.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

