hadoop-hive-dev mailing list archives

From "Zheng Shao (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HIVE-1118) Hive merge map files should have different bytes/mapper setting
Date Fri, 29 Jan 2010 18:56:34 GMT

    [ https://issues.apache.org/jira/browse/HIVE-1118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12806435#action_12806435 ]

Zheng Shao commented on HIVE-1118:
----------------------------------

That's still much better than NOT running the merge task.
With the current setting, almost nobody will enable this by default. As a result, we are
seeing a lot of 1KB files in HDFS.

If we make this change, we can enable it by default.


I agree 1MB is not a good default. We can set it to 32MB or 64MB (and on by default).

If that's not good enough, let's introduce another parameter so we can specify a pair such
as (32MB, 64MB): the merge job starts if the average file size is smaller than 32MB, and we
end up with files of about 64MB.
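
To make the two-parameter idea concrete, here is a minimal sketch of the decision in Python. It is not Hive code: the function name plan_merge and the parameters trigger_avg_bytes and target_bytes are hypothetical, chosen only to mirror the (32MB, 64MB) pair above.

```python
import math

MB = 1024 * 1024


def plan_merge(file_sizes, trigger_avg_bytes=32 * MB, target_bytes=64 * MB):
    """Return the number of merged output files, or None if no merge should run.

    Hypothetical helper, not part of Hive. The merge runs only when the
    average file size is below trigger_avg_bytes; the output is then sized
    so that each merged file holds roughly target_bytes.
    """
    if not file_sizes:
        return None
    total = sum(file_sizes)
    average = total / len(file_sizes)
    if average >= trigger_avg_bytes:
        # Files are already large enough on average; skip the merge job.
        return None
    # At least one output file, otherwise one per target_bytes of data.
    return max(1, math.ceil(total / target_bytes))


many_tiny = plan_merge([1024] * 5000)      # 5000 files of 1KB each
already_big = plan_merge([40 * MB] * 10)   # average is above the trigger
medium = plan_merge([16 * MB] * 40)        # 640MB in 16MB files
```

Under these assumptions, 5000 files of 1KB collapse into a single file, ten 40MB files are left alone, and forty 16MB files are merged into ten files of about 64MB.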

Thoughts?

> Hive merge map files should have different bytes/mapper setting
> ---------------------------------------------------------------
>
>                 Key: HIVE-1118
>                 URL: https://issues.apache.org/jira/browse/HIVE-1118
>             Project: Hadoop Hive
>          Issue Type: Improvement
>            Reporter: Zheng Shao
>
> Currently, by default, we get one reducer for each 1GB of input data.
> It's also true for the conditional merge job that will run if the average file size is smaller than a threshold.
> This actually makes those jobs very slow, because each reducer needs to consume 1GB of data.
> Alternatively, we can just use that threshold to determine the number of reducers per job (or introduce a new parameter).
> Let's say the threshold is 1MB, then we only start the merge job if the average file size is less than 1MB, and the eventual result file size will be around 1MB (or another small number).
> This will remove the extreme cases where we have thousands of empty files, but still make normal jobs fast enough.
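
The reducer arithmetic in the description above can be sketched as follows. This is only an illustration in Python, assuming the reducer count is the input size divided by a bytes-per-reducer setting and rounded up; reducers_needed is a hypothetical name, and the 100GB input is an invented example, not a measurement.

```python
import math

MB = 1024 * 1024
GB = 1024 * MB


def reducers_needed(total_input_bytes, bytes_per_reducer):
    """Hypothetical helper: one reducer per bytes_per_reducer of input, at least one."""
    return max(1, math.ceil(total_input_bytes / bytes_per_reducer))


total_input = 100 * GB  # invented example input size

# Default described in the issue: one reducer per 1GB of input.
with_default = reducers_needed(total_input, 1 * GB)

# Using a smaller per-reducer size for the merge job, e.g. 64MB.
with_threshold = reducers_needed(total_input, 64 * MB)
```

With these numbers the default gives 100 reducers that each read 1GB, while a 64MB setting gives 1600 reducers that each read far less, which is the reason the merge job is expected to finish faster.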

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

