hive-dev mailing list archives

From "Xuefu Zhang (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HIVE-6134) Merging small files based on file size only works for CTAS queries
Date Fri, 03 Jan 2014 19:49:50 GMT

    [ https://issues.apache.org/jira/browse/HIVE-6134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13861846#comment-13861846 ]

Xuefu Zhang commented on HIVE-6134:
-----------------------------------

It seems reasonable to me that these flags kick in only for CTAS, or other queries that result
in a new table. In other words, merging small files for a table should be applied to the table
(upon request) rather than taking effect for any query that touches the table. I think what
is missing is a new command/query, something like "MERGE FILES FOR TABLE table_name". This
might be further automated on a schedule in HiveServer2. Of course, the scope of that is
much larger.

> Merging small files based on file size only works for CTAS queries
> ------------------------------------------------------------------
>
>                 Key: HIVE-6134
>                 URL: https://issues.apache.org/jira/browse/HIVE-6134
>             Project: Hive
>          Issue Type: Bug
>    Affects Versions: 0.8.0, 0.10.0, 0.11.0, 0.12.0
>            Reporter: Eric Chu
>
> According to the documentation, if we set hive.merge.mapfiles to true, Hive will launch
> an additional MR job to merge the small output files at the end of a map-only job when the
> average output file size is smaller than hive.merge.smallfiles.avgsize. Similarly, by setting
> hive.merge.mapredfiles to true, Hive will merge the output files of a map-reduce job.
> My expectation is that this is true for all MR queries. However, my observation is that
> this is only true for CTAS queries. In GenMRFileSink1.java, HIVEMERGEMAPFILES and HIVEMERGEMAPREDFILES
> are only used if ((ctx.getMvTask() != null) && (!ctx.getMvTask().isEmpty())). So,
> for a regular SELECT query that doesn't have move tasks, these properties are not used.
> Is my understanding correct, and if so, what's the reasoning behind not supporting
> this for regular SELECT queries? It seems to me that this should be supported for regular
> SELECT queries as well. One scenario where this hits us hard is when users try to download
> the result in HUE, and HUE times out because there are thousands of output files. The workaround
> is to re-run the query as CTAS, but it's a significant time sink.
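
To make the reported behavior concrete, here is a minimal Java sketch of the decision as the description states it. It is a simplified model, not Hive's actual code: the class name MergeDecisionSketch, the method wouldMerge, and its parameters are hypothetical, and moveTaskCount merely stands in for the size of ctx.getMvTask(). The 16,000,000-byte threshold mentioned afterwards is only an illustrative value for hive.merge.smallfiles.avgsize.

```java
import java.util.List;

/** Simplified, hypothetical model of the merge decision described in the report. */
final class MergeDecisionSketch {

    private MergeDecisionSketch() {}

    /**
     * @param mergeEnabled     value of hive.merge.mapfiles (map-only job) or
     *                         hive.merge.mapredfiles (map-reduce job)
     * @param moveTaskCount    stand-in for ctx.getMvTask().size(); per the report,
     *                         this is 0 for a regular SELECT
     * @param outputFileSizes  sizes of the job's output files, in bytes
     * @param avgSizeThreshold value of hive.merge.smallfiles.avgsize, in bytes
     * @return true if, under this model, the extra merge job would be launched
     */
    static boolean wouldMerge(boolean mergeEnabled,
                              int moveTaskCount,
                              List<Long> outputFileSizes,
                              long avgSizeThreshold) {
        // Reported behavior: the merge flags are consulted only when move tasks exist.
        if (moveTaskCount == 0) {
            return false;
        }
        if (!mergeEnabled || outputFileSizes.isEmpty()) {
            return false;
        }
        long totalBytes = 0;
        for (long size : outputFileSizes) {
            totalBytes += size;
        }
        long averageBytes = totalBytes / outputFileSizes.size();
        // Merge only when the average output file is smaller than the threshold.
        return averageBytes < avgSizeThreshold;
    }
}
```

Under this model, a query with move tasks (such as CTAS) that writes thousands of small files and has the flag set would trigger the merge, while the same query run as a plain SELECT (moveTaskCount of 0) would not, regardless of the flag. That matches the observation above, though the real check in GenMRFileSink1.java may involve conditions not captured here.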



