hive-dev mailing list archives

From "Eric Chu (JIRA)" <>
Subject [jira] [Commented] (HIVE-6134) Merging small files based on file size only works for CTAS queries
Date Sun, 05 Jan 2014 01:28:50 GMT


Eric Chu commented on HIVE-6134:

We notice that the problem occurs when a query results in too many
files; however, this happens b/c the table has too many (but not necessarily small) files.
Most of the queries that have this problem are regular SELECT FROM WHERE queries (no GROUP
BY) that don't have reducers. Some of our tables have hundreds of GBs per partition; the biggest
one has TBs of data per partition. It's not uncommon to see queries with thousands or tens
of thousands of mappers, but no reducers. 
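
To make the scale concrete, here is a rough back-of-the-envelope sketch. It assumes one mapper per input split and one output file per mapper in a map-only job, and it assumes a 128 MiB split size; neither figure comes from our cluster configuration, and the class and method names are purely illustrative.

```java
// Rough estimate only: assumes one mapper per input split and one output
// file per mapper (map-only job). Names and sizes here are hypothetical.
class MapOnlyFileCountSketch {

    // Ceiling division: number of splits needed to cover inputBytes.
    static long estimateOutputFiles(long inputBytes, long splitBytes) {
        if (inputBytes < 0 || splitBytes <= 0) {
            throw new IllegalArgumentException("sizes must be positive");
        }
        return (inputBytes + splitBytes - 1) / splitBytes;
    }

    public static void main(String[] args) {
        long mib = 1024L * 1024L;
        long gib = 1024L * mib;
        long tib = 1024L * gib;
        long split = 128L * mib; // assumed split size

        // A partition of a few hundred GiB already yields thousands of files.
        System.out.println(estimateOutputFiles(500L * gib, split)); // 4000
        // A 1 TiB partition yields 8192 files under the same assumptions.
        System.out.println(estimateOutputFiles(tib, split)); // 8192
    }
}
```

Under these assumptions the output file count tracks the input size, not the size of the individual output files, which matches what we see.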

We are looking at other ways to mitigate this problem. What you suggest - merging files in
a partition - is certainly something we are considering. Meanwhile, I want to consider supporting
these properties for queries without a move task. Specifically, what are the reasons that
we didn't support these properties for queries without move tasks? And if we want to do
so, what considerations should we make? We'd be willing to work on this, but we probably will
need some guidance from domain experts. Thanks!

> Merging small files based on file size only works for CTAS queries
> ------------------------------------------------------------------
>                 Key: HIVE-6134
>                 URL:
>             Project: Hive
>          Issue Type: Bug
>    Affects Versions: 0.8.0, 0.10.0, 0.11.0, 0.12.0
>            Reporter: Eric Chu
> According to the documentation, if we set hive.merge.mapfiles to true, Hive will launch
an additional MR job to merge the small output files at the end of a map-only job when the
average output file size is smaller than hive.merge.smallfiles.avgsize. Similarly, by setting
hive.merge.mapredfiles to true, Hive will merge the output files of a map-reduce job. 
> My expectation is that this is true for all MR queries. However, my observation is that
these properties are only used if ((ctx.getMvTask() != null) && (!ctx.getMvTask().isEmpty())). So,
for a regular SELECT query that doesn't have move tasks, these properties are not used.
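
The behavior described above can be sketched as a small standalone model. This is not Hive's actual code: the class, the method, and the use of a plain list in place of the move tasks returned by ctx.getMvTask() are all hypothetical, and the 16,000,000-byte threshold is only an assumed value for hive.merge.smallfiles.avgsize.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Simplified model of the observed behavior; NOT Hive's implementation.
class MergeDecisionSketch {

    // moveTasks stands in for ctx.getMvTask(); it may be null or empty.
    static boolean shouldMerge(boolean mergeEnabled, List<String> moveTasks,
                               long totalOutputBytes, int outputFileCount,
                               long avgSizeThreshold) {
        if (!mergeEnabled) {
            return false; // hive.merge.mapfiles / hive.merge.mapredfiles is off
        }
        // The guard from the issue: without move tasks, merging is never considered.
        if (moveTasks == null || moveTasks.isEmpty()) {
            return false;
        }
        if (outputFileCount <= 0) {
            return false; // nothing to merge
        }
        long averageSize = totalOutputBytes / outputFileCount;
        return averageSize < avgSizeThreshold;
    }

    public static void main(String[] args) {
        long threshold = 16_000_000L; // assumed hive.merge.smallfiles.avgsize
        long totalBytes = 5_000_000_000L;
        int files = 5_000; // average output file size: 1,000,000 bytes

        // CTAS-like query: a move task exists, so small outputs get merged.
        System.out.println(shouldMerge(true, Arrays.asList("moveTask"),
                totalBytes, files, threshold)); // true
        // Plain SELECT: same output, but no move task, so no merge.
        System.out.println(shouldMerge(true, Collections.<String>emptyList(),
                totalBytes, files, threshold)); // false
    }
}
```

In this model the two calls differ only in whether a move task is present, which is the difference between the CTAS case and the regular SELECT case.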
> Is my understanding correct and if so, what's the reasoning behind the logic of not supporting
this for regular SELECT queries? It seems to me that this should be supported for regular
SELECT queries as well. One scenario where this hits us hard is when users try to download
the result in HUE, and HUE times out b/c there are thousands of output files. The workaround
is to re-run the query as CTAS, but it's a significant time sink.

This message was sent by Atlassian JIRA