hive-dev mailing list archives

From "Eric Chu (JIRA)" <>
Subject [jira] [Commented] (HIVE-6134) Merging small files based on file size only works for CTAS queries
Date Mon, 07 Apr 2014 22:26:16 GMT


Eric Chu commented on HIVE-6134:

Hi [~xuefuz] and [~ashutoshc], it turns out this issue affects not only Hue but also the Hive
CLI: results don't show up in the CLI until more than a minute has passed, with timeout
errors for connections to nodes.

I'm trying to make the change myself in to support a new property that
when it's turned on, Hive will merge files for a regular (i.e., without mvTask), map-only
job that uses more than X mappers (another property). I'm wondering if and how we could find
out the number of mappers that will be used for that job when we are at that stage of the
optimization. I want to set chDir to true when this number is greater than some threshold
set via a new property.  I notice that currWork.getMapWork().getNumMapTasks() actually returns
null. Can you give me some pointers?
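
To make the idea concrete, here is a minimal sketch of the check I have in mind. Everything in
it is hypothetical: the class, the method, and the two parameters standing in for the new
properties are illustrative names, not existing Hive code. It assumes the mapper count arrives
as a nullable Integer and treats a null count as "don't merge".

```java
/** Hypothetical sketch of the proposed decision; none of these names exist in Hive. */
final class MergeDecisionSketch {

    /**
     * Decide whether to merge the output of a map-only job that has no move task.
     *
     * @param mergeEnabled    value of the proposed on/off property (hypothetical)
     * @param mapperThreshold value of the proposed "more than X mappers" property (hypothetical)
     * @param numMapTasks     estimated mapper count, or null if unknown at this stage
     */
    static boolean shouldMerge(boolean mergeEnabled, int mapperThreshold, Integer numMapTasks) {
        if (!mergeEnabled) {
            return false;
        }
        if (numMapTasks == null) {
            // The count is not available at this point, so stay conservative.
            return false;
        }
        return numMapTasks > mapperThreshold;
    }

    public static void main(String[] args) {
        System.out.println(shouldMerge(true, 100, 5000));
        System.out.println(shouldMerge(true, 100, null));
    }
}
```

The null branch is exactly where I'm stuck: without a usable mapper count at this stage, the
threshold can never trigger.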

> Merging small files based on file size only works for CTAS queries
> ------------------------------------------------------------------
>                 Key: HIVE-6134
>                 URL:
>             Project: Hive
>          Issue Type: Bug
>    Affects Versions: 0.8.0, 0.10.0, 0.11.0, 0.12.0
>            Reporter: Eric Chu
> According to the documentation, if we set hive.merge.mapfiles to true, Hive will launch
an additional MR job to merge the small output files at the end of a map-only job when the
average output file size is smaller than hive.merge.smallfiles.avgsize. Similarly, by setting
hive.merge.mapredfiles to true, Hive will merge the output files of a map-reduce job. 
> My expectation is that this is true for all MR queries. However, my observation is that
are only used if ((ctx.getMvTask() != null) && (!ctx.getMvTask().isEmpty())). So,
for a regular SELECT query that doesn't have move tasks, these properties are not used.
> Is my understanding correct and if so, what's the reasoning behind the logic of not supporting
this for regular SELECT queries? It seems to me that this should be supported for regular
SELECT queries as well. One scenario where this hits us hard is when users try to download
the result in HUE, and HUE times out b/c there are thousands of output files. The workaround
is to re-run the query as CTAS, but it's a significant time sink.
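
> As an illustration of the documented behavior only, the size check described above can be
sketched as follows. This is not Hive's actual code: the class and method names are
hypothetical, and the threshold parameter simply stands in for the value of
hive.merge.smallfiles.avgsize.

```java
/** Hypothetical sketch of the documented average-size check; not Hive's implementation. */
final class AvgSizeMergeSketch {

    /**
     * Returns true when the average output file size is below the threshold,
     * i.e. when the documentation says an extra merge job would be launched.
     *
     * @param fileSizes        sizes of the job's output files, in bytes
     * @param avgSizeThreshold stand-in for hive.merge.smallfiles.avgsize, in bytes
     */
    static boolean belowAvgSize(long[] fileSizes, long avgSizeThreshold) {
        if (fileSizes.length == 0) {
            return false; // nothing to merge
        }
        long total = 0;
        for (long size : fileSizes) {
            total += size;
        }
        return total / fileSizes.length < avgSizeThreshold;
    }

    public static void main(String[] args) {
        long[] manySmallFiles = {1_000L, 2_000L, 3_000L};
        System.out.println(belowAvgSize(manySmallFiles, 16_000_000L));
    }
}
```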

This message was sent by Atlassian JIRA
