Return-Path:
X-Original-To: apmail-hive-dev-archive@www.apache.org
Delivered-To: apmail-hive-dev-archive@www.apache.org
Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id 3BECC10D33 for ; Mon, 6 Jan 2014 19:24:54 +0000 (UTC)
Received: (qmail 10312 invoked by uid 500); 6 Jan 2014 19:24:52 -0000
Delivered-To: apmail-hive-dev-archive@hive.apache.org
Received: (qmail 10233 invoked by uid 500); 6 Jan 2014 19:24:51 -0000
Mailing-List: contact dev-help@hive.apache.org; run by ezmlm
Precedence: bulk
List-Help:
List-Unsubscribe:
List-Post:
List-Id:
Reply-To: dev@hive.apache.org
Delivered-To: mailing list dev@hive.apache.org
Received: (qmail 10179 invoked by uid 500); 6 Jan 2014 19:24:51 -0000
Delivered-To: apmail-hadoop-hive-dev@hadoop.apache.org
Received: (qmail 10159 invoked by uid 99); 6 Jan 2014 19:24:51 -0000
Received: from arcas.apache.org (HELO arcas.apache.org) (140.211.11.28) by apache.org (qpsmtpd/0.29) with ESMTP; Mon, 06 Jan 2014 19:24:51 +0000
Date: Mon, 6 Jan 2014 19:24:51 +0000 (UTC)
From: "Eric Chu (JIRA)"
To: hive-dev@hadoop.apache.org
Message-ID:
In-Reply-To:
References:
Subject: [jira] [Commented] (HIVE-6134) Merging small files based on file size only works for CTAS queries
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
X-JIRA-FingerPrint: 30527f35849b9dde25b450d4833f0394

    [ https://issues.apache.org/jira/browse/HIVE-6134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13863292#comment-13863292 ]

Eric Chu commented on HIVE-6134:
--------------------------------

Thanks [~ashutoshc] for pointing out the concatenate command. However, I think the ability to merge files for a table partition is orthogonal to supporting hive.merge.mapfiles, hive.merge.mapredfiles, and hive.merge.smallfiles.avgsize for "regular" queries (i.e., that don't result in a new table).
Even if each partition has the optimal number of input files, queries over a large number of partitions with just SELECT FROM WHERE clauses will produce a large number of small output files. That has negative side effects: Hue times out, the next job gets a large number of mappers, etc.

Can someone explain why the properties are supported only for queries with move tasks? Was it just a matter of scoping, or is there some reason that makes this inappropriate for queries without a move task? We are considering adding this support on our own and would like to get some insights on the original design considerations.

Thanks!

> Merging small files based on file size only works for CTAS queries
> ------------------------------------------------------------------
>
>                 Key: HIVE-6134
>                 URL: https://issues.apache.org/jira/browse/HIVE-6134
>             Project: Hive
>          Issue Type: Bug
>    Affects Versions: 0.8.0, 0.10.0, 0.11.0, 0.12.0
>            Reporter: Eric Chu
>
> According to the documentation, if we set hive.merge.mapfiles to true, Hive will launch an additional MR job to merge the small output files at the end of a map-only job when the average output file size is smaller than hive.merge.smallfiles.avgsize. Similarly, by setting hive.merge.mapredfiles to true, Hive will merge the output files of a map-reduce job.
> My expectation is that this is true for all MR queries. However, my observation is that this is only true for CTAS queries. In GenMRFileSink1.java, HIVEMERGEMAPFILES and HIVEMERGEMAPREDFILES are only used if ((ctx.getMvTask() != null) && (!ctx.getMvTask().isEmpty())). So, for a regular SELECT query that doesn't have move tasks, these properties are not used.
> Is my understanding correct, and if so, what's the reasoning behind not supporting this for regular SELECT queries? It seems to me that this should be supported for regular SELECT queries as well.
> One scenario where this hits us hard is when users try to download the result in HUE, and HUE times out b/c there are thousands of output files. The workaround is to re-run the query as CTAS, but it's a significant time sink.

--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
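
The behavior reported in the issue can be sketched roughly as follows. This is a minimal Java illustration under stated assumptions, not Hive's code: the class `MergeDecisionSketch`, the method `wouldMerge`, the sample file sizes, and the threshold value are all hypothetical. Only the two conditions come from the report: the merge settings are consulted only when the query has move tasks, and a merge is triggered when the average output file size is below hive.merge.smallfiles.avgsize.

```java
// Hypothetical sketch of the merge decision described in HIVE-6134.
// None of these names exist in Hive; they are for illustration only.
final class MergeDecisionSketch {

    // Average size of the output files in bytes; 0 when there are no files.
    static long averageSize(long[] fileSizes) {
        if (fileSizes.length == 0) {
            return 0L;
        }
        long total = 0L;
        for (long size : fileSizes) {
            total += size;
        }
        return total / fileSizes.length;
    }

    // mergeFlag        stands in for hive.merge.mapfiles / hive.merge.mapredfiles
    // hasMoveTask      stands in for the check that ctx.getMvTask() is non-null and non-empty
    // avgSizeThreshold stands in for hive.merge.smallfiles.avgsize
    static boolean wouldMerge(boolean mergeFlag, boolean hasMoveTask,
                              long[] fileSizes, long avgSizeThreshold) {
        if (!mergeFlag) {
            return false;
        }
        if (!hasMoveTask) {
            // The gate reported in the issue: without a move task the
            // merge settings are never consulted.
            return false;
        }
        if (fileSizes.length == 0) {
            return false;
        }
        return averageSize(fileSizes) < avgSizeThreshold;
    }

    public static void main(String[] args) {
        long[] smallFiles = {1_000_000L, 2_000_000L, 3_000_000L}; // sample sizes in bytes
        long threshold = 16_000_000L;                             // illustrative threshold

        // A query with a move task (such as CTAS): average 2,000,000 < threshold.
        System.out.println(wouldMerge(true, true, smallFiles, threshold));  // prints "true"

        // A plain SELECT with no move task: same files, no merge.
        System.out.println(wouldMerge(true, false, smallFiles, threshold)); // prints "false"
    }
}
```

In this sketch, the change asked about in the comment would amount to removing the `hasMoveTask` gate. Whether that is safe in Hive itself is exactly the open design question raised in the thread.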