Return-Path: X-Original-To: apmail-hive-user-archive@www.apache.org Delivered-To: apmail-hive-user-archive@www.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id 6D905106E7 for ; Thu, 21 Nov 2013 18:56:06 +0000 (UTC) Received: (qmail 27346 invoked by uid 500); 21 Nov 2013 18:56:04 -0000 Delivered-To: apmail-hive-user-archive@hive.apache.org Received: (qmail 27281 invoked by uid 500); 21 Nov 2013 18:56:04 -0000 Mailing-List: contact user-help@hive.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: user@hive.apache.org Delivered-To: mailing list user@hive.apache.org Received: (qmail 27273 invoked by uid 99); 21 Nov 2013 18:56:04 -0000 Received: from nike.apache.org (HELO nike.apache.org) (192.87.106.230) by apache.org (qpsmtpd/0.29) with ESMTP; Thu, 21 Nov 2013 18:56:04 +0000 X-ASF-Spam-Status: No, hits=1.7 required=5.0 tests=FREEMAIL_ENVFROM_END_DIGIT,HTML_MESSAGE,RCVD_IN_DNSWL_LOW,SPF_PASS X-Spam-Check-By: apache.org Received-SPF: pass (nike.apache.org: domain of timrobertson100@gmail.com designates 209.85.215.174 as permitted sender) Received: from [209.85.215.174] (HELO mail-ea0-f174.google.com) (209.85.215.174) by apache.org (qpsmtpd/0.29) with ESMTP; Thu, 21 Nov 2013 18:55:57 +0000 Received: by mail-ea0-f174.google.com with SMTP id b10so63793eae.33 for ; Thu, 21 Nov 2013 10:55:37 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=subject:references:from:content-type:in-reply-to:message-id:date:to :content-transfer-encoding:mime-version; bh=TenrdGly+JAb77SPKDnhmZf6wUr0yLljw3jo7MKAGK0=; b=xLDC1vL+StBorgbn0UkGP7msnhJDx6x6wSkCm3wAMt7vYTh9maJS9EiGqTKMJXvLdo dw84ly6rSg6mO5V6tmTMV7kkPo9S2RFRMtFYCXWU6Db+XBOFBRbkUovjsAs3kINfvt+A ZjTWJDsQyq79ygLgSXCgJp/x4SvURqGyHmxsqrAO73nBJy3NrrTfQLwe/d3rZFXm1MZ7 fdaTI+ETC8BHPiI5mAnHoofDllp7T0naB/W//RXa6qrJwcVy96d3SWKG9bLzP0ulSnqj XPkvRhwM6GRUJ0dh5A4s37ZNIi9I6tzARhVi2UGBimmuRr9S4ua3HzHQ8S6/+/P4f72B emLA== X-Received: by 10.15.26.131 with SMTP id n3mr10555271eeu.21.1385060137341; Thu, 21 Nov 2013 10:55:37 -0800 (PST) Received: from [10.0.1.6] (ip1.c3191.amb314.cust.comxnet.dk. [87.73.80.154]) by mx.google.com with ESMTPSA id g8sm72933494eep.20.2013.11.21.10.55.35 for (version=TLSv1 cipher=ECDHE-RSA-RC4-SHA bits=128/128); Thu, 21 Nov 2013 10:55:35 -0800 (PST) Subject: Re: query resulting in many small output files causes timeout error in Hue References: From: Tim Content-Type: multipart/alternative; boundary=Apple-Mail-89558899-789F-4E78-9449-711D47A0B819 X-Mailer: iPhone Mail (11B554a) In-Reply-To: Message-Id: <91F0558C-FF17-49B5-A678-701CE1562C73@gmail.com> Date: Thu, 21 Nov 2013 19:55:35 +0100 To: "user@hive.apache.org" Content-Transfer-Encoding: 7bit Mime-Version: 1.0 (1.0) X-Virus-Checked: Checked by ClamAV on apache.org --Apple-Mail-89558899-789F-4E78-9449-711D47A0B819 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: quoted-printable Or setting reducers to 1 and doing a GROUP BY all columns forces a single fi= le too. Tim, Sent from my iPhone (which makes terrible auto-correct spelling mistakes) > On 21 Nov 2013, at 18:27, Eric Chu wrote: >=20 > Hi, >=20 > We often have map-only queries that result in a large number of small outp= ut files (in the thousands). Although this doesn't affect CLI, when users tr= y to view/download the query result in Hue, Hue would time out in trying to r= ead all these small files. We tried to set the following properties that sup= posedly will make Hive launch an extra MR job to merge these files when the a= verage file size is smaller than some threshold, but it's not working: > hive.merge.mapfiles =3D true > hive.merge.mapredfiles =3D true > hive.merge.smallfiles.avgsize =3D 32000000 (Default is 16000000) > In Hive 10, we used to have hive.mergejob.maponly set to true, but this pr= operty does not exist in Hive 11 and 12. What's the story behind this? > For example, in the following select-from-where query on a partitioned tab= le in RCFile, there would be two root stages - one doing a scan with filter a= nd the other doing a fetch. >=20 > Query: >=20 > select data_date as date, ID, if(col_10=3D1, "yes","no") as answer > from table_1 > where arr[4] <> "0" > and lookup("table_1", x,"action_id")=3D20519251 > and data_date>=3D20131014 >=20 > Query Plan: >=20 > STAGE DEPENDENCIES: > Stage-1 is a root stage > Stage-0 is a root stage >=20 > STAGE PLANS: > Stage: Stage-1 > Map Reduce > Alias -> Map Operator Tree: > table_1 > TableScan > alias: table_1 > Filter Operator > predicate: > expr: ((arr[4] <> '0') and (dim_lookup('table_1', x, 'ac= tion_id') =3D 20519251)) > type: boolean > Select Operator > expressions: > expr: data_date > type: string > expr: ID > type: string > expr: if((col_10=3D 1), 'yes', 'no') > type: string > outputColumnNames: _col0, _col1, _col2 > File Output Operator > compressed: true > GlobalTableId: 0 > table: > input format: org.apache.hadoop.mapred.TextInputForm= at > output format: org.apache.hadoop.hive.ql.io.HiveIgno= reKeyTextOutputFormat >=20 > Stage: Stage-0 > Fetch Operator > limit: -1 >=20 > The query leads to 6253 output files, and the total size is 86427 bytes. M= any of the files have 8 bytes and the ones that have more than 8 bytes usual= ly have ~30 bytes. With the aforementioned settings, I'd expect an extra MR j= ob to merge the files, but that didn't happen.=20 >=20 > If anyone has some insights please let me know. >=20 > Thanks, >=20 > Eric --Apple-Mail-89558899-789F-4E78-9449-711D47A0B819 Content-Type: text/html; charset=utf-8 Content-Transfer-Encoding: 7bit
Or setting reducers to 1 and doing a GROUP BY all columns forces a single file too.

Tim,
Sent from my iPhone (which makes terrible auto-correct spelling mistakes)

On 21 Nov 2013, at 18:27, Eric Chu <echu@rocketfuel.com> wrote:

Hi,

We often have map-only queries that result in a large number of small output files (in the thousands). Although this doesn't affect CLI, when users try to view/download the query result in Hue, Hue would time out in trying to read all these small files. We tried to set the following properties that supposedly will make Hive launch an extra MR job to merge these files when the average file size is smaller than some threshold, but it's not working:
  1. hive.merge.mapfiles = true
  2. hive.merge.mapredfiles = true
  3. hive.merge.smallfiles.avgsize = 32000000 (Default is 16000000)
  4. In Hive 10, we used to have hive.mergejob.maponly set to true, but this property does not exist in Hive 11 and 12. What's the story behind this?

For example, in the following select-from-where query on a partitioned table in RCFile, there would be two root stages - one doing a scan with filter and the other doing a fetch.

Query:

select data_date as date, ID, if(col_10=1, "yes","no") as answer
from table_1
where arr[4] <> "0"
and lookup("table_1", x,"action_id")=20519251
and data_date>=20131014

Query Plan:

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Alias -> Map Operator Tree:
        table_1
          TableScan
            alias: table_1
            Filter Operator
              predicate:
                  expr: ((arr[4] <> '0') and (dim_lookup('table_1', x, 'action_id') = 20519251))
                  type: boolean
              Select Operator
                expressions:
                      expr: data_date
                      type: string
                      expr: ID
                      type: string
                      expr: if((col_10= 1), 'yes', 'no')
                      type: string
                outputColumnNames: _col0, _col1, _col2
                File Output Operator
                  compressed: true
                  GlobalTableId: 0
                  table:
                      input format: org.apache.hadoop.mapred.TextInputFormat
                      output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

  Stage: Stage-0
    Fetch Operator
      limit: -1

The query leads to 6253 output files, and the total size is 86427 bytes. Many of the files have 8 bytes and the ones that have more than 8 bytes usually have ~30 bytes. With the aforementioned settings, I'd expect an extra MR job to merge the files, but that didn't happen.

If anyone has some insights please let me know.

Thanks,

Eric
--Apple-Mail-89558899-789F-4E78-9449-711D47A0B819--