Mailing-List: contact user-help@hive.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@hive.apache.org
Received-SPF: pass (nike.apache.org: domain of timrobertson100@gmail.com
 designates 209.85.215.174 as permitted sender)
Subject: Re: query resulting in many small output files causes timeout error
 in Hue
References: 
 <CAHEx3F_Dew9Fu2L+rMeyxieCP5v9BRdQ8QJS+uDiXhDCDwaLvg@mail.gmail.com>
From: Tim <timrobertson100@gmail.com>
Content-Type: multipart/alternative;
	boundary=Apple-Mail-89558899-789F-4E78-9449-711D47A0B819
In-Reply-To: 
 <CAHEx3F_Dew9Fu2L+rMeyxieCP5v9BRdQ8QJS+uDiXhDCDwaLvg@mail.gmail.com>
Message-Id: <91F0558C-FF17-49B5-A678-701CE1562C73@gmail.com>
Date: Thu, 21 Nov 2013 19:55:35 +0100
To: "user@hive.apache.org" <user@hive.apache.org>
Content-Transfer-Encoding: 7bit
Mime-Version: 1.0 (1.0)


--Apple-Mail-89558899-789F-4E78-9449-711D47A0B819
Content-Type: text/plain;
	charset=us-ascii
Content-Transfer-Encoding: quoted-printable

Alternatively, setting the number of reducers to 1 and adding a GROUP BY =
on all columns also forces a single output file.
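
As a rough sketch of that approach applied to your query (untested, and
assuming mapred.reduce.tasks is the property that controls the reducer
count on your Hadoop version):

```sql
-- Assumption: mapred.reduce.tasks sets the number of reducers here.
SET mapred.reduce.tasks=3D1;

-- Grouping by every selected expression adds a reduce phase, so the
-- single reducer writes a single output file.
SELECT data_date AS date, ID, if(col_10 =3D 1, "yes", "no") AS answer
FROM table_1
WHERE arr[4] <> "0"
  AND lookup("table_1", x, "action_id") =3D 20519251
  AND data_date >=3D 20131014
GROUP BY data_date, ID, if(col_10 =3D 1, "yes", "no");
```

Note that GROUP BY also collapses duplicate rows, so the result only
matches the original query if its output rows are already unique.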

Tim

> On 21 Nov 2013, at 18:27, Eric Chu <echu@rocketfuel.com> wrote:
>=20
> Hi,
>=20
> We often have map-only queries that produce a large number of small =
output files (in the thousands). Although this doesn't affect the CLI, =
when users try to view or download the query result in Hue, Hue times =
out trying to read all these small files. We tried setting the =
following properties, which are supposed to make Hive launch an extra =
MR job to merge these files when the average file size is below a =
threshold, but it's not working:
> hive.merge.mapfiles =3D true
> hive.merge.mapredfiles =3D true
> hive.merge.smallfiles.avgsize =3D 32000000 (Default is 16000000)
> In Hive 0.10, we used to set hive.mergejob.maponly to true, but =
this property does not exist in Hive 0.11 and 0.12. What's the story =
behind this?
> For example, the following select-from-where query on a partitioned =
RCFile table has two root stages: one doing a scan with a filter and =
the other doing a fetch.
>=20
> Query:
>=20
> select data_date as date, ID, if(col_10=3D1, "yes","no") as answer
> from table_1
> where arr[4] <> "0"
> and lookup("table_1", x,"action_id")=3D20519251
> and data_date>=3D20131014
>=20
> Query Plan:
>=20
> STAGE DEPENDENCIES:
>   Stage-1 is a root stage
>   Stage-0 is a root stage
>=20
> STAGE PLANS:
>   Stage: Stage-1
>     Map Reduce
>       Alias -> Map Operator Tree:
>         table_1
>           TableScan
>             alias: table_1
>             Filter Operator
>               predicate:
>                   expr: ((arr[4] <> '0') and (dim_lookup('table_1', x, 'ac=
tion_id') =3D 20519251))
>                   type: boolean
>               Select Operator
>                 expressions:
>                       expr: data_date
>                       type: string
>                       expr: ID
>                       type: string
>                       expr: if((col_10=3D 1), 'yes', 'no')
>                       type: string
>                 outputColumnNames: _col0, _col1, _col2
>                 File Output Operator
>                   compressed: true
>                   GlobalTableId: 0
>                   table:
>                       input format: org.apache.hadoop.mapred.TextInputForm=
at
>                       output format: org.apache.hadoop.hive.ql.io.HiveIgno=
reKeyTextOutputFormat
>=20
>   Stage: Stage-0
>     Fetch Operator
>       limit: -1
>=20
> The query produces 6253 output files with a total size of 86427 bytes. =
Many of the files are 8 bytes, and those larger than 8 bytes are usually =
~30 bytes. With the settings above, I'd expect an extra MR job to merge =
the files, but that didn't happen.
>=20
> If anyone has any insight, please let me know.
>=20
> Thanks,
>=20
> Eric

--Apple-Mail-89558899-789F-4E78-9449-711D47A0B819--