From: Ashish Thusoo <athusoo@facebook.com>
To: hive-user@hadoop.apache.org
Date: Fri, 11 Jun 2010 16:09:49 -0700
Subject: RE: Dealing with large number of partitions

+1 to that. That should help provided you are running Hadoop 0.20.

Ashish

________________________________
From: wd [mailto:wd@wdicc.com]
Sent: Thursday, June 10, 2010 11:36 PM
To: hive-user@hadoop.apache.org
Subject: Re: Dealing with large number of partitions

Try "set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;" before you run your query; this may help.

2010/6/11 Sammy Yu <syu@brightedge.com>:

Hi,
   I am having an issue with a large number of partitions (about 4,000, each made up of very small files, <10k). Any queries that involve these partitions take an extremely long time to complete (10+ hours), and I was wondering whether there is any easy way in Hive to improve performance without having to merge the files. I can see the map-reduce jobs are taking a long time because there are so many separate raw data files to read. I saw that HIVE-1332 dealt with using HAR files for partitioning. Could this help performance rather than hurt it, given that the queries will be using all the partitions in the HAR file?

Thanks,
Sammy
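
For reference, wd's suggestion in session form. This is a minimal sketch: the table name page_views and partition column dt are hypothetical stand-ins, and the split-size setting is an optional addition beyond what wd mentioned.

    -- Combine many small files per partition into fewer map splits.
    set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
    -- Optional (assumes your Hadoop version honors it): cap how many
    -- bytes may be packed into one combined split, here 256 MB.
    set mapred.max.split.size=268435456;

    -- Hypothetical query over a partitioned table; with the setting
    -- above, Hive builds one map task per combined split instead of
    -- (roughly) one per small file.
    SELECT dt, COUNT(*)
    FROM page_views
    WHERE dt >= '2010-05-01'
    GROUP BY dt;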
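On Sammy's HAR question: HIVE-1332 added per-partition archiving rather than a new partitioning scheme. Below is a sketch of the commands, assuming a Hive build that ships the feature; the table and partition spec are again hypothetical. Archiving packs a partition's many small files into one HAR, which eases NameNode pressure, but HAR reads go through an extra layer of indirection, so it is not guaranteed to make the queries themselves faster.

    -- Enable the archiving feature (off by default).
    set hive.archive.enabled=true;

    -- Pack one partition's files into a single HAR; the partition
    -- stays queryable in place.
    ALTER TABLE page_views ARCHIVE PARTITION (dt='2010-06-01');

    -- Reverse it if query latency regresses.
    ALTER TABLE page_views UNARCHIVE PARTITION (dt='2010-06-01');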