Return-Path: X-Original-To: apmail-hive-user-archive@www.apache.org Delivered-To: apmail-hive-user-archive@www.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id 0DFA1FB51 for ; Fri, 5 Apr 2013 18:36:35 +0000 (UTC) Received: (qmail 31000 invoked by uid 500); 5 Apr 2013 18:36:33 -0000 Delivered-To: apmail-hive-user-archive@hive.apache.org Received: (qmail 30945 invoked by uid 500); 5 Apr 2013 18:36:33 -0000 Mailing-List: contact user-help@hive.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: user@hive.apache.org Delivered-To: mailing list user@hive.apache.org Received: (qmail 30937 invoked by uid 99); 5 Apr 2013 18:36:33 -0000 Received: from nike.apache.org (HELO nike.apache.org) (192.87.106.230) by apache.org (qpsmtpd/0.29) with ESMTP; Fri, 05 Apr 2013 18:36:33 +0000 X-ASF-Spam-Status: No, hits=2.7 required=5.0 tests=FREEMAIL_ENVFROM_END_DIGIT,FREEMAIL_REPLYTO_END_DIGIT,HTML_MESSAGE,RCVD_IN_DNSWL_NONE,SPF_PASS X-Spam-Check-By: apache.org Received-SPF: pass (nike.apache.org: local policy) Received: from [98.138.90.80] (HELO nm17.bullet.mail.ne1.yahoo.com) (98.138.90.80) by apache.org (qpsmtpd/0.29) with ESMTP; Fri, 05 Apr 2013 18:36:26 +0000 Received: from [98.138.226.178] by nm17.bullet.mail.ne1.yahoo.com with NNFMP; 05 Apr 2013 18:36:04 -0000 Received: from [98.138.226.165] by tm13.bullet.mail.ne1.yahoo.com with NNFMP; 05 Apr 2013 18:36:04 -0000 Received: from [127.0.0.1] by omp1066.mail.ne1.yahoo.com with NNFMP; 05 Apr 2013 18:36:04 -0000 X-Yahoo-Newman-Property: ymail-3 X-Yahoo-Newman-Id: 572236.70785.bm@omp1066.mail.ne1.yahoo.com Received: (qmail 76663 invoked by uid 60001); 5 Apr 2013 18:36:04 -0000 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=yahoo.com; s=s1024; t=1365186964; bh=S4IynA7jvrH9RHGaSTEGjT6BEdn3uEM7tqPDcMDZuJs=; h=X-YMail-OSG:Received:X-Rocket-MIMEInfo:X-Mailer:References:Message-ID:Date:From:Reply-To:Subject:To:Cc:In-Reply-To:MIME-Version:Content-Type; b=ODihKw1j8qS1m7UbpNXU5xjQwZVAXbJn3i9Lg9Xlol5bY0C+vAPEbXoXlGm8oRS+uZ6r+hreDOGn4O4FNZ3vmXv7YL2xTFm7heRXBjgGZb9Cw0yWwglNtcp5dax4Xo++IYxpz9BtwneeqgRfVi6upyRPc1RjA4QaU6WVGeqIxJ4= DomainKey-Signature: a=rsa-sha1; q=dns; c=nofws; s=s1024; d=yahoo.com; h=X-YMail-OSG:Received:X-Rocket-MIMEInfo:X-Mailer:References:Message-ID:Date:From:Reply-To:Subject:To:Cc:In-Reply-To:MIME-Version:Content-Type; b=pF2nSWl05DaBvEyZbDlh2lzkldXCVfLdAcF1JHmDUZCOrEISkSJcHv3mUDlLK5n5nMw/M31bOYeh9LEaz+lzjYTlDKS+UkfXrV8HcbT64BXblo0yvTASowUXZ9jQYnWMgR5OXMTRLiedr/XcOHl4O6u4HNwnHtyuXcL7WwhjZt8=; X-YMail-OSG: SUnKdd4VM1nDjiJRP7pDXvx6Ly7zmw0BizUeyQbIQaSHTct 1v.uCNm6TP6hgSI4_ZAw6WzcJzuVuefiz0hOwu2rtnYRLIYDbD6PScXjrSlR EoeI5eAd2QSEsq8fUX53LKPlrPIfKmM_WYr893ziFhVDNl94Hy_BzMBnKt3f zAtIWLU5ICwWqFl6zVF8PwGFXIM_ntpvW7VTMpEPrvyw0KoTNJo9mCCRwANQ 7lDSfXW.c44N3dZtfKpt9ADzKiUxpra.D9DDTEEPS_Z6CSk5G5paBmCJBNxf vUmrfSpjfVxo7Kkh8UmNJFfFoadJVWcclpt5iKM8O5f4vScbmf7muoHMU0HU v1VLZl9uTgKNIk69l6B3yEWQ7ZmSJZDvgxg5uxArywL8PuR.9SnppOcuLZ50 _i1wHKq8LBJBA1WX0UYma72n3evhAwL.nYtWL4JaFUneqA9yZYTChgsWopiF YL3qOvJb6yFhWKCQR2fKu8agBqoN4ROhnWOOst.CSdPKebnIGZSqcrlKlaV_ 8hVuoS6RSsJ54AtkiFcJeAdyo4NXVlS4RWRzTE_GXlJsD7m1Bs6z3UkvOlap Oe1to_YIbYsQ3QOOVvc6MpAE2pWxxZ3dUdv.I7nla43fcNIcRA6v3jALGs_P lUECbke0ogoULINSpMMj1MltXUb6iAZ9cLZoqSNxTh.WumVOTbKzDa9BdK0P f1RiHqTjoHmAPtp8N4DRvozDUAOuDWzS8KxLdJIfkru81TQjrq3.5dJB9pPn E5kh7D5lhNlveKGxNLTA- Received: from [167.220.24.55] by web122305.mail.ne1.yahoo.com via HTTP; Fri, 05 Apr 2013 11:36:04 PDT X-Rocket-MIMEInfo: 002.001,VGhhbmtzLiBUaGlzIGlzIGp1c3QgYSB0ZXN0IGZyb20gbXkgbG9jYWwgYm94LiBTbyBlYWNoIGZpbGUgaXMgb25seSAxa2IuIEkgc2hhcmVkwqB0aGUgcXVlcnkgcGxhbnMgb2YgdGhlc2UgdHdvIHRlc3RzIGF0OgpodHRwOi8vY29kZXRpZHkuY29tL3Bhc3RlL3Jhdy81MTk4Cmh0dHA6Ly9jb2RldGlkeS5jb20vcGFzdGUvcmF3LzUxOTkKwqAKQWxzbyBpbiB0aGUgSGFkb29wIGxvZywgdGhlcmUgaXMgdGhpcyBsaW5lIGZvciBlYWNoIHBhcnRpdGlvbjpvcmcuYXBhY2hlLmhhZG9vcC5oaXZlLnFsLmV4ZWMuTWEBMAEBAQE- X-Mailer: YahooMailWebService/0.8.140.532 References: <1365116479.2198.YahooMailNeo@web122305.mail.ne1.yahoo.com> Message-ID: <1365186964.60050.YahooMailNeo@web122305.mail.ne1.yahoo.com> Date: Fri, 5 Apr 2013 11:36:04 -0700 (PDT) From: Ian Reply-To: Ian Subject: Re: Partition performance To: "user@hive.apache.org" Cc: Dean Wampler In-Reply-To: MIME-Version: 1.0 Content-Type: multipart/alternative; boundary="-947112896-495886422-1365186964=:60050" X-Virus-Checked: Checked by ClamAV on apache.org ---947112896-495886422-1365186964=:60050 Content-Type: text/plain; charset=iso-8859-1 Content-Transfer-Encoding: quoted-printable Thanks. This is just a test from my local box. So each file is only 1kb. I = shared=A0the query plans of these two tests at:=0Ahttp://codetidy.com/paste= /raw/5198=0Ahttp://codetidy.com/paste/raw/5199=0A=A0=0AAlso in the Hadoop l= og, there is this line for each partition:org.apache.hadoop.hive.ql.exec.Ma= pOperator: Adding alias test1 to work list for file hdfs://localhost:8020/t= est1/2011/02/01/01=0ADoes that mean each partition will become a map task?= =0A=A0=0AI'm still new in Hive, just wondering what are the common strategy= for partitioning the hourly logs?=A0I know we shouldn't have too many part= itions but I'm wondering what's the reason behind it? If I run this on a re= al cluster, maybe it won't perform so differently?=0A=A0=0AThanks.=A0 =0A= =0A________________________________=0A From: Dean Wampler =0ATo: user@hive.apache.org =0ASent: Thursday, April 4, = 2013 4:28 PM=0ASubject: Re: Partition performance=0A =0A=0AAlso, how big a= re the files in each directory? Are they roughly the size of one HDFS block= or a multiple. Lots of small files will mean lots of mapper tasks will lit= tle to do.=0A=0AYou can also compare the job tracker console output for eac= h job. I bet the slow one has a lot of very short map and reduce tasks, whi= le the faster one has fewer tasks that run longer. A rule of thumb is that = any one task should take 20 seconds or more to amortize over the few second= s spent in start up per task. =0A=0AIn other words, if you think about what= 's happening at the HDFS and MR level, you can learn to predict how fast or= slow things will run. Learning to read the output of EXPLAIN or EXPLAIN EX= TENDED helps with this. =0A=0Adean=0A=0A=0AOn Thu, Apr 4, 2013 at 6:25 PM, = Owen O'Malley wrote:=0A=0ASee slide #9 from my Optimiz= ing Hive Queries talk http://www.slideshare.net/oom65/optimize-hivequeriesp= ptx . Certainly, we will improve it, but for now you are much better off wi= th 1,000 partitions than 10,000.=0A>=0A>-- Owen=0A>=0A>=0A>=0A>On Thu, Apr = 4, 2013 at 4:21 PM, Ramki Palle wrote:=0A>=0A>Is it= possible for you to send the explain plan of these two queries?=0A>>=0A>>R= egards,=0A>>Ramki.=0A>>=0A>>=0A>>=0A>>=0A>>On Thu, Apr 4, 2013 at 4:06 PM, = Sanjay Subramanian wrote:=0A>>=0A>>Th= e slow down is most possibly due to large number of partitions. =0A>>>I bel= ieve the Hive book authors tell us to be cautious with large number of part= itions :-) =A0and I abide by that. =0A>>>=0A>>> =0A>>>Users =0A>>>Please ad= d your points of view and experiences =0A>>>=0A>>> =0A>>>Thanks =0A>>>sanja= y =0A>>>=0A>>> From: Ian =0A>>>Reply-To: "user@hive.apach= e.org" , Ian =0A>>>Date: Thursday, = April 4, 2013 4:01 PM=0A>>>To: "user@hive.apache.org" =0A>>>Subject: Partition performance=0A>>>=0A>>>=0A>>> =0A>>>Hi, =0A>>>=0A= >>>I created 3 years of hourly log files (totally 26280 files), and use Ext= ernal Table with partition to query. I tried two partition methods. =0A>>>= =0A>>>1). Log files are stored as /test1/2013/04/02/16/000000_0 (A director= y per hour). Use date and hour as partition keys. Add 3 years of directorie= s to the table partitions. So there are 26280 partitions. =0A>>>=A0=A0=A0= =A0=A0=A0=A0=A0CREATE EXTERNAL TABLE test1 (logline string) PARTITIONED BY = (dt string, hr int); =0A>>>=A0=A0=A0=A0=A0=A0=A0=A0ALTER TABLE test1 ADD PA= RTITION (dt=3D'2013-04-02', hr=3D16) LOCATION '/test1/2013/04/02/16'; =0A>>= >=A0 =0A>>>2). Log files are stored as /test2/2013/04/02/16_000000_0 (A dir= ectory per day, 24 files in each directory). Use date as partition key. Add= 3 years of directories to the table partitions. So there are 1095 partitio= ns. =A0=A0=A0=A0=A0=A0=A0=A0CREATE EXTERNAL TABLE test2 (logline string) PA= RTITIONED BY (dt string); =0A>>>=A0=A0=A0=A0=A0=A0=A0=A0ALTER TABLE test2 A= DD PARTITION (dt=3D'2013-04-02') LOCATION '/test2/2013/04/02'; =0A>>>=A0 = =0A>>>When doing a simple query like =0A>>>=A0=A0=A0=A0SELECT * FROM=A0 te= st1/test2=A0 WHERE=A0 dt >=3D '2013-02-01' and dt <=3D '2013-02-14' =0A>>>= Using approach #1 takes 320 seconds, but #2 only takes 70 seconds. =0A>>>= =0A>>>I'm wondering why there is a big performance difference between these= two? These two approaches have the same number of files, only the director= y structure is different. So Hive is going to load the same amount of files= . Why does the number of partitions have such big impact? Does that mean #2= is a better partition strategy? =0A>>>=A0 =0A>>>Thanks. =0A>>>=0A>>> = =0A>>>=0A>>>CONFIDENTIALITY NOTICE=0A>>>=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=0A>>>This email message and any attachmen= ts are for the exclusive use of the intended recipient(s) and may contain c= onfidential and privileged information. Any unauthorized review, use, discl= osure or distribution is prohibited. If you are not the intended recipient,= =0A please contact the sender by reply email and destroy all copies of the = original message along with any attachments, from your computer system. If = you are the intended recipient, please be advised that the content of this = message is subject to access, review=0A and disclosure by the sender's Emai= l System Administrator.=0A>>> =0A>> =0A> =0A=0A=0A-- =0ADean Wampler, Ph.D= .=0Athinkbiganalytics.com=0A+1-312-339-1330 ---947112896-495886422-1365186964=:60050 Content-Type: text/html; charset=iso-8859-1 Content-Transfer-Encoding: quoted-printable
Thanks. Th= is is just a test from my local box. So each file is only 1kb. I shared&nbs= p;the query plans of these two tests at:
 
Also in the Hadoop log, there is this line for each partition:<= /span>
org.apache.hadoop.hive.ql.exec.MapOperator: Adding a= lias test1 to work list for file hdfs://localhost:8020/test1/2011/02/01/01<= /div>
Does that mean each partition will become a map task?
 
I'm still new in Hive, just wondering what are the c= ommon strategy for partitioning the hourly logs? I know we shouldn't have too many partitions but I'm wondering what's the reason behind it? If= I run this on a real cluster, maybe it won't perform so differently?
=
 
Thanks.
 
From: Dean Wam= pler <dean.wampler@thinkbiganalytics.com>
To: user@hive.apache.org
Sent: Thursday, April 4, 2013 4:28 PM
= Subject: Re: Partition perfor= mance

=0A
Also, how big are the files in each= directory? Are they roughly the size of one HDFS block or a multiple. Lots= of small files will mean lots of mapper tasks will little to do.

<= /div>
You can also compare the job tracker console output for each job.= I bet the slow one has a lot of very short map and reduce tasks, while the= faster one has fewer tasks that run longer. A rule of thumb is that any on= e task should take 20 seconds or more to amortize over the few seconds spen= t in start up per task.
=0A

In other words, if you = think about what's happening at the HDFS and MR level, you can learn to pre= dict how fast or slow things will run. Learning to read the output of EXPLA= IN or EXPLAIN EXTENDED helps with this.
=0A

dean
On Thu, Apr 4, 2013 at 6:25 PM,= Owen O'Malley <om= alley@apache.org> wrote:
=0A
See slide #9 from my Optimizing Hive Queries ta= lk http://www.slideshare.net/oom65/optimize-hivequeriespptx . Certainly, we= will improve it, but for now you are much better off with 1,000 partitions= than 10,000.=0A=0A
-- Owen


On Thu, Apr 4, 2013= at 4:21 PM, Ramki Palle <ramki.palle@gmail.com> wrote:
=0A=0A
Is it possible = for you to send the explain plan of these two queries?

Regards= ,
=0A=0A
Ramki.


On Thu, Apr 4, 2013 a= t 4:06 PM, Sanjay Subramanian <Sanjay.Subramanian@wizecomm= erce.com> wrote:
=0A=0A=0A
=0A=0A=0A=0A
=0A
The slow down is most possib= ly due to large number of partitions.
=0A
I believe the Hive book = authors tell us to be cautious with large number of partitions :-)  an= d I abide by that.
=0A

=0A
=0A
Users
=0A
Ple= ase add your points of view and experiences
=0A

=0A
=0AThanks
=0A
sanjay
=0A

=0A
=0A=0A
=0A=0A=0A=0AFrom: Ian <liuhy0@yahoo.com>
=0Auser@hive.apach= e.org" <user@hive.apache.org= >, Ian <liuhy0@yahoo.com><= br>=0A=0A=0A=0ADate: Thursday, Ap= ril 4, 2013 4:01 PM
=0ATo: "user@hive.apache.org" <user@hive.apache.org>
=0A=0A=0A= =0ASubject: Partition performance=
=0A
=0A

=0A
=0A
=0A
=0A
=0A
Hi,
=0A
 
=0A
I created 3 years of hourly l= og files (totally 26280 files), and use External Table with partition to qu= ery. I tried two partition methods.
=0A
 
=0A
1). Lo= g files are stored as /test1/2013/04/02/16/000000_0 (A directory per hour).= Use date and hour as partition keys. Add 3 years of directories to the tab= le partitions. So there are 26280 partitions.
=0A
 &nbs= p;      CREATE EXTERNAL T= ABLE test1 (logline string) PARTITIONED BY (dt string, hr int);
=0A        ALTER TABLE t= est1 ADD PARTITION (dt=3D'2013-04-02', hr=3D16) LOCATION '/test1/2013/04/02= /16';
=0A
 
=0A<= div>2). Log files are stored as /test2/2013/04/02/16_000000_0 (= A directory per day, 24 files in each directory). Use date as partition key= . Add 3 years of directories to the table partitions. So there are 1095 par= titions.
=0A=0A=0A=0A=0A
 &nb= sp;      CREATE EXTERNAL = TABLE test2 (logline string) PARTITIONED BY (dt string);
=0A
        ALTER TABLE test2 AD= D PARTITION (dt=3D'2013-04-02') LOCATION '/test2/2013/04/02';=
=0A
 
=0A
Whe= n doing a simple query like
=0A
&= nbsp;   SELECT * FROM  test1/test2  WHERE&n= bsp; dt >=3D '2013-02-01' and dt <=3D '2013-02-14'=0A=0A
Using approach #1 takes 320 seconds, but #2 only takes 7= 0 seconds.
=0A
 
=0A
I'm wondering why = there is a big performance difference between these two? These two approach= es have the same number of files, only the directory structure is different= . So Hive is going to load the same amount of files.=0A Why does the number= of partitions have such big impact? Does that mean #2 is a better partitio= n strategy?
=0A
 =0A
Thanks.
=0A=0A
 =0A
 
=0A
=0A
=0A
=0A
=0A
CONFIDENTIALITY NO= TICE
=0A=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D
=0AThis email message and any attachments are for the exclusive use = of the intended recipient(s) and may contain confidential and privileged in= formation. Any unauthorized review, use, disclosure or distribution is proh= ibited. If you are not the intended recipient,=0A please contact the sender= by reply email and destroy all copies of the original message along with a= ny attachments, from your computer system. If you are the intended recipien= t, please be advised that the content of this message is subject to access,= review=0A and disclosure by the sender's Email System Administrator.
= =0A
=0A
=0A=0A

=0A

=0A



--
Dean Wampler, Ph.D.
+1-312-339-1330
=0A
=0A
=0A


---947112896-495886422-1365186964=:60050--