Date: Sat, 27 Apr 2013 11:14:18 -0700
Subject: Re: Huge join performance issue
From: Jie Li <jieli@cs.duke.edu>
To: user@hive.apache.org

In order for us to understand the performance and identify the bottlenecks, could you do two things:

1) run the EXPLAIN command and share the output with us
2) share with us the Hadoop job histories generated by the query. They can be collected following http://www.cs.duke.edu/starfish/tutorial/job_history.html
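For reference, EXPLAIN just goes in front of the query; sketched against the query from your first mail it would look like:

EXPLAIN   -- prints the plan (stages, join and group-by operators) without running the job
select t1.b, t1.c, t2.d, t2.e, count(*)
from (select a,b,c from baseTB1 where ... ) t1
join (select a,d,e from baseTB2 where ...) t2
on t1.a = t2.a
group by t1.b, t1.c, t2.d, t2.e;

EXPLAIN EXTENDED gives more detail if the plain output isn't enough.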

Jie



On Mon, Apr 8, 2013 at 11:39 AM, Igor Tatarinov <igor@decide.com> wrote:
Did you verify that all your available mappers are running (and reducers too)? If you have a small number of partitions with huge files, you might be underutilizing mappers (check that the files are being split). Also, it might be optimal to have a single "wave" of reducers by setting the number of reduce tasks appropriately.
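For example (a sketch; the right number depends on how many reduce slots your cluster has):

-- suppose the cluster offers ~100 reduce slots (a made-up number)
set mapred.reduce.tasks=100;   -- one wave: no reducer waits in line behind another
-- or let Hive size reducers by input volume instead:
set hive.exec.reducers.bytes.per.reducer=1000000000;   -- roughly 1GB of input per reducer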

You might also consider optimizing a simpler query first:

select t1.a, count(*)
from (select a from baseTB1 where ... ) t1   -- filter by partition as well
  join
      (select a from baseTB2 where ...) t2   -- filter by partition as well
on t1.a = t2.a
group by t1.a

just to give you an idea how much overhead the extra columns are adding. If the columns are pretty big, they could be causing the slowdown.

igor
decide.com


On Sat, Apr 6, 2013 at 2:30 PM, Gabi D <gabid33@gmail.com> wrote:
Thank you for your answer, Nitin.
Does anyone have additional insight into this? It would be greatly appreciated.


On Thu, Apr 4, 2013 at 3:39 PM, Nitin Pawar <nitinpawar432@gmail.com> wrote:
You don't really need subqueries to join tables which have common columns; it's additional overhead.
The best way to filter your data and speed up your processing is how you lay out your data.
When you have larger tables, I would use partitioning and bucketing to trim down the data and improve join performance.
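For example, a layout along these lines (just a sketch; the column types, partition column, and bucket count are invented):

create table baseTB1 (a bigint, b string, c string)
partitioned by (dt string)         -- a where clause on dt prunes whole partitions
clustered by (a) into 64 buckets   -- bucket both tables on the join key a
stored as rcfile;

set hive.enforce.bucketing = true;       -- honor the bucket spec when loading
set hive.optimize.bucketmapjoin = true;  -- join matching buckets map-side, avoiding a full shuffle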

distribute by is mainly used when you have custom map-reduce scripts and want to use the transform functionality in Hive. I haven't used it a lot, so I'm not sure about that part. It's also helpful to write where clauses in the join statements to reduce the dataset you want to join.
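The usual shape is something like this (a sketch; the script name is hypothetical):

add file my_reducer.py;   -- your custom reduce-side script (made up here)
from (select a, b from baseTB1 distribute by a sort by a) t
select transform (t.a, t.b) using 'python my_reducer.py' as (a, cnt);
-- distribute by routes all rows with the same a to one reducer;
-- sort by orders the rows within each reducer before the script sees them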



On Thu, Apr 4, 2013 at 5:53 PM, Gabi D <gabid33@gmail.com> wrote:
Hi all,
I have two tables I need to join and then summarize.
They are both huge (about 1B rows each, in the relevant partitions) and the query runs for over 2 hours, creating 5TB of intermediate data.

The current query looks like this:

select t1.b, t1.c, t2.d, t2.e, count(*)
from (select a,b,c from baseTB1 where ... ) t1   -- filter by partition as well
  join
      (select a,d,e from baseTB2 where ...) t2   -- filter by partition as well
on t1.a = t2.a
group by t1.b, t1.c, t2.d, t2.e


two questions:
1. would joining baseTB1 and baseTB2 directly (instead of via subqueries) be better in any way?
          (I know subqueries cause a lot of writes of intermediate data, but we also understand it's best to filter down the data being joined; which is "more" correct?)
2. can I use 'distribute by' and/or 'sort by' in some way that would help this? My understanding at the moment is that the problem lies in the fact
that the reduces are keyed on column a while the group by is on column b ...

Any thoughts would be appreciated.



--
Nitin Pawar


