Subject: Re: Issue joining 21 HUGE Hive tables
From: Mich Talebzadeh
To: user@hive.apache.org
Date: Thu, 24 Mar 2016 07:38:15 +0000

Posting a typical query that you are using will help to clarify the issue.

Also, you may use TEMPORARY TABLEs to hold intermediate results between stages.

On the face of it, you can time each query to find out which components take longest, etc.:

select from_unixtime(unix_timestamp(), 'dd/MM/yyyy HH:mm:ss.ss') AS StartTime;

CREATE TEMPORARY TABLE tmp AS
SELECT t.calendar_month_desc, c.channel_desc, SUM(s.amount_sold) AS TotalSales
--FROM smallsales s, times t, channels c
FROM sales s, times t, channels c
WHERE s.time_id = t.time_id
AND   s.channel_id = c.channel_id
GROUP BY t.calendar_month_desc, c.channel_desc
;
select from_unixtime(unix_timestamp(), 'dd/MM/yyyy HH:mm:ss.ss') AS FirstQuery;
SELECT calendar_month_desc AS MONTH, channel_desc AS CHANNEL, TotalSales
FROM tmp
ORDER BY MONTH, CHANNEL
LIMIT 5
;
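Following the same pattern, a closing timestamp after the last query brackets the whole run (the EndTime alias is illustrative):

-- closing timestamp, same pattern as StartTime/FirstQuery above
select from_unixtime(unix_timestamp(), 'dd/MM/yyyy HH:mm:ss.ss') AS EndTime;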


HTH

Dr Mich Talebzadeh

LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

http://talebzadehmich.wordpress.com


On 24 March 2016 at 06:36, Jörn Franke <jornfranke@gmail.com> wrote:
Joining so many external tables is always an issue with any component. Your problem is not Hive-specific, but your data model seems to be messed up. First of all, you should have the tables in an appropriate format, such as ORC or Parquet, and they should not be external. Then you should use the right data types for columns, e.g. an INT instead of a VARCHAR if a column holds only numbers. After that, check whether you can pre-join the data, store it in one big flat table, and run your queries on that; a sketch of these steps follows below.
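As a minimal sketch, with hypothetical table and column names (sales_ext standing in for one of the large external tables, id/amount for its columns); the real DDL depends on your schema:

-- 1) Convert a text-format external table into a managed ORC table
--    with proper column types (names here are illustrative).
CREATE TABLE sales_orc
STORED AS ORC
TBLPROPERTIES ('orc.compress'='SNAPPY')
AS
SELECT CAST(id     AS INT)           AS id,
       CAST(amount AS DECIMAL(10,2)) AS amount,
       time_id,
       channel_id
FROM sales_ext;

-- 2) Pre-join into one big flat table and query that instead of
--    repeating the multi-way join (times_orc/channels_orc assumed
--    converted the same way).
CREATE TABLE sales_flat
STORED AS ORC
AS
SELECT s.id, s.amount, t.calendar_month_desc, c.channel_desc
FROM sales_orc s
JOIN times_orc    t ON s.time_id    = t.time_id
JOIN channels_orc c ON s.channel_id = c.channel_id;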

Then you should look at min/max indexes, bloom filters, statistics, partitions, etc.; see the sketch below.
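A sketch of what that could look like, again with hypothetical names; which columns to partition, bloom-filter, and analyze depends entirely on your queries (ORC keeps min/max indexes per stripe automatically):

-- Partitioned ORC table with a bloom filter on the join key
-- (event_date/id are illustrative choices).
CREATE TABLE sales_part (
  id     INT,
  amount DECIMAL(10,2)
)
PARTITIONED BY (event_date STRING)
STORED AS ORC
TBLPROPERTIES ('orc.bloom.filter.columns'='id');

-- Gather table- and column-level statistics so the optimizer can
-- choose better join orders and plans.
ANALYZE TABLE sales_part PARTITION (event_date) COMPUTE STATISTICS;
ANALYZE TABLE sales_part PARTITION (event_date) COMPUTE STATISTICS FOR COLUMNS;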

Maybe you can post more details about the data model and queries.

On 24 Mar 2016, at 02:49, Sanka, Himabindu <himabindu_sanka@optum.com> wrote:

Hi Team,


I need some inputs from you. I have a requirement for my project where I have to join 21 Hive external tables.


Out of these, 6 tables are HUGE, with 500 million records of data. The other 15 tables are smaller, around 100 to 1000 records each.


When I do inner joins / left outer joins, the query takes hours to run.


Please let me know some optimization techniques, or any other ecosystem components that perform better than Hive.


Regards,

Hima




