From: Noam Hasson
Date: Thu, 20 Aug 2015 11:49:34 +0300
Subject: Re: HIVE:1.2, Query taking huge time
To: user@hive.apache.org
Cc: dev@hive.apache.org
Hi,

Have you looked at the counters on the Hadoop side? It's possible you are dealing with a bad join that multiplies rows: if you see a huge number of input/output records in the map/reduce phases that keeps increasing, that is probably the case.

Another thing I would try is to divide the job into several smaller queries, for example start with the filter only, then the join, and so on (see the sketch below the quoted message).

Noam.

On Thu, Aug 20, 2015 at 10:55 AM, Nishant Aggarwal wrote:
> Dear Hive Users,
>
> I am in the process of running a POC for one of my customers, demonstrating
> the huge performance benefits of Hadoop Big Data using Hive.
>
> Following is the problem statement I am stuck with.
>
> I have generated a large table with 28 columns (all double). Table size
> on disk is 70GB (I ultimately created a compressed table using the ORC format
> to save disk space, bringing the table size down to < 1GB) with more than
> 450 million records.
>
> In order to demonstrate a complex use case, I joined this table with
> itself. Following are the queries I used to create the table and the join
> query I am running.
>
> *Create Table and Loading Data, Hive parameter settings:*
> set hive.vectorized.execution.enabled = true;
> set hive.vectorized.execution.reduce.enabled = true;
> set mapred.max.split.size=100000000;
> set mapred.min.split.size=1000000;
> set hive.auto.convert.join=false;
> set hive.enforce.sorting=true;
> set hive.enforce.bucketing=true;
> set hive.exec.dynamic.partition=true;
> set hive.exec.dynamic.partition.mode=nonstrict;
> set mapreduce.reduce.input.limit=-1;
> set hive.exec.parallel = true;
>
> CREATE TABLE huge_numeric_table_orc2(col1 double,col2 double,col3
> double,col4 double,col5 double,col6 double,col7 double,col8 double,col9
> double,col10 double,col11 double,col12 double,col13 double,col14
> double,col15 double,col16 double,col17 double,col18 double,col19
> double,col20 double,col21 double,col22 double,col23 double,col24
> double,col25 double,col26 double,col27 double,col28 double)
> clustered by (col1) sorted by (col1) into 240 buckets
> STORED AS ORC tblproperties ("orc.compress"="SNAPPY");
>
> from huge_numeric_table insert overwrite table huge_numeric_table_orc2
> select * sort by col1;
>
>
> *JOIN QUERY:*
>
> select (avg(t1.col1)*avg(t1.col6))/(avg(t1.col11)*avg(t1.col16)) as AVG5
> from huge_numeric_table_orc2 t1 left outer join huge_numeric_table_orc2 t2
> on t1.col1=t2.col1 where (t1.col1) > 34.11 and (t2.col1) > 10.12
>
>
> *The problem is that this query gets stuck at 80-85% on the reducers,
> goes into a loop, and never finishes.*
>
> The Hive version is 1.2.
>
> Please help.
>
>
> Thanks and Regards
> Nishant Aggarwal, PMP
> Cell No:- +91 99588 94305
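A minimal sketch of that staged approach, assuming the table and column names from the quoted message; the intermediate table names t1_filtered and t2_filtered are only illustrative. Note that all four averages read from t1, so every duplicate value of col1 on the t2 side multiplies t1 rows in the join output, which is exactly the kind of record-count blow-up the counters would reveal.

-- Step 1: run each filter on its own and materialize the result,
-- so the row counts can be checked before any join runs.
CREATE TABLE t1_filtered STORED AS ORC AS
SELECT col1, col6, col11, col16
FROM huge_numeric_table_orc2
WHERE col1 > 34.11;

CREATE TABLE t2_filtered STORED AS ORC AS
SELECT col1
FROM huge_numeric_table_orc2
WHERE col1 > 10.12;

-- Step 2: add the join back on top of the (smaller) intermediates.
-- If the reducers stall here while the join's output record counter keeps
-- climbing far past the input counts, the join is multiplying rows.
SELECT (avg(t1.col1) * avg(t1.col6)) / (avg(t1.col11) * avg(t1.col16)) AS AVG5
FROM t1_filtered t1
LEFT OUTER JOIN t2_filtered t2
  ON t1.col1 = t2.col1;

If step 1 alone is already slow, the cost is in the scan and filter; if step 2 is where the reducers get stuck at 80-85%, the self-join itself is the problem.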
--
This e-mail, as well as any attached document, may contain material which is confidential and privileged and may include trademark, copyright and other intellectual property rights that are proprietary to Kenshoo Ltd, its subsidiaries or affiliates ("Kenshoo"). This e-mail and its attachments may be read, copied and used only by the addressee for the purpose(s) for which it was disclosed herein. If you have received it in error, please destroy the message and any attachment, and contact us immediately. If you are not the intended recipient, be aware that any review, reliance, disclosure, copying, distribution or use of the contents of this message without Kenshoo's express permission is strictly prohibited.