Return-Path: X-Original-To: archive-asf-public-internal@cust-asf2.ponee.io Delivered-To: archive-asf-public-internal@cust-asf2.ponee.io Received: from cust-asf.ponee.io (cust-asf.ponee.io [163.172.22.183]) by cust-asf2.ponee.io (Postfix) with ESMTP id 07281200CB4 for ; Tue, 13 Jun 2017 01:42:47 +0200 (CEST) Received: by cust-asf.ponee.io (Postfix) id 05B84160BDE; Mon, 12 Jun 2017 23:42:47 +0000 (UTC) Delivered-To: archive-asf-public@cust-asf.ponee.io Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by cust-asf.ponee.io (Postfix) with SMTP id 49E50160BD9 for ; Tue, 13 Jun 2017 01:42:46 +0200 (CEST) Received: (qmail 61540 invoked by uid 500); 12 Jun 2017 23:42:40 -0000 Mailing-List: contact user-help@hive.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: user@hive.apache.org Delivered-To: mailing list user@hive.apache.org Received: (qmail 61530 invoked by uid 99); 12 Jun 2017 23:42:40 -0000 Received: from mail-relay.apache.org (HELO mail-relay.apache.org) (140.211.11.15) by apache.org (qpsmtpd/0.29) with ESMTP; Mon, 12 Jun 2017 23:42:40 +0000 Received: from [10.45.0.84] (70-89-66-137-ca.sfba.hfc.comcastbusiness.net [70.89.66.137]) by mail-relay.apache.org (ASF Mail Server at mail-relay.apache.org) with ESMTPSA id 1929C1A0312; Mon, 12 Jun 2017 23:42:37 +0000 (UTC) User-Agent: Microsoft-MacOutlook/f.19.0.160817 Date: Mon, 12 Jun 2017 16:42:25 -0700 Subject: Re: Hive query on ORC table is really slow compared to Presto From: Gopal Vijayaraghavan Sender: Gopal Vijayaraghavan To: "user@hive.apache.org" CC: Premal Shah Message-ID: <92978D5B-6EB7-4269-BDE5-21051EDD2EE6@hortonworks.com> Thread-Topic: Hive query on ORC table is really slow compared to Presto References: In-Reply-To: Mime-version: 1.0 Content-type: text/plain; charset="UTF-8" Content-transfer-encoding: quoted-printable archived-at: Mon, 12 Jun 2017 23:42:47 -0000 Hi, I think this is worth fixing because this seems to be triggered by the data= quality itself - so let me dig in a bit into a couple more scenarios. > hive.optimize.distinct.rewrite is True by default FYI, we're tackling the count(1) + count(distinct col) case in the Optimize= r now (which came up after your original email). https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties#C= onfigurationProperties-hive.optimize.countdistinct > On running the orcfiledump utility, I see that the column on which I want= to run the distinct query is encoded with a DIRECT encoding.=C2=A0 When I run d= istinct on other columns in the table that are encoded with the dictionary e= ncoding, the query runs quickly.=C2=A0 So the cut-off for dictionary encoding is that the value repeats at least ~= 2x in each stripe - so very unique patterns won't trigger this. If the total # of rows of IP =3D=3D total IP values, I don't expect it to be en= coded as a dictionary. Also interesting detail - I prefer to now store IPs as 2 bigint cols. bigint ip1, bigint ip2 This was primarily driven by the crazy math required to join different cont= ractions of the IPv6 formatting. The two colon contractions are crazy when you want to joins across differen= t data sources, if you store as a text string. Maybe 2017 is the year of IPv= 6 :D. > CLUSTERED BY (ip)=C2=A0INTO 16 BUCKETS This is something that completely annoys me - CLUSTERED BY does not cluster= , but that doesn't help you here since IP is unique. You need SORTED BY (ip) to properly generate clusters in Hive. > Running a count(distinct) query on master id took 3+ hours. It looks like= the CPU was busy when running this query. Can you do me a favour and run some intermediate state data exploratory que= ries, because some part of the slowness is probably triggered due to the fai= lure tolerance checkpoints. count(distinct hash(ip)) from the table?=20 count count(1) as collisions, hash(ip) from table group by hash(ip) order b= y collisions desc limit 10; And, if those show many collisions set tez.runtime.io.sort.mb=3D640; set hive.map.aggr=3Dfalse; set tez.runtime.pipelined.shuffle=3Dtrue; // this reduces failure tolerance (= i.e retries are more expensive, happy path is faster) select count(distinct ip) from ip_table; Cheers, Gopal=20