Subject: Re: Multi-GroupBy-Insert optimization
From: Jan Dolinár <dolik.rce@gmail.com>
To: user@hive.apache.org
Date: Tue, 5 Jun 2012 14:47:48 +0200

Hi Shan,

If you happen to have a lot of repeated data (in the most general grouping), you might get some speedup from a little pre-aggregation. The following code should produce the same results as the example in your first post:

FROM (
  SELECT a, b, c, count(*) AS cnt
  FROM X
  GROUP BY a, b, c
) t
INSERT OVERWRITE LOCAL DIRECTORY 'output/y1' SELECT a, b, c, cnt
INSERT OVERWRITE LOCAL DIRECTORY 'output/y2' SELECT a, SUM(cnt) GROUP BY a
INSERT OVERWRITE LOCAL DIRECTORY 'output/y3' SELECT b, SUM(cnt) GROUP BY b
...

The trick is that there will be one extra job that first reduces the number of records used by the following jobs: they only have to read one row per distinct triplet (a, b, c). Note that this only helps if the number of distinct combinations is relatively low compared to the total amount of data. In other cases it might make no difference, or even make the calculation longer.

Hope that helps... I can't think of anything else that could help you.

Jan

On 6/5/12, Jan Dolinár <dolik.rce@gmail.com> wrote:
> On 6/4/12, shan s wrote:
>> Thanks for the explanation, Jan.
>> If I understand correctly, the input will be read a single time and
>> will be preprocessed in some form, and this intermediate data is used
>> for the subsequent group-bys.
>> Not sure if my scenario will benefit from this single step, since the
>> group-by varies across vast entities.
>
> Yes, that is correct. The very simplest use case is when you only scan
> a part of the table. But if you are interested in all the data, it is
> not going to help you much.
>
>> If I were to implement the group-bys manually, we could generally club
>> them together in a single program. Can I do better with Hive, with
>> some hints/optimizations?
>> Or is there a possibility that Pig might perform better in this case
>> (assuming Pig would probably handle this in a single job)?
>
> In some cases you might be able to outsmart the Hive optimizer and
> write the MapReduce job directly in Java in such a way that it performs
> better. In most cases, though, it is probably not worth the trouble.
> You might easily end up in a situation where buying more machines is
> cheaper than developing a low-level solution that might or might not
> be slightly faster... I'm not familiar with Pig or any other tools that
> might be of use in your situation.
>
> Jan
>
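[Archive note: the pre-aggregation trick described above can be illustrated outside Hive. The sketch below uses made-up data, not anything from this thread; it only demonstrates that summing per-(a, b, c) counts grouped by a single column gives the same answer as counting directly over the full table, which is why the extra job is safe.]

```python
from collections import Counter

# Hypothetical rows of table X, as (a, b, c) tuples.
rows = [("a1", "b1", "c1"), ("a1", "b1", "c1"), ("a1", "b2", "c1"),
        ("a2", "b1", "c2"), ("a2", "b1", "c2"), ("a2", "b1", "c2")]

# Step 1: the extra job -- pre-aggregate to one record per distinct
# (a, b, c) triplet, with its count. 6 rows shrink to 3 records here.
pre = Counter(rows)

# Step 2: each per-column group-by reads the small pre-aggregated set
# and sums the counts, instead of re-scanning every row of X.
by_a = Counter()
for (a, b, c), cnt in pre.items():
    by_a[a] += cnt

# Counting directly over the full table gives the identical result.
direct_by_a = Counter(a for a, b, c in rows)
assert by_a == direct_by_a
print(dict(by_a))
```

As in the Hive version, the saving comes entirely from step 2 reading `len(pre)` records instead of `len(rows)`; if nearly every triplet is unique, the two are the same size and the extra job is pure overhead.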