Return-Path: X-Original-To: apmail-hive-user-archive@www.apache.org Delivered-To: apmail-hive-user-archive@www.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id 9B8F4D1DD for ; Tue, 16 Oct 2012 05:18:19 +0000 (UTC) Received: (qmail 33820 invoked by uid 500); 16 Oct 2012 05:18:18 -0000 Delivered-To: apmail-hive-user-archive@hive.apache.org Received: (qmail 33463 invoked by uid 500); 16 Oct 2012 05:18:14 -0000 Mailing-List: contact user-help@hive.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: user@hive.apache.org Delivered-To: mailing list user@hive.apache.org Received: (qmail 33443 invoked by uid 99); 16 Oct 2012 05:18:14 -0000 Received: from athena.apache.org (HELO athena.apache.org) (140.211.11.136) by apache.org (qpsmtpd/0.29) with ESMTP; Tue, 16 Oct 2012 05:18:14 +0000 X-ASF-Spam-Status: No, hits=2.2 required=5.0 tests=HTML_MESSAGE,RCVD_IN_DNSWL_LOW,SPF_NEUTRAL X-Spam-Check-By: apache.org Received-SPF: neutral (athena.apache.org: local policy) Received: from [209.85.210.48] (HELO mail-da0-f48.google.com) (209.85.210.48) by apache.org (qpsmtpd/0.29) with ESMTP; Tue, 16 Oct 2012 05:18:08 +0000 Received: by mail-da0-f48.google.com with SMTP id z8so3740286dad.35 for ; Mon, 15 Oct 2012 22:17:47 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20120113; h=mime-version:in-reply-to:references:date:message-id:subject:from:to :content-type:x-gm-message-state; bh=I8jb2G3vUKkR09Y4CErLvJXUFHBbfTp2DawkaEFXa0c=; b=PhoxBMh5sHnvH1pICMi1mEwy+HIVnMum8eTZCBTOVQmKNUqqJlXbLOLu4bG4Yt5koO dpoCnn7vn6wxZ1mcNa2pnrBokkxk8iYv3uuWgUw0n+MGT/4eqGCajoXL5dHzKA/gIuaa Gcz9ugptTCp0C5ticZC0C2EH5lHealq/4IYcCZjMeIHa4nj6xd+hcDe91mFDQ0Yia6Ar QN3I4YfCDFkrpCOZqLbaHkVj8/aXl3Po+b9kVhcA5/sPLSiYqPpoCW6Nz9ayRgcUyNyw 8ntwwSP+pNgZiq1a7BKURhzAMrGLndZqvADDwmdmvFNVcreXT2B3wUeA+njhKgUmaU0G nb2w== MIME-Version: 1.0 Received: by 10.68.130.201 with SMTP 
id og9mr43686251pbb.12.1350364667419; Mon, 15 Oct 2012 22:17:47 -0700 (PDT) Received: by 10.68.220.229 with HTTP; Mon, 15 Oct 2012 22:17:47 -0700 (PDT) In-Reply-To: References: Date: Tue, 16 Oct 2012 14:17:47 +0900 Message-ID: Subject: Re: Hive Query Unable to distribute load evenly in reducers From: =?EUC-KR?B?TmF2aXO3+b3Cv+w=?= To: user@hive.apache.org Content-Type: multipart/alternative; boundary=047d7b10c8c9519c6504cc264649 X-Gm-Message-State: ALoCoQkR3qrqAfym+Hgtq8ctZpqOhM+X+R8j8HxVZ0VsE3TYTs9eUqu0ko6eEzfDcTJsqAE1QEqc X-Virus-Checked: Checked by ClamAV on apache.org --047d7b10c8c9519c6504cc264649 Content-Type: text/plain; charset=ISO-8859-1 How about using MapJoin? 2012/10/16 Saurabh Mishra > no there is apparently no heavy skewing. also another stats i wanted to > point was, following is approximate table contents in this 4 table join > query : > tableA : 170 million (actual number, + i am also exploding these records, > so the number could be much much higher) > tableB:15 > tableC:45 > tableD:45 > tableE : 45 > tableF : 14000 > > Also i cannot put any filter condition on tableA ,situation does not > permit so. :( > Kindly suggest, some alternative solution or some hive configuration to > better load distribute in the reducers > > > Date: Mon, 15 Oct 2012 16:29:56 +0100 > > > Subject: Re: Hive Query Unable to distribute load evenly in reducers > > From: philip.j.tromans@gmail.com > > To: user@hive.apache.org > > > > > Is your data heavily skewed towards certain values of a.x etc? > > > > On 15 October 2012 15:23, Saurabh Mishra > wrote: > > > The queries are simple joins, something on the lines of > > > select a, b, c, count(D) from tableA join tableB on a.x=b.y join.... > group > > > by a, b,c; > > > > > > > > >> From: liy099@gmail.com > > >> Date: Mon, 15 Oct 2012 21:10:39 +0800 > > >> Subject: Re: Hive Query Unable to distribute load evenly in reducers > > >> To: user@hive.apache.org > > > > > >> > > >> And your queries were? 
> > >> > > >> On Mon, Oct 15, 2012 at 8:09 PM, Saurabh Mishra > > >> wrote: > > >> > Hi, > > >> > I am firing some hive queries joining tables containing upto > 30millions > > >> > records each. Since the load on the reducers is very significant in > > >> > these > > >> > cases, i specifically set the following parameters before executing > the > > >> > queries : > > >> > > > >> > set mapred.reduce.tasks=100; > > >> > set hive.exec.reducers.bytes.per.reducer=500000000; > > >> > set hive.optimize.cp=true; > > >> > > > >> > The number of reducer the job spouts in now 160, but despite the > high > > >> > number > > >> > most of the load remains upon 1 or 2 reducers. Hence in the final > > >> > statistics, 158 reducers go completed with 2-3 minutes of start and > 2 > > >> > reducers took 2 hrs to run. > > >> > Is there any way to overcome this load distribution disparity. > > >> > Any help in this regards will be highly appreciated. > > >> > > > >> > Sincerely > > >> > Saurabh Mishra > --047d7b10c8c9519c6504cc264649 Content-Type: text/html; charset=ISO-8859-1 Content-Transfer-Encoding: quoted-printable How about using MapJoin?

2012/10/16 Saura= bh Mishra <saurabhmishra.iitg@outlook.com>
<= blockquote class=3D"gmail_quote" style=3D"margin:0 0 0 .8ex;border-left:1px= #ccc solid;padding-left:1ex">
no there is apparently no heavy skewing. also another= stats i wanted to point was, following is approximate table contents in th= is 4 table join query :
tableA : 170 million (actual number, + i am als= o exploding these records, so the number could be much much higher)
tableB:15
tableC:45
tableD:45
tableE : 45
tableF=A0 : 14000
=
Also i cannot put any filter condition on tableA ,situation does not pe= rmit so. :(
Kindly suggest, some alternative solution or some hive conf= iguration to better load distribute in the reducers

> Date: Mon, 15 Oct 2012 16:29:56 +0100

> Subject: Re: Hive Query Unable to distribute load evenly in r= educers
> From: philip.j.tromans@gmail.com
> To: user@hiv= e.apache.org

>
> Is your data heavi= ly skewed towards certain values of a.x etc?
>
> On 15 October= 2012 15:23, Saurabh Mishra <saurabhmishra.iitg@outlook.com> wrote:
> > The queries are simple joins, something on the lines of
> &= gt; select a, b, c, count(D) from tableA join tableB on a.x=3Db.y join.... = group
> > by a, b,c;
> >
> >
> >> Fr= om: liy099@gmail.com<= /a>
> >> Date: Mon, 15 Oct 2012 21:10:39 +0800
> >> Subjec= t: Re: Hive Query Unable to distribute load evenly in reducers
> >= > To:
user@hiv= e.apache.org
> >
> >>
> >> And your queries were?
> = >>
> >> On Mon, Oct 15, 2012 at 8:09 PM, Saurabh Mishra> >> <saurabhmishra.iitg@outlook.com> wrote:
> >> > Hi,
> >> > I am firing some hive queries = joining tables containing upto 30millions
> >> > records eac= h. Since the load on the reducers is very significant in
> >> &= gt; these
> >> > cases, i specifically set the following parameters befor= e executing the
> >> > queries :
> >> >
&g= t; >> > set mapred.reduce.tasks=3D100;
> >> > set h= ive.exec.reducers.bytes.per.reducer=3D500000000;
> >> > set hive.optimize.cp=3Dtrue;
> >> >
&g= t; >> > The number of reducer the job spouts in now 160, but despi= te the high
> >> > number
> >> > most of the = load remains upon 1 or 2 reducers. Hence in the final
> >> > statistics, 158 reducers go completed with 2-3 minutes o= f start and 2
> >> > reducers took 2 hrs to run.
> >= ;> > Is there any way to overcome this load distribution disparity. > >> > Any help in this regards will be highly appreciated.
= > >> >
> >> > Sincerely
> >> > Sa= urabh Mishra

--047d7b10c8c9519c6504cc264649--
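
For reference, the MapJoin suggestion in this message might look roughly like the sketch below. It assumes the small tables (tableB and the others with 15 to 14000 rows) fit comfortably in memory on each mapper. The thread only gives the join condition a.x=b.y, so the aliases a and b, the selected column, and the reduction to a single join are illustrative assumptions rather than the original query.

```sql
-- Sketch only: the real select list and the joins to tableC..tableF are
-- elided in the thread, so this shows just the tableA/tableB join.

-- Option 1: let Hive convert joins against small tables into
-- map-side joins automatically.
set hive.auto.convert.join=true;

-- Option 2: request a map-side join explicitly by naming the
-- small-table alias in the MAPJOIN hint.
select /*+ MAPJOIN(b) */ a.x, count(*)
from tableA a
join tableB b on (a.x = b.y)
group by a.x;
```

With a map-side join the small table is loaded into memory by each mapper, so the join itself no longer shuffles tableA rows to reducers by join key. The group by still runs in a reduce stage, so skew in the grouping keys can remain. Depending on the Hive version, the hint may be ignored in favour of automatic conversion.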