From: Todd Lipcon <todd@cloudera.com>
Date: Fri, 18 Dec 2009 11:03:24 -0800
Subject: Re: Throttling hive queries
To: hive-user@hadoop.apache.org
Message-ID: <45f85f70912181103t7063e248gda9954ff4c55101@mail.gmail.com>

Hi Sagi,

Sounds like you need CombineFileInputFormat. See:

https://issues.apache.org/jira/browse/HIVE-74

-Todd
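(A sketch of enabling this from the CLI, assuming a Hive build that includes the HIVE-74 patch; the split sizes are illustrative, and the mapred.* names assume a Hadoop version whose CombineFileInputFormat honors them, so verify against your build:)

hive > set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
hive > set mapred.max.split.size=256000000;
hive > set mapred.min.split.size.per.node=128000000;
hive > set mapred.min.split.size.per.rack=128000000;

Each combined split then packs many small files into a single map task instead of launching one task per file.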

On Fri, Dec 18, 2009 at 10:24 AM, Sagi, Lee <lsagi@shopping.com> wrote:
Yes, that's true. I have a process that runs and pulls 3 weblog files every hour from 10 servers... 10*3*24 = 720 (not all hours have all the files).
Lee Sagi | Data Warehouse Tech Lead & Architect | Work: 650-616-6575 | Cell: 718-930-7947


From: Todd Lipcon [mailto:todd@cloudera.com]
Sent: Thursday, December 17, 2009 4:24 PM
To: hive-user@hadoop.apache.org
Subject: Re: Throttling hive queries

Hi Sagi,

Any chance you're running on a directory that has 614 small files?

-Todd
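(A quick way to check the file count without leaving the Hive CLI; the warehouse path below is only a guess at where the table's data lives:)

hive > dfs -count /user/hive/warehouse/fct_prss ;

The -count output gives directory count, file count, and total bytes for the path.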

On Thu, Dec 17, 2009 at 2:30 PM, Sagi, Lee <lsagi@shopping.com> wrote:
Todd, Here is the job info.


Counter                                  Map          Reduce       Total
File Systems
  HDFS bytes read                        199,115,508  0            199,115,508
  HDFS bytes written                     0            9,665,472    9,665,472
  Local bytes read                       0            321,210,205  321,210,205
  Local bytes written                    204,404,812  321,210,205  525,615,017
Job Counters
  Launched reduce tasks                  0            0            1
  Rack-local map tasks                   0            0            614
  Launched map tasks                     0            0            37,130
  Data-local map tasks                   0            0            36,516
org.apache.hadoop.hive.ql.exec.FilterOperator$Counter
  PASSED                                 0            10,572       10,572
  FILTERED                               0            217,305      217,305
org.apache.hadoop.hive.ql.exec.MapOperator$Counter
  DESERIALIZE_ERRORS                     0            0            0
Map-Reduce Framework
  Reduce input groups                    0            429,557      429,557
  Combine output records                 0            0            0
  Map input records                      429,557      0            429,557
  Reduce output records                  0            0            0
  Map output bytes                       201,425,848  0            201,425,848
  Map input bytes                        199,115,508  0            199,115,508
  Map output records                     429,557      0            429,557
  Combine input records                  0            0            0
  Reduce input records                   0            429,557      429,557
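(Reading the counters: 429,557 map input records over 37,130 launched map tasks is roughly 12 records per mapper, and 199,115,508 input bytes over the same tasks is about 5 KB per mapper — each task pays full JVM startup cost to read a few kilobytes, which is the small-files signature Todd asks about above.)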
Lee Sagi | Data Warehouse Tech Lead & Architect | Work: 650-616-6575 | Cell: 718-930-7947


From: Todd Lipcon [mailto:todd@cloudera.com]
Sent: Thursday, December 17, 2009 12:18 PM

To: hive-user@hadoop.apache.org
Subject: Re: Throttling hive queries

Hi Lee,

The MapReduce framework in general makes it hard for you to assign fewer mappers than there are blocks in the input data when using FileInputFormat. Is your input set about 42 GB with a 64 MB block size, or 84 GB with a 128 MB block size?

-Todd
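(Concretely: the pre-0.20 FileInputFormat sizes each split roughly as max(mapred.min.split.size, min(goalSize, blockSize)), where goalSize is the total input bytes divided by the requested mapred.map.tasks. Because a split never exceeds one block, requesting fewer maps than there are blocks has no effect. A worked case with the 64 MB figure above:)

  input     = 42 GB at 64 MB blocks  ->  ~672 blocks
  request   : mapred.map.tasks = 500
  goalSize  = 42 GB / 500 ≈ 86 MB
  splitSize = max(minSize, min(86 MB, 64 MB)) = 64 MB  ->  ~672 map tasks, not 500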

On Thu, Dec 17, 2009 at 11:32 AM, Sagi, Lee <lsagi@shopping.com> wrote:
Here is the query that I am running, just in case someone has an idea of
how to improve it.

SELECT
      CONCAT(CONCAT('"', PRSS.DATE_KEY), '"'),
      CONCAT(CONCAT('"', PRSC.DATE_KEY), '"'),
      CONCAT(CONCAT('"', PRSS.VOTF_REQUEST_ID), '"'),
      CONCAT(CONCAT('"', PRSC.VOTF_REQUEST_ID), '"'),
      CONCAT(CONCAT('"', PRSS.PRS_REQUEST_ID), '"'),
      CONCAT(CONCAT('"', PRSC.PRS_REQUEST_ID), '"'),
      ...
      ...
      ...
 FROM
      FCT_PRSS PRSS FULL OUTER JOIN FCT_PRSC PRSC ON
      (PRSS.PRS_REQUEST_ID = PRSC.PRS_REQUEST_ID)
 WHERE (PRSS.date_key >= '2009121600' AND
        PRSS.date_key < '2009121700') OR
       (PRSC.date_key >= '2009121600' AND
        PRSC.date_key < '2009121700')
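(One incidental simplification, assuming a Hive version where CONCAT is variadic — it accepts any number of arguments — each nested pair can collapse into a single call:)

SELECT
      CONCAT('"', PRSS.DATE_KEY, '"'),
      CONCAT('"', PRSC.DATE_KEY, '"'),
      ...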


Lee Sagi | Data Warehouse Tech Lead & Architect | Work: 650-616-6575 |
Cell: 718-930-7947

-----Original Message-----
From: Edward Capriolo [mailto:edlinuxguru@gmail.com]
Sent: Thursday, December 17, 2009 11:03 AM
To: hive-user@hadoop.apache.org
Subject: Re: Throttling hive queries

You should be able to:

hive > set mapred.map.tasks=1000;
hive > set mapred.reduce.tasks=5;

In some cases the number of mappers is controlled by the input files (pre-Hadoop 0.20).
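(On actually forcing fewer mappers with plain FileInputFormat: the one knob that pushes the count down is raising the minimum split size past the block size, since split size is computed as max(min.split.size, min(goalSize, blockSize)). A sketch, trading some data locality for fewer tasks:)

hive > set mapred.min.split.size=268435456;

With 64 MB blocks this yields ~256 MB splits, i.e. roughly a quarter of the one-map-per-block task count.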


On Thu, Dec 17, 2009 at 1:58 PM, Sagi, Lee <lsagi@shopping.com> wrote:
> Is there a way to throttle hive queries?
>
> For example, I want to tell Hive to not use more than 1000 mappers and
> 5 reducers for a particular query (or session).
>


