From: Todd Lipcon <todd@cloudera.com>
Date: Fri, 18 Dec 2009 11:03:24 -0800
Subject: Re: Throttling hive queries
To: hive-user@hadoop.apache.org
Message-ID: <45f85f70912181103t7063e248gda9954ff4c55101@mail.gmail.com>

Hi Sagi,

Sounds like you need CombineFileInputFormat. See:

https://issues.apache.org/jira/browse/HIVE-74

-Todd
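(A sketch of enabling this from the CLI, assuming a Hive build that includes the HIVE-74 patch; the split sizes are illustrative, and the mapred.* names assume a Hadoop version whose CombineFileInputFormat honors them, so verify against your build:)

hive > set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
hive > set mapred.max.split.size=256000000;
hive > set mapred.min.split.size.per.node=128000000;
hive > set mapred.min.split.size.per.rack=128000000;

Each combined split then packs many small files into a single map task instead of launching one task per file.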

On Fri, Dec 18, 2009 at 10:24 AM, Sagi, Lee <lsagi@shopping.com> wrote:
Yes, that's true. I have a process that runs and pulls 3 weblog files every hour from 10 servers... 10*3*24 = 720 (not all hours have all the files).
Lee Sagi | Data Warehouse Tech Lead & Architect | Work: 650-616-6575 | Cell: 718-930-7947


From: Todd Lipcon [mailto:todd@cloudera.com]
Sent: Thursday, December 17, 2009 4:24 PM
To: hive-user@hadoop.apache.org
Subject: Re: Throttling hive queries

Hi Sagi,

Any chance you're running on a directory that has 614 small files?

-Todd
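(A quick way to check the file count without leaving the Hive CLI; the warehouse path below is only a guess at where the table's data lives:)

hive > dfs -count /user/hive/warehouse/fct_prss ;

The -count output gives directory count, file count, and total bytes for the path.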

On Thu, Dec 17, 2009 at 2:30 PM, Sagi, Lee <lsagi@shopping.com> wrote:
Todd, Here is the job info.


Counter                                  Map          Reduce       Total
File Systems
  HDFS bytes read                        199,115,508  0            199,115,508
  HDFS bytes written                     0            9,665,472    9,665,472
  Local bytes read                       0            321,210,205  321,210,205
  Local bytes written                    204,404,812  321,210,205  525,615,017
Job Counters
  Launched reduce tasks                  0            0            1
  Rack-local map tasks                   0            0            614
  Launched map tasks                     0            0            37,130
  Data-local map tasks                   0            0            36,516
org.apache.hadoop.hive.ql.exec.FilterOperator$Counter
  PASSED                                 0            10,572       10,572
  FILTERED                               0            217,305      217,305
org.apache.hadoop.hive.ql.exec.MapOperator$Counter
  DESERIALIZE_ERRORS                     0            0            0
Map-Reduce Framework
  Reduce input groups                    0            429,557      429,557
  Combine output records                 0            0            0
  Map input records                      429,557      0            429,557
  Reduce output records                  0            0            0
  Map output bytes                       201,425,848  0            201,425,848
  Map input bytes                        199,115,508  0            199,115,508
  Map output records                     429,557      0            429,557
  Combine input records                  0            0            0
  Reduce input records                   0            429,557      429,557
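(Reading the counters: 429,557 map input records over 37,130 launched map tasks is roughly 12 records per mapper, and 199,115,508 input bytes over the same tasks is about 5 KB per mapper — each task pays full JVM startup cost to read a few kilobytes, which is the small-files signature Todd asks about above.)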
Lee Sagi | Data Warehouse Tech Lead & Architect | Work: 650-616-6575 | Cell: 718-930-7947


From: Todd Lipcon [mailto:todd@cloudera.com]
Sent: Thursday, December 17, 2009 12:18 PM

To: hive-user@hadoop.apache.org
Subject: Re: Throttling hive queries

Hi Lee,

The MapReduce framework in general makes it hard for you to assign fewer mappers than there are blocks in the input data when using FileInputFormat. Is your input set about 42 GB with a 64 MB block size, or 84 GB with a 128 MB block size?

-Todd
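(Concretely: the pre-0.20 FileInputFormat sizes each split roughly as max(mapred.min.split.size, min(goalSize, blockSize)), where goalSize is the total input bytes divided by the requested mapred.map.tasks. Because a split never exceeds one block, requesting fewer maps than there are blocks has no effect. A worked case with the 64 MB figure above:)

  input     = 42 GB at 64 MB blocks  ->  ~672 blocks
  request   : mapred.map.tasks = 500
  goalSize  = 42 GB / 500 ≈ 86 MB
  splitSize = max(minSize, min(86 MB, 64 MB)) = 64 MB  ->  ~672 map tasks, not 500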

On Thu, Dec 17, 2009 at 11:32 AM, Sagi, Lee <lsagi@shopping.com> wrote:
Here is the query that I am running, just in case someone has an idea of
how to improve it.

SELECT
      CONCAT(CONCAT('"', PRSS.DATE_KEY), '"'),
      CONCAT(CONCAT('"', PRSC.DATE_KEY), '"'),
      CONCAT(CONCAT('"', PRSS.VOTF_REQUEST_ID), '"'),
      CONCAT(CONCAT('"', PRSC.VOTF_REQUEST_ID), '"'),
      CONCAT(CONCAT('"', PRSS.PRS_REQUEST_ID), '"'),
      CONCAT(CONCAT('"', PRSC.PRS_REQUEST_ID), '"'),
      ...
      ...
      ...
 FROM
      FCT_PRSS PRSS FULL OUTER JOIN FCT_PRSC PRSC ON
      (PRSS.PRS_REQUEST_ID = PRSC.PRS_REQUEST_ID)
 WHERE (PRSS.date_key >= '2009121600' AND
        PRSS.date_key < '2009121700') OR
       (PRSC.date_key >= '2009121600' AND
        PRSC.date_key < '2009121700')
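(One incidental simplification, assuming a Hive version where CONCAT is variadic — it accepts any number of arguments — each nested pair can collapse into a single call:)

SELECT
      CONCAT('"', PRSS.DATE_KEY, '"'),
      CONCAT('"', PRSC.DATE_KEY, '"'),
      ...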


Lee Sagi | Data Warehouse Tech Lead & Architect | Work: 650-616-6575 |
Cell: 718-930-7947

-----Original Message-----
From: Edward Capriolo [mailto:edlinuxguru@gmail.com]
Sent: Thursday, December 17, 2009 11:03 AM
To: hive-user@hadoop.apache.org
Subject: Re: Throttling hive queries

You should be able to:

hive > set mapred.map.tasks=1000;
hive > set mapred.reduce.tasks=5;

In some cases the number of mappers is controlled by the input files (pre-Hadoop 0.20).
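(On actually forcing fewer mappers with plain FileInputFormat: the one knob that pushes the count down is raising the minimum split size past the block size, since split size is computed as max(min.split.size, min(goalSize, blockSize)). A sketch, trading some data locality for fewer tasks:)

hive > set mapred.min.split.size=268435456;

With 64 MB blocks this yields ~256 MB splits, i.e. roughly a quarter of the one-map-per-block task count.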


On Thu, Dec 17, 2009 at 1:58 PM, Sagi, Lee <lsagi@shopping.com> wrote:
> Is there a way to throttle hive queries?
>
> For example, I want to tell Hive to not use more than 1000 mappers and
> 5 reducers for a particular query (or session).
>


