Mailing-List: contact user-help@hive.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@hive.apache.org
Received-SPF: pass (athena.apache.org: domain of manish.hadoop.work@gmail.com
 designates 209.85.213.49 as permitted sender)
MIME-Version: 1.0
In-Reply-To: 
 <CABBNW9N0Tt7Ukauoi0908JZMk-hcRfYHGd5Et83+K-WQivM5zw@mail.gmail.com>
References: 
 <CAFAmJqdG5R0uWQnuB_0YxffnbdyjTVy-SUwF3gzfbZuiv8k0mA@mail.gmail.com>
	<CF7ADCA3.31A1%sanjay.subramanian@roberthalf.com>
	<CABBNW9N0Tt7Ukauoi0908JZMk-hcRfYHGd5Et83+K-WQivM5zw@mail.gmail.com>
Date: Sun, 27 Apr 2014 15:00:32 -0700
Message-ID: 
 <CAJ_ejR8omERJHOs318+1+WSiwonf=+df=NXw=as2bTTD7rSmcw@mail.gmail.com>
Subject: Re: Executing Hive Queries in Parallel
From: Manish Malhotra <manish.hadoop.work@gmail.com>
To: Hive <user@hive.apache.org>
Content-Type: multipart/alternative; boundary=001a11c2badae2f04404f80d5390

--001a11c2badae2f04404f80d5390
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable

What Sanjay and Swagatika suggested is spot on.

On top of that, the fundamentals are the same whether you run the Hive
query from the CLI or through some internal API like HiveDriver. The flow
is:

1. Compile the query.
2. Get the table info from the Hive Metastore (using Thrift or JDBC) and
   optimize the plan, if required and possible.
3. Generate the MapReduce (MR) jobs for the query plan.
4. Push the jobs (there may be more than one, executed in sequence) to
   the JobTracker.

Whether these MR jobs then run in parallel is decided at that final step,
based on the queue and the availability of MR slots on the cluster.

So it does not matter whether you run the queries with nohup hive -f, from
multiple machines, through Oozie, or from your own custom code.
It boils down to two things: your system/code must not submit the queries
in sequence (or wait for one to finish before submitting the next), and
your cluster must have enough resources to run the MR jobs in parallel.
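
To illustrate the "not submitting in sequence" part, here is a minimal
bash sketch, not a definitive recipe. It assumes the hive client is on
your PATH. submit_all and HIVE_BIN are hypothetical names used only for
this example (HIVE_BIN just lets you point the function at a different
client binary), and the .hql paths are placeholders.

```shell
#!/usr/bin/env bash
# Sketch: submit several Hive scripts at once instead of one after another.
# submit_all and HIVE_BIN are illustrative names, not part of Hive.

submit_all() {
  local hql
  for hql in "$@"; do
    # The trailing '&' returns control immediately, so the next query
    # is submitted without waiting for this one to finish.
    nohup "${HIVE_BIN:-hive}" -f "$hql" \
      >> "./$(basename "$hql")_$(date +%Y-%m-%d-%H-%M-%S).log" 2>&1 &
  done
  # Block until every background client has exited.
  wait
}

# Example call (placeholder paths):
#   submit_all path/to/query/file/hive1.hql path/to/query/file/hive2.hql
```

Every client is started in the background, so all queries reach the
JobTracker at roughly the same time. Whether the resulting MR jobs really
overlap still depends on the scheduler and the free slots on the cluster.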

Regards,
Manish


On Sun, Apr 27, 2014 at 1:58 PM, Swagatika Tripathy
<swagatikat856@gmail.com> wrote:

> Hi,
> You can also use Oozie's fork feature; Oozie acts as a workflow
> scheduler and can run jobs in parallel. You just need to define all your
> HQLs inside the fork in workflow.xml to make them run in parallel.
> On Apr 22, 2014 3:14 AM, "Subramanian, Sanjay (HQP)" <
> sanjay.subramanian@roberthalf.com> wrote:
>
>>   Hey
>>
>>  Instead of going into HIVE CLI
>>  I would propose 2 ways
>>
>>  *NOHUP*
>>  nohup hive -f path/to/query/file/hive1.hql \
>>    >> ./hive1.hql_`date +%Y-%m-%d-%H-%M-%S`.log 2>&1 &
>>  nohup hive -f path/to/query/file/hive2.hql \
>>    >> ./hive2.hql_`date +%Y-%m-%d-%H-%M-%S`.log 2>&1 &
>>  nohup hive -f path/to/query/file/hive3.hql \
>>    >> ./hive3.hql_`date +%Y-%m-%d-%H-%M-%S`.log 2>&1 &
>>  nohup hive -f path/to/query/file/hive4.hql \
>>    >> ./hive4.hql_`date +%Y-%m-%d-%H-%M-%S`.log 2>&1 &
>>  nohup hive -f path/to/query/file/hive5.hql \
>>    >> ./hive5.hql_`date +%Y-%m-%d-%H-%M-%S`.log 2>&1 &
>>
>>  Each statement above will launch MR jobs on your cluster and, depending
>> on the cluster configs, the jobs will run in parallel. The trailing &
>> puts each Hive client in the background so the next one starts right
>> away.
>>  Scheduling jobs on the MR cluster is independent of Hive.
>>
>>  *SCREEN sessions*
>>
>>    - Create a screen session
>>       - screen -S hive_query1
>>       - You are now inside the screen session hive_query1
>>          - hive -f path/to/query/file/hive1.hql
>>       - Ctrl-A D
>>          - You detach from the screen session
>>    - Repeat for each Hive query you want to run
>>       - I.e., say 5 screen sessions, each running a Hive query
>>    - To display the active screen sessions
>>       - screen -ls
>>    - To attach to a screen session
>>       - screen -x hive_query1
>>
>>
>>  Thanks
>>
>> Warm Regards
>>
>>
>>  Sanjay
>>
>>
>>    From: saurabh <mpp.databases@gmail.com>
>> Reply-To: "user@hive.apache.org" <user@hive.apache.org>
>> Date: Monday, April 21, 2014 at 1:53 PM
>> To: "user@hive.apache.org" <user@hive.apache.org>
>> Subject: Executing Hive Queries in Parallel
>>
>>
>>  Hi,
>>  I need some inputs on executing Hive queries in parallel. I tried doing
>> this using the CLI (by opening multiple ssh connections) and executed
>> 4 HQLs; the queries were observed to execute sequentially. All FOUR
>> queries got submitted; however, while the first one was executing, the
>> others were in a pending state. I was performing this activity on EMR
>> running in batch mode, hence I wasn't able to dig into the logs.
>>
>>  The Hive CLI uses a native Hive connection, which by default uses the
>> FIFO scheduler. This might be one of the reasons for the queries getting
>> executed in sequence.
>>
>>  I also observed that when multiple queries are executed using multiple
>> HUE sessions, they do execute in parallel. Can you please suggest how
>> this functionality of HUE can be replicated using the CLI?
>>
>>  I am aware of the Beeswax client; however, I am not sure how it can be
>> used during EMR batch-mode processing.
>>
>>  Thanks in advance for going through this. Kindly let me know your
>> thoughts on the same.
>>
>>

--001a11c2badae2f04404f80d5390--