Mailing-List: contact user-help@hive.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@hive.apache.org
Received-SPF: neutral (athena.apache.org: local policy)
From: Sreekanth Ramakrishnan <sreerama@yahoo-inc.com>
To: Rosanna Man <rosanna@auditude.com>,
        "user@hive.apache.org"
	<user@hive.apache.org>
Date: Mon, 2 May 2011 09:12:04 +0530
Subject: Re: Using capacity scheduler 
Thread-Topic: Using capacity scheduler 
Thread-Index: AcwF0J/YowC5ZpWceEuYTyT04XGKjQASjGfxACLFNRoAdUAoGA==
Message-ID: <C9E425E4.ADCD%sreerama@yahoo-inc.com>
In-Reply-To: <C9E06341.198D%rosanna@auditude.com>
Accept-Language: en-US
Content-Language: en
acceptlanguage: en-US
Content-Type: multipart/alternative;
	boundary="_000_C9E425E4ADCDsreeramayahooinccom_"
MIME-Version: 1.0

--_000_C9E425E4ADCDsreeramayahooinccom_
Content-Type: text/plain; charset="iso-8859-1"
Content-Transfer-Encoding: quoted-printable


The design goal of CapacityScheduler is to maximize the utilization of
cluster resources; it does not fairly allocate the share among all the
users present in the system.

The user limit determines the number of concurrent users who can use the
slots in a queue. These limits are elastic, though: because there is no
preemption, new tasks are allotted slots only as slots get freed up, so
the user limit is met gradually.

For your requirement, you could submit the large jobs to a queue that
has a max task limit set, so that your long-running jobs don't take up
the whole of the cluster capacity, and submit the shorter, smaller jobs
to a fast-moving queue with something like a 10% user limit, which
allows 10 concurrent users per queue.
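
As a rough illustration of the 10% figure above, here is a small Python
sketch of the arithmetic. It assumes a simplified model in which a queue
is split evenly among its active users and one user's share never drops
below the configured user limit; the function names are hypothetical and
are not part of Hadoop.

```python
def per_user_share_percent(active_users, min_user_limit_percent):
    """Upper bound on one user's share of a queue, in percent.

    Assumed model: the queue is split evenly among the active users,
    but a user's share never drops below the configured user limit.
    """
    if active_users < 1:
        raise ValueError("need at least one active user")
    return max(100.0 / active_users, float(min_user_limit_percent))


def max_concurrent_users(min_user_limit_percent):
    """How many users can hold their full user-limit share at once."""
    return int(100 // min_user_limit_percent)


# Show how the per-user bound shrinks as more users submit jobs.
for users in (1, 2, 5, 10, 20):
    print(users, per_user_share_percent(users, 10))

print(max_concurrent_users(10))
```

Under this model, users beyond the tenth have to wait for slots to free
up, since nothing is preempted.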

The actual distribution of capacity across longer and shorter jobs
depends on your workload.


On 4/30/11 1:14 AM, "Rosanna Man" <rosanna@auditude.com> wrote:

Hi Sreekanth,

Thank you very much for your clarification. Setting the max task limits
on queues will work, but can we do something with the max user limit? Is
it preemptible as well? We are exploring the possibility of running the
queries as different users under the capacity scheduler to maximize the
use of the resources.

Basically, our goal is to maximize resource usage (mappers and reducers)
while providing a fair share to the short tasks while a big task is
running. How do you normally achieve that?

Thanks,
Rosanna

On 4/28/11 8:09 PM, "Sreekanth Ramakrishnan" <sreerama@yahoo-inc.com> wrote:

Hi

Currently CapacityScheduler does not have preemption. So basically, when
Job1 starts finishing and freeing up slots, Job2's tasks will start
getting scheduled. One way to keep queue capacities from being elastic
is to set max task limits on queues. That way your Job1 will never
exceed the first queue's capacity.


On 4/28/11 11:48 PM, "Rosanna Man" <rosanna@auditude.com> wrote:

Hi all,

We are using the capacity scheduler to schedule resources among
different queues for 1 user (hadoop) only. We have set the queues to
have an equal share of the resources. However, when the 1st task starts
in the first queue and consumes all the resources, the 2nd task, which
starts in the 2nd queue, is starved of reducers until the first task has
finished. A lot of processing gets stuck while a large query is
executing.

We are using 0.20.2 hive on Amazon AWS. We tried the Fair Scheduler
before, but it gives an error when the mapper produces no output (which
is fine in our use cases).

Can anyone give us some advice?

Thanks,
Rosanna


--
Sreekanth Ramakrishnan

--_000_C9E425E4ADCDsreeramayahooinccom_--