Mailing-List: contact mapreduce-user-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
Reply-To: mapreduce-user@hadoop.apache.org
Received-SPF: pass (nike.apache.org: domain of yaron.gonen@gmail.com
 designates 209.85.214.48 as permitted sender)
MIME-Version: 1.0
In-Reply-To: 
 <CAOcnVr2QzZP6F10N2C5ETh77vUAkNARxQpSfuKNgFhgb92feUQ@mail.gmail.com>
References: 
 <CAKj4Onw-SZQ0fWNpB9khKKNFSY2_BEnMVkDdiSqT5xGhdU8J9A@mail.gmail.com>
	<CAOcnVr3zm3_JVH+vUUfhSsPCohvOZbW2DPWmvBhcSFr2xD-+qA@mail.gmail.com>
	<CAKj4OnxkE1fBcVZ-pcP6kMOUmV4DZGkAj5NtuJU3+bn78GH-SQ@mail.gmail.com>
	<CAOcnVr2QzZP6F10N2C5ETh77vUAkNARxQpSfuKNgFhgb92feUQ@mail.gmail.com>
Date: Mon, 6 Aug 2012 10:23:19 +0300
Message-ID: 
 <CAKj4Onzf-=gjNdZWneT8GSWqjrUpCNtQAjynu-mQYoxotES8wQ@mail.gmail.com>
Subject: Re: Keeping Map-Tasks alive
From: Yaron Gonen <yaron.gonen@gmail.com>
To: mapreduce-user@hadoop.apache.org
Content-Type: multipart/alternative; boundary=0015175df17090b45004c693c0fd

--0015175df17090b45004c693c0fd
Content-Type: text/plain; charset=ISO-8859-1

Thanks.
As I see it, this cannot be done in the MapReduce 1 framework without
changing TaskTracker and JobTracker.
The problem is that I'm not familiar with YARN at all... it might be
possible there.
Thanks again!
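
P.S. To make the iteration structure concrete, here is a minimal sketch in
plain Python, not Hadoop code. The function names (map_assign,
reduce_recenter, kmeans) and the sample points are illustrative assumptions
and are not part of any Hadoop or YARN API. It only shows that the input
points stay fixed while the centers change between rounds, which is why
re-reading the input in every job is the costly part.

```python
from collections import defaultdict
from math import dist


def map_assign(points, centers):
    """Map step: emit (index of nearest center, point) for every input point."""
    for p in points:
        nearest = min(range(len(centers)), key=lambda i: dist(p, centers[i]))
        yield nearest, p


def reduce_recenter(pairs, centers):
    """Reduce step: average the points assigned to each center."""
    groups = defaultdict(list)
    for idx, p in pairs:
        groups[idx].append(p)
    new_centers = list(centers)  # a center with no points keeps its position
    for idx, members in groups.items():
        new_centers[idx] = tuple(sum(c) / len(members) for c in zip(*members))
    return new_centers


def kmeans(points, centers, max_iters=20):
    """Repeat map + reduce until the centers stop moving.

    `points` is never modified; only `centers` changes between rounds.
    """
    for _ in range(max_iters):
        new_centers = reduce_recenter(map_assign(points, centers), centers)
        if new_centers == centers:
            break
        centers = new_centers
    return centers


# Hypothetical sample data: two obvious groups of 2-D points.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
final_centers = kmeans(points, [(1.0, 1.0), (9.0, 1.0)])
```

In MapReduce 1 each pass through the loop in kmeans would be a separate job,
so every round starts fresh map tasks that read the same points again.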

On Mon, Aug 6, 2012 at 1:21 AM, Harsh J <harsh@cloudera.com> wrote:

> Ah, my bad - I skipped over the K-Means part of your original post.
>
> There currently isn't a way to do this with the existing MR framework and
> APIs. A Reducer is initiated upon map completion, and the task JVM is
> discarded after the maps end. Perhaps you can use YARN to write something
> closer to what you want?
>
>
> On Mon, Aug 6, 2012 at 12:11 AM, Yaron Gonen <yaron.gonen@gmail.com> wrote:
>
>> Thanks for the fast reply, but I don't see how a custom record reader
>> will help.
>> Consider k-means again: the mappers need to stand by until all the
>> reducers have finished calculating the new cluster centers. Only then,
>> after the reducers finish their work, do the stand-by mappers come back
>> to life and perform their work.
>>
>>
>> On Sun, Aug 5, 2012 at 7:49 PM, Harsh J <harsh@cloudera.com> wrote:
>>
>>> Sure you can, as we provide pluggable code points via the API. Just
>>> write a custom record reader that doubles the work (the first round reads
>>> the actual input, the second round reads your known output and
>>> reiterates). In the mapper, separate the first-round and second-round
>>> logic via a flag.
>>>
>>>
>>> On Sun, Aug 5, 2012 at 4:17 PM, Yaron Gonen <yaron.gonen@gmail.com> wrote:
>>>
>>>> Hi,
>>>> Is there a way to keep a map task alive after it has finished its work,
>>>> so that it can later perform another task on the same input?
>>>> For example, consider the k-means clustering algorithm (k-means
>>>> description <http://en.wikipedia.org/wiki/K-means_clustering> and Hadoop
>>>> implementation <http://codingwiththomas.blogspot.co.il/2011/05/k-means-clustering-with-mapreduce.html>).
>>>> The only thing that changes between iterations is the cluster centers.
>>>> All the input points remain the same. Keeping the mapper alive and
>>>> performing the next round of map tasks on the same node would save a lot
>>>> of communication cost.
>>>>
>>>> Thanks,
>>>> Yaron
>>>>
>>>
>>>
>>>
>>> --
>>> Harsh J
>>>
>>
>>
>
>
> --
> Harsh J
>

--0015175df17090b45004c693c0fd--