From: Ted Dunning
Date: Mon, 25 Mar 2013 08:26:14 +0100
Subject: Re:
To: "common-user@hadoop.apache.org"

I would agree with David that this is not normally a good idea.

There are situations, however, where you do need to control the location of the data and where the computation occurs. These requirements normally only come up in real-time or low-latency situations.

Ordinary Hadoop does not address those needs, by design. This allows Hadoop to have a much simpler implementation and to handle a varied, batch-oriented workload with fairly high efficiency.

If you really need real-time file update and access, and control over file locations, then you need to look beyond Hadoop to extensions such as MapR, which allow this control and have the required real-time file semantics.
On Mon, Mar 25, 2013 at 8:16 AM, David Parks wrote:

> Can I suggest an answer of "Yes, but you probably don't want to"?
>
> As a "typical user" of Hadoop you would not do this. Hadoop already
> chooses the best server to do the work based on the location of the data
> (a server that is available to do work and also has the data locally will
> generally be assigned that work). There are a couple of mechanisms by
> which you can do this. I'm not terribly familiar with either of them, so
> I'll just provide a brief introduction and you can research more deeply
> and ask more pointed questions.
>
> I believe there is some ability to "suggest" a good location to run a
> particular task in the InputFormat; if you extended, say, FileInputFormat,
> you could inject some kind of recommendation, but it wouldn't force Hadoop
> to do one thing or another; it would just be a recommendation. (A sketch
> of this idea appears after this message.)
>
> The next place I'd look is at the scheduler, but you're really going to
> get your hands dirty digging in there, and I doubt, from the tone of your
> email, that you'll have any interest in digging to that level.
>
> But mostly, I would suggest you explain your use case more thoroughly,
> and I bet you'll just be directed down a more logical path to accomplish
> your goals.
>
> David
>
>
> -----Original Message-----
> From: Fan Bai [mailto:fbai1@student.gsu.edu]
> Sent: Monday, March 25, 2013 5:24 AM
> To: user@hadoop.apache.org
> Subject:
>
>
> Dear Sir,
>
> I have a question about Hadoop: when I use Hadoop and MapReduce to finish
> a job (only one job here), can I control which node each file is
> processed on?
>
> For example, I have only one job, and this job has 10 files (so 10 mappers
> need to run). Among my servers, I have one head node and four worker
> nodes. My question is: can I control which node each of those 10 files is
> processed on? For example: file No. 1 on node 1, file No. 3 on node 2,
> file No. 5 on node 3, and file No. 8 on node 4.
>
> If I can do this, it means I can control the tasks. Does it also mean I
> can control where a file goes in the next round (I have a loop on the
> head node, so I can run another MapReduce job)? For example, file No. 5
> is processed on node 3 in the first round, and I can set file No. 5 to be
> processed on node 2 in the second round.
>
> If I cannot, does that mean that, for Hadoop, the node a file is processed
> on is a "black box" the user cannot control, on the grounds that the user
> does not need to control it and should just let HDFS handle the parallel
> work? In that case, Hadoop cannot control the tasks within one job, only
> across multiple jobs.
>
> Thank you so much!
>
>
>
> Fan Bai
> PhD Candidate
> Computer Science Department
> Georgia State University
> Atlanta, GA 30303
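A minimal sketch of the locality-hint mechanism David describes above, assuming the new-API (org.apache.hadoop.mapreduce) TextInputFormat. The class name HintedTextInputFormat, the placement policy, and the worker hostnames node1..node4 are all hypothetical, and the hint is advisory only: the scheduler prefers a split's listed hosts but may run the task elsewhere.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// Sketch of the "suggestion" mechanism: an InputFormat whose splits
// carry hand-picked location hints. The scheduler treats a split's
// locations as preferences for data-local placement, not guarantees.
public class HintedTextInputFormat extends TextInputFormat {

  // Hypothetical placement policy: map each input file onto one of a
  // fixed set of worker hostnames (the names are placeholders).
  private static String preferredHostFor(Path file) {
    String[] workers = { "node1", "node2", "node3", "node4" };
    return workers[(file.getName().hashCode() & Integer.MAX_VALUE)
                   % workers.length];
  }

  @Override
  public List<InputSplit> getSplits(JobContext job) throws IOException {
    List<InputSplit> hinted = new ArrayList<InputSplit>();
    for (InputSplit split : super.getSplits(job)) {
      FileSplit fs = (FileSplit) split;
      // Rebuild each split with our chosen host in place of the HDFS
      // block locations reported by the default implementation.
      hinted.add(new FileSplit(fs.getPath(), fs.getStart(), fs.getLength(),
          new String[] { preferredHostFor(fs.getPath()) }));
    }
    return hinted;
  }
}

A job would opt in with job.setInputFormatClass(HintedTextInputFormat.class). Even then, placement remains best-effort: if the preferred node has no free capacity, the task runs wherever the scheduler can fit it, which is exactly the limitation Ted and David describe.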