From: Ted Dunning
Date: Mon, 25 Mar 2013 08:26:14 +0100
Subject: Re:
To: "common-user@hadoop.apache.org"

I would agree with David that this is not normally a good idea.

There are situations, however, where you do need to control the location of the data and where the computation occurs. These requirements normally only come up in real-time or low-latency situations.

Ordinary Hadoop does not address those needs, by design. This allows Hadoop to have a much simpler implementation and to handle a varied, batch-oriented workload with fairly high efficiency.

If you really need real-time file update and access, and control over file locations, then you need to look beyond Hadoop to extensions such as MapR, which allow this control and have the required real-time file semantics.
On Mon, Mar 25, 2013 at 8:16 AM, David Parks wrote:

> Can I suggest an answer of "Yes, but you probably don't want to"?
>
> As a "typical user" of Hadoop you would not do this. Hadoop already
> chooses the best server to do the work based on the location of the data
> (a server that is available to do work and also has the data locally will
> generally be assigned that work). There are a couple of mechanisms by
> which you can do this. I'm not terribly familiar with either of them, so
> I'll just provide a brief introduction and you can research more deeply
> and ask more pointed questions.
>
> I believe there is some ability to "suggest" a good location to run a
> particular task in the InputFormat; if you extended, say, FileInputFormat,
> you could inject some kind of recommendation, but it wouldn't force Hadoop
> to do one thing or another; it would just be a recommendation. (A sketch
> of this idea appears after this message.)
>
> The next place I'd look is at the scheduler, but you're really going to
> get your hands dirty digging in there, and I doubt, from the tone of your
> email, that you'll have any interest in digging to that level.
>
> But mostly, I would suggest you explain your use case more thoroughly,
> and I bet you'll just be directed down a more logical path to accomplish
> your goals.
>
> David
>
>
> -----Original Message-----
> From: Fan Bai [mailto:fbai1@student.gsu.edu]
> Sent: Monday, March 25, 2013 5:24 AM
> To: user@hadoop.apache.org
> Subject:
>
>
> Dear Sir,
>
> I have a question about Hadoop: when I use Hadoop and MapReduce to finish
> a job (only one job here), can I control which node each file is
> processed on?
>
> For example, I have only one job, and this job has 10 files (so 10 mappers
> need to run). Among my servers, I have one head node and four worker
> nodes. My question is: can I control which node each of those 10 files is
> processed on? For example: file No. 1 on node 1, file No. 3 on node 2,
> file No. 5 on node 3, and file No. 8 on node 4.
>
> If I can do this, it means I can control the tasks. Does it also mean I
> can control where a file goes in the next round (I have a loop on the
> head node, so I can run another MapReduce job)? For example, file No. 5
> is processed on node 3 in the first round, and I can set file No. 5 to be
> processed on node 2 in the second round.
>
> If I cannot, does that mean that, for Hadoop, the node a file is processed
> on is a "black box" the user cannot control, on the grounds that the user
> does not need to control it and should just let HDFS handle the parallel
> work? In that case, Hadoop cannot control the tasks within one job, only
> across multiple jobs.
>
> Thank you so much!
>
>
>
> Fan Bai
> PhD Candidate
> Computer Science Department
> Georgia State University
> Atlanta, GA 30303
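A minimal sketch of the locality-hint mechanism David describes above, assuming the new-API (org.apache.hadoop.mapreduce) TextInputFormat. The class name HintedTextInputFormat, the placement policy, and the worker hostnames node1..node4 are all hypothetical, and the hint is advisory only: the scheduler prefers a split's listed hosts but may run the task elsewhere.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// Sketch of the "suggestion" mechanism: an InputFormat whose splits
// carry hand-picked location hints. The scheduler treats a split's
// locations as preferences for data-local placement, not guarantees.
public class HintedTextInputFormat extends TextInputFormat {

  // Hypothetical placement policy: map each input file onto one of a
  // fixed set of worker hostnames (the names are placeholders).
  private static String preferredHostFor(Path file) {
    String[] workers = { "node1", "node2", "node3", "node4" };
    return workers[(file.getName().hashCode() & Integer.MAX_VALUE)
                   % workers.length];
  }

  @Override
  public List<InputSplit> getSplits(JobContext job) throws IOException {
    List<InputSplit> hinted = new ArrayList<InputSplit>();
    for (InputSplit split : super.getSplits(job)) {
      FileSplit fs = (FileSplit) split;
      // Rebuild each split with our chosen host in place of the HDFS
      // block locations reported by the default implementation.
      hinted.add(new FileSplit(fs.getPath(), fs.getStart(), fs.getLength(),
          new String[] { preferredHostFor(fs.getPath()) }));
    }
    return hinted;
  }
}

A job would opt in with job.setInputFormatClass(HintedTextInputFormat.class). Even then, placement remains best-effort: if the preferred node has no free capacity, the task runs wherever the scheduler can fit it, which is exactly the limitation Ted and David describe.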