From: Julian Bui <julianbui@gmail.com>
To: user@hadoop.apache.org
Subject: Re: basic question about rack awareness and computation migration
Date: Tue, 5 Mar 2013 19:43:40 -0800

Thanks Harsh,

> Are your input lists big (for each compressed output)? And is the list
> arbitrary or a defined list per goal?

I dictate what my inputs will look like. If they need to be a list of image files, then I can do that. If they need to be the images themselves, as you suggest, then I can do that too, but I'm not exactly sure what that would look like. Basically, I will try to format my inputs in whatever way makes the most sense from a locality point of view.

Since all the keys must be writable, I explored the Writable interface and found these interesting sub-classes:

- FileSplit
- BlockLocation
- BytesWritable

These all look somewhat promising, as they reveal the location information of the files. I'm just not sure how I would use them to hint at the data locations. Since these chunks of the file appear to be somewhat arbitrary in size and offset, I don't know how I could perform imagery operations on them. For example, even if I knew that bytes 0x100-0x400 lie on node X, it would be hard to hand that information to my image libraries - does 0x100-0x400 correspond to some region/MBR within the image? I'm not sure how to make use of this information.
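To make that concrete, here's roughly how I've been poking at those block locations (an untested sketch; it just assumes the path given on the command line already exists in HDFS):

    import java.util.Arrays;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ShowBlockLocations {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path p = new Path(args[0]); // e.g. one of the raw image files
        FileStatus stat = fs.getFileStatus(p);
        // One BlockLocation per HDFS block; each one lists the hosts
        // that hold a replica of that byte range.
        for (BlockLocation b : fs.getFileBlockLocations(stat, 0, stat.getLen())) {
          System.out.println("offset=" + b.getOffset()
              + " length=" + b.getLength()
              + " hosts=" + Arrays.toString(b.getHosts()));
        }
      }
    }

So I can see which hosts hold which byte ranges - I just can't connect a byte range back to a meaningful region of the image.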
The responses I've gotten so far indicate that HDFS kind of does the computation migration for me, but that I have to give it enough information to work with. If someone could point me to some detailed reading about this subject, that would be pretty helpful, as I just can't find the documentation for it.
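In the meantime, here's my rough understanding of what "the images themselves as input" would look like, if I go that route. An input format that refuses to split should keep each image whole and still carry each file's block locations, so the scheduler can place each map task on or near a node holding the image. An untested sketch (the class names are mine; this is also where FileSplit shows up):

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.IOUtils;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.JobContext;
    import org.apache.hadoop.mapreduce.RecordReader;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;

    public class WholeImageInputFormat extends FileInputFormat<Text, BytesWritable> {

      @Override
      protected boolean isSplitable(JobContext context, Path file) {
        return false; // one split per image; never cut a file at a block boundary
      }

      @Override
      public RecordReader<Text, BytesWritable> createRecordReader(
          InputSplit split, TaskAttemptContext context) {
        return new WholeImageRecordReader();
      }

      // Emits exactly one record per file: (path, file bytes).
      // Assumes each image fits in memory (and under 2 GB).
      private static class WholeImageRecordReader
          extends RecordReader<Text, BytesWritable> {
        private FileSplit split;
        private Configuration conf;
        private final Text key = new Text();
        private final BytesWritable value = new BytesWritable();
        private boolean processed = false;

        @Override
        public void initialize(InputSplit split, TaskAttemptContext context) {
          this.split = (FileSplit) split;
          this.conf = context.getConfiguration();
        }

        @Override
        public boolean nextKeyValue() throws IOException {
          if (processed) {
            return false;
          }
          Path file = split.getPath();
          FileSystem fs = file.getFileSystem(conf);
          byte[] contents = new byte[(int) split.getLength()];
          FSDataInputStream in = fs.open(file);
          try {
            IOUtils.readFully(in, contents, 0, contents.length);
          } finally {
            in.close();
          }
          key.set(file.toString());
          value.set(contents, 0, contents.length);
          processed = true;
          return true;
        }

        @Override
        public Text getCurrentKey() { return key; }

        @Override
        public BytesWritable getCurrentValue() { return value; }

        @Override
        public float getProgress() { return processed ? 1.0f : 0.0f; }

        @Override
        public void close() { }
      }
    }

If I read FileInputFormat correctly, it fills in each split's preferred hosts from the file's block locations, so the "computation migration" happens in the scheduler without me writing any placement code.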

> Are your input lists big (for each co= mpressed output)? And is the list
arbitrary or a defined list per goal?<= div>
I dictate what my inputs will look like. =A0If they need= to be list of image files, then I can do that. =A0If they need to be the i= mages themselves as you suggest, then I can do that too but I'm not exa= ctly sure what that would look like. =A0Basically, I will try to format my = inputs in the way that makes the most sense from a locality point of view.<= /div>

Since all the keys must be writable, I explored the Wri= table interface and found the interesting sub-classes:=A0
    FileSplit
  • BlockLocation
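The map side would then just decode and re-encode each image, something like the sketch below (also untested; the /compressed output directory is a placeholder, and javax.imageio's default JPEG writer stands in for whatever decoder my raw format actually needs):

    import java.awt.image.BufferedImage;
    import java.io.ByteArrayInputStream;
    import java.io.IOException;
    import javax.imageio.ImageIO;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class CompressImageMapper
        extends Mapper<Text, BytesWritable, Text, NullWritable> {

      @Override
      protected void map(Text path, BytesWritable raw, Context context)
          throws IOException, InterruptedException {
        // Decode the raw bytes handed over by the record reader.
        BufferedImage img = ImageIO.read(
            new ByteArrayInputStream(raw.getBytes(), 0, raw.getLength()));

        // Write the compressed copy under a placeholder output directory.
        Path out = new Path("/compressed",
            new Path(path.toString()).getName() + ".jpg");
        FileSystem fs = out.getFileSystem(context.getConfiguration());
        FSDataOutputStream os = fs.create(out, true);
        try {
          ImageIO.write(img, "jpg", os); // default JPEG quality
        } finally {
          os.close();
        }
        context.write(path, NullWritable.get());
      }
    }

No reducer needed, which matches what I was after - just reliable, scheduled map tasks.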
Thanks again,
-Julian

On Tue, Mar 5, 2013 at 5:39 PM, Harsh J <harsh@cloudera.com> wrote:

> Your concern is correct: If your input is a list of files, rather than
> the files themselves, then the tasks would not be data-local - since
> the task input would just be the list of files, and the files' data
> may reside on any node/rack of the cluster.
>
> However, your job will still run as the HDFS reads do remote reads
> transparently without developer intervention and all will still work
> as you've written it to. If a block is found local to the DN, it is
> read locally as well - all of this is automatic.
>
> Are your input lists big (for each compressed output)? And is the list
> arbitrary or a defined list per goal?
>
> On Tue, Mar 5, 2013 at 5:19 PM, Julian Bui <julianbui@gmail.com> wrote:
> > Hi hadoop users,
> >
> > I'm trying to find out if computation migration is something the developer
> > needs to worry about or if it's supposed to be hidden.
> >
> > I would like to use hadoop to take in a list of image paths in the hdfs and
> > then have each task compress these large, raw images into something much
> > smaller - say jpeg files.
> >
> > Input: list of paths
> > Output: compressed jpeg
> >
> > Since I don't really need a reduce task (I'm more using hadoop for its
> > reliability and orchestration aspects), my mapper ought to just take the
> > list of image paths and then work on them. As I understand it, each image
> > will likely be on multiple data nodes.
> >
> > My question is how will each mapper task "migrate the computation" to the
> > data nodes? I recall reading that the namenode is supposed to deal with
> > this. Is it hidden from the developer? Or as the developer, do I need to
> > discover where the data lies and then migrate the task to that node? Since
> > my input is just a list of paths, it seems like the namenode couldn't really
> > do this for me.
> >
> > Another question: Where can I find out more about this? I've looked up
> > "rack awareness" and "computation migration" but haven't really found much
> > code relating to either one - leading me to believe I'm not supposed to have
> > to write code to deal with this.
> >
> > Anyway, could someone please help me out or set me straight on this?
> >
> > Thanks,
> > -Julian
>
> --
> Harsh J

--e89a8f22c38160721804d7396585--