hadoop-user mailing list archives

From Bertrand Dechoux <decho...@gmail.com>
Subject Re: basic question about rack awareness and computation migration
Date Thu, 07 Mar 2013 12:35:22 GMT
I might have missed something, but is there a reason for the input of the
mappers to be a list of files rather than the files themselves?
The usual way is to provide a path to the files that should be processed;
Hadoop will then figure out for you how best to use data locality.
Is there a reason for not doing that?
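
To illustrate what Hadoop does with that path, here is a toy sketch in plain
Java with no Hadoop dependency. The class LocalitySketch, its Block type and
the preferredHost method are hypothetical names made up for this mail, and the
real scheduler is more involved (it also considers racks and free task slots).
The idea is only that, once the framework knows which hosts hold the blocks of
a file, it can prefer to run the task where most of the bytes already are.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model of locality-aware placement; NOT Hadoop's actual scheduler. */
final class LocalitySketch {

    /** One block of a file: its length in bytes and the hosts holding a replica. */
    static final class Block {
        final long length;
        final List<String> hosts;

        Block(long length, List<String> hosts) {
            if (length < 0) {
                throw new IllegalArgumentException("length must be >= 0");
            }
            this.length = length;
            this.hosts = hosts;
        }
    }

    /** Returns the host storing the most bytes of the given blocks, or null if there are none. */
    static String preferredHost(List<Block> blocks) {
        // Sum, per host, the bytes of every block that has a replica there.
        Map<String, Long> bytesPerHost = new HashMap<>();
        for (Block block : blocks) {
            for (String host : block.hosts) {
                bytesPerHost.merge(host, block.length, Long::sum);
            }
        }
        // Pick the host with the most local bytes; break ties by host name.
        String best = null;
        long bestBytes = -1L;
        for (Map.Entry<String, Long> entry : bytesPerHost.entrySet()) {
            long bytes = entry.getValue();
            boolean wins = bytes > bestBytes
                    || (bytes == bestBytes && entry.getKey().compareTo(best) < 0);
            if (wins) {
                best = entry.getKey();
                bestBytes = bytes;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Hypothetical 300 MB file stored as three blocks, three replicas each.
        List<Block> blocks = Arrays.asList(
                new Block(128L, Arrays.asList("node-a", "node-b", "node-c")),
                new Block(128L, Arrays.asList("node-b", "node-c", "node-d")),
                new Block(44L, Arrays.asList("node-c", "node-e", "node-f")));
        System.out.println(preferredHost(blocks)); // prints node-c
    }
}
```

With a list of paths as input, the framework only sees the location of the
small list file, so it has nothing like `blocks` to work from for the images
themselves.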

How big is each image file? How are they stored?

You could create a non-splittable input format (it is a one-method override),
so that you are sure a single mapper will process the whole file.
Your mapper then simply compresses the image it is given; Hadoop will
use one mapper per file and deal with data locality by itself.
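
As a rough illustration of what non-splittable means for split planning, here
is a toy model in plain Java (no Hadoop dependency). SplitSketch and
computeSplits are hypothetical names, and the real FileInputFormat logic has
more detail than this; in actual Hadoop code you would get the same effect by
subclassing FileInputFormat and overriding isSplitable to return false.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of split planning; NOT Hadoop's actual FileInputFormat code. */
final class SplitSketch {

    /** Returns {offset, length} pairs describing the splits planned for one file. */
    static List<long[]> computeSplits(long fileLength, long blockSize, boolean splittable) {
        if (fileLength < 0 || blockSize <= 0) {
            throw new IllegalArgumentException("fileLength must be >= 0 and blockSize > 0");
        }
        List<long[]> splits = new ArrayList<>();
        if (!splittable) {
            // Non-splittable: one split covers the whole file, so one mapper sees every byte.
            splits.add(new long[] {0L, fileLength});
            return splits;
        }
        // Splittable: roughly one split per block, so a file can be cut at arbitrary offsets.
        for (long offset = 0; offset < fileLength; offset += blockSize) {
            splits.add(new long[] {offset, Math.min(blockSize, fileLength - offset)});
        }
        return splits;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        System.out.println(computeSplits(300 * mb, 128 * mb, true).size());  // prints 3
        System.out.println(computeSplits(300 * mb, 128 * mb, false).size()); // prints 1
    }
}
```

For a 300 MB image with 128 MB blocks, the splittable case yields three
splits, so three mappers would each see an arbitrary byte range of the image;
the non-splittable case yields one split covering the whole file.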



On Wed, Mar 6, 2013 at 4:43 AM, Julian Bui <julianbui@gmail.com> wrote:

> Thanks Harsh,
> > Are your input lists big (for each compressed output)? And is the list
> > arbitrary or a defined list per goal?
> I dictate what my inputs will look like.  If they need to be a list of image
> files, then I can do that.  If they need to be the images themselves as you
> suggest, then I can do that too but I'm not exactly sure what that would
> look like.  Basically, I will try to format my inputs in the way that makes
> the most sense from a locality point of view.
> Since all the keys must be writable, I explored the Writable interface and
> found the interesting sub-classes:
>    - FileSplit
>    - BlockLocation
>    - BytesWritable
> These all look somewhat promising as they kind of reveal the location
> information of the files.
> I'm not exactly sure how I would use these to hint at the data locations.
>  Since these chunks of the file appear to be somewhat arbitrary in size and
> offset, I don't know how I could perform imagery operations on them.  For
> example, if I knew that bytes 0x100-0x400 lie on node X, then that makes it
> difficult for me to use that information to give to my image libraries -
> does 0x100-0x400 correspond to some region/MBR within the image?  I'm not
> sure how to make use of this information.
> The responses I've gotten so far indicate to me that HDFS kind of does the
> computation migration for me but that I have to give it enough information
> to work with.  If someone could point to some detailed reading about this
> subject that would be pretty helpful, as I just can't find the
> documentation for it.
> Thanks again,
> -Julian
> On Tue, Mar 5, 2013 at 5:39 PM, Harsh J <harsh@cloudera.com> wrote:
>> Your concern is correct: If your input is a list of files, rather than
>> the files themselves, then the tasks would not be data-local - since
>> the task input would just be the list of files, and the files' data
>> may reside on any node/rack of the cluster.
>> However, your job will still run, as HDFS performs remote reads
>> transparently without developer intervention, and all will still work
>> as you've written it to. If a block is found local to the DN, it is
>> read locally as well - all of this is automatic.
>> Are your input lists big (for each compressed output)? And is the list
>> arbitrary or a defined list per goal?
>> On Tue, Mar 5, 2013 at 5:19 PM, Julian Bui <julianbui@gmail.com> wrote:
>> > Hi hadoop users,
>> >
>> > I'm trying to find out if computation migration is something the
>> > developer needs to worry about or if it's supposed to be hidden.
>> >
>> > I would like to use hadoop to take in a list of image paths in the hdfs
>> > and then have each task compress these large, raw images into something
>> > much smaller - say jpeg files.
>> >
>> > Input: list of paths
>> > Output: compressed jpeg
>> >
>> > Since I don't really need a reduce task (I'm more using hadoop for its
>> > reliability and orchestration aspects), my mapper ought to just take the
>> > list of image paths and then work on them.  As I understand it, each
>> > image will likely be on multiple data nodes.
>> >
>> > My question is how will each mapper task "migrate the computation" to
>> > the data nodes?  I recall reading that the namenode is supposed to deal
>> > with this.  Is it hidden from the developer?  Or as the developer, do I
>> > need to discover where the data lies and then migrate the task to that
>> > node?  Since my input is just a list of paths, it seems like the namenode
>> > couldn't really do this for me.
>> >
>> > Another question: Where can I find out more about this?  I've looked up
>> > "rack awareness" and "computation migration" but haven't really found
>> > much code relating to either one - leading me to believe I'm not
>> > supposed to have to write code to deal with this.
>> >
>> > Anyway, could someone please help me out or set me straight on this?
>> >
>> > Thanks,
>> > -Julian
>> --
>> Harsh J
