hadoop-common-user mailing list archives

From Keith Wiley <kwi...@keithwiley.com>
Subject Re: Streaming data locality
Date Fri, 04 Feb 2011 03:56:52 GMT

On Feb 3, 2011, at 6:25 PM, Allen Wittenauer wrote:

> On Feb 3, 2011, at 9:16 AM, Keith Wiley wrote:
>> I've seen this asked before, but haven't seen a response yet.
>> If the input to a streaming job is not actual data splits but simple HDFS file names
which are then read by the mappers, then how can data locality be achieved?
> 	If I understand your question, the method of processing doesn't matter.  The JobTracker
places tasks based on input locality.  So if you are providing the names of the files you want
as input via -input, then the JT will use the locations of those blocks.

Let's see here.  My streaming job has a single -input flag which points to a text file containing
HDFS paths.  Each line contains one TAB.  Are you saying that if the key (or the value) on
either side of that TAB is an HDFS file path then that record will be assigned to a task in
a data local manner?  Which is it that determines this locality, the key or the value?  (Must
be the key, right?)

> (Remember: streaming.jar is basically a big wrapper around the Java methods and the parameters
you pass to it are essentially the same as you'd provide to a "real" Java app.)
> 	Or are you saying your -input is a list of other files to read?  In that case, there
is no locality.  But again, streaming or otherwise makes no real difference.

Yes, basically.  The input is a list of HDFS file paths to be read and processed on an individual
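
For concreteness, here is a minimal sketch of the kind of mapper being described. It assumes each input line is `key<TAB>hdfs_path` (the path on the value side is an assumption) and that the mapper pulls each file through the `hadoop fs -cat` command. The names `parse_record`, `hdfs_cat`, `run_mapper` and the injectable `fetch` argument are hypothetical, chosen only for illustration. With this layout the task is scheduled according to the block locations of the list file itself, so the reads of the listed paths may well be remote.

```python
import subprocess
import sys


def parse_record(line):
    """Split one streaming input line at the first TAB into (key, value)."""
    key, _, value = line.rstrip("\n").partition("\t")
    return key, value


def hdfs_cat(path):
    """Read a whole HDFS file via the `hadoop fs -cat` CLI (possibly a remote read)."""
    return subprocess.check_output(["hadoop", "fs", "-cat", path])


def run_mapper(lines, fetch=hdfs_cat, out=sys.stdout):
    """For each record, fetch the file named in the value and emit key<TAB>byte count."""
    for line in lines:
        if not line.strip():
            continue  # skip blank lines in the list file
        key, path = parse_record(line)
        data = fetch(path)  # assumption: the value side of the TAB holds the HDFS path
        out.write("%s\t%d\n" % (key, len(data)))


# As a streaming mapper script, finish with: run_mapper(sys.stdin)
```

The byte count is only a stand-in for the real per-file processing; passing `fetch` as a parameter makes it possible to exercise the record handling without a cluster.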

>> Likewise, is there any easier way to make those files accessible other than using
the -cacheFile flag?  
>> That requires building a very very long hadoop command (100s of files potentially).
 I'm worried about overstepping some command-line length limit...plus it would be nice to
do this programmatically, say with the DistributedCache.addCacheFile() method, but that requires
writing your own driver, which I don't see how to do with streaming.
>> Thoughts?
> 	I think you need to give a more concrete example of what you are doing.  -cache is used
for sending files with your job and should have no bearing on what your input is to your job.
 Something tells me that you've cooked something up that is overly complex. :D

Good point, I'll write a better description of this later.  Thanks for the advice.
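
On the command-length worry quoted above, one possible workaround is to generate the streaming command from a small script rather than typing it. The sketch below is only that, a sketch: the helper names are hypothetical, the `uri#linkname` form is the one -cacheFile takes, and the length check is a rough estimate against the operating system's ARG_MAX (assumed to be available through `os.sysconf` on a POSIX system).

```python
import os


def cache_file_args(hdfs_uris):
    """Return ['-cacheFile', 'uri#linkname', ...] with one pair per HDFS URI."""
    args = []
    for uri in hdfs_uris:
        link = os.path.basename(uri)  # symlink name in the task's working directory
        args += ["-cacheFile", "%s#%s" % (uri, link)]
    return args


def build_streaming_command(streaming_jar, input_path, output_path, mapper, hdfs_uris):
    """Assemble the full argument list for a streaming job (hypothetical helper)."""
    cmd = ["hadoop", "jar", streaming_jar,
           "-input", input_path, "-output", output_path,
           "-mapper", mapper]
    return cmd + cache_file_args(hdfs_uris)


def fits_arg_limit(cmd, margin=4096):
    """Rough check: argument bytes plus environment bytes against ARG_MAX."""
    limit = os.sysconf("SC_ARG_MAX")  # POSIX only
    arg_bytes = sum(len(a.encode()) + 1 for a in cmd)
    env_bytes = sum(len(k) + len(v) + 2 for k, v in os.environ.items())
    return arg_bytes + env_bytes + margin < limit
```

The resulting list can be handed to `subprocess.call(cmd)` directly, which avoids shell quoting problems, although the kernel's limit on total argument size still applies.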

Keith Wiley     kwiley@keithwiley.com     keithwiley.com    music.keithwiley.com

"I used to be with it, but then they changed what it was.  Now, what I'm with
isn't it, and what's it seems weird and scary to me."
                                           --  Abe (Grandpa) Simpson
