hadoop-common-user mailing list archives

From Keith Wiley <kwi...@keithwiley.com>
Subject Streaming data locality
Date Thu, 03 Feb 2011 17:16:16 GMT
I've seen this asked before, but haven't seen a response yet.

If the input to a streaming job is not the actual data splits but simply HDFS file names,
which the mappers then open and read themselves, how can data locality be achieved?
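
To make that concrete, each map task effectively ends up doing something like the
following (a rough Java sketch; the class and variable names are just illustrative).
The scheduler placed the task using only the tiny name-list split, so nothing ties
this read to the nodes that actually hold the file's blocks:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReadByName {
        // What a streaming mapper effectively does when its input
        // "record" is just an HDFS path: open and read the file over
        // the cluster, with no locality guarantee for this task.
        public static void readOne(String hdfsPath) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            FSDataInputStream in = fs.open(new Path(hdfsPath));
            BufferedReader reader = new BufferedReader(new InputStreamReader(in));
            String line;
            while ((line = reader.readLine()) != null) {
                // ... process the record ...
            }
            reader.close();
        }
    }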

Likewise, is there any easier way to make those files accessible than the -cacheFile
flag? That requires building a very long hadoop command (potentially hundreds of
files), and I'm worried about overstepping some command-line length limit. Plus, it
would be nice to do this programmatically, say with the DistributedCache.addCacheFile()
method, but that requires writing your own driver, which I don't see how to do with
streaming.
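
For reference, this is roughly the driver I have in mind (a minimal sketch; the
input directory is hypothetical). Looping over the files is trivial in Java, but I
don't see where to hook such a driver into a streaming job:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class CacheFileDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            // Add every file under a (hypothetical) directory to the
            // distributed cache -- no giant -cacheFile command line.
            for (FileStatus stat : fs.listStatus(new Path("/user/kwiley/inputs"))) {
                DistributedCache.addCacheFile(stat.getPath().toUri(), conf);
            }
            // ... then configure and submit the job with this conf,
            // which is the part streaming doesn't seem to expose.
        }
    }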

Thoughts?

Thanks.

________________________________________________________________________________
Keith Wiley               kwiley@keithwiley.com               www.keithwiley.com

"I used to be with it, but then they changed what it was.  Now, what I'm with
isn't it, and what's it seems weird and scary to me."
  -- Abe (Grandpa) Simpson
________________________________________________________________________________



