hadoop-common-user mailing list archives

From Keith Wiley <kwi...@keithwiley.com>
Subject Re: Streaming data locality
Date Fri, 04 Feb 2011 03:52:38 GMT

On Feb 3, 2011, at 9:46 AM, Harsh J wrote:

> Hello,
> On Thu, Feb 3, 2011 at 10:46 PM, Keith Wiley <kwiley@keithwiley.com> wrote:
>> I've seen this asked before, but haven't seen a response yet.
>> If the input to a streaming job is not actual data splits but simple HDFS file names
>> which are then read by the mappers, then how can data locality be achieved?
> Also, if you're only looking to not split the files, you can pass in a

The files won't be split; they're only 6 MB.  I'm looking to get the files to my streaming
job somehow, and the method I've chosen is to send mere file NAMES via the streaming API and
have the streaming program open each file from HDFS through a symbolic link in the distributed
cache (the link presumably originating from -cacheFile).
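
As a rough illustration of that approach, here is a minimal Python sketch of a streaming mapper
that receives one file name per input line and opens the matching symbolic link in the task's
working directory. The function names and the `workdir` parameter are hypothetical, and the
sketch assumes each link carries the base name of the HDFS path it was created from.

```python
import os
import sys


def process_names(lines, workdir="."):
    """Yield (name, byte_count) for each file name found in the input lines."""
    for line in lines:
        name = line.strip()
        if not name:
            continue
        # Assumption: the symlink in the task's working directory is named
        # after the base name of the HDFS path (e.g. ...#img1.dat).
        link = os.path.join(workdir, os.path.basename(name))
        with open(link, "rb") as f:
            data = f.read()
        yield name, len(data)


def main():
    # Streaming mappers read records on stdin and emit
    # tab-separated key/value pairs on stdout.
    for name, size in process_names(sys.stdin):
        sys.stdout.write("%s\t%d\n" % (name, size))
```

A real mapper script would end with a call to main(); the byte count stands in for whatever
processing the streaming program actually does on each file.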

> custom FileInputFormat with isSplitable returning false? You'll lose
> completeness in locality cause of blocks not present in the chosen
> node though, yes -- But I believe that adding a hundred files to
> DistributedCache is not the solution, as the DistributedCache data is
> set to ALL the nodes AFAIK.

My understanding is that the -cacheFile option and the DistributedCache.addCacheFile() method
don't copy the entire file to the distributed cache, but rather make tiny symbolic links to
the actual HDFS file.  Correct?  If you don't think I should add 100s of files to the distributed
cache (or even 100s of links), then how else can I make the files available to my streaming

Put another way, do you know of another method by which to permit the streaming programs to
read files from HDFS?
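
One possibility, sketched here in Python under stated assumptions, is for the streaming program
to shell out to `hadoop fs -cat` for each file name it receives. The function names are
hypothetical, and the sketch assumes the `hadoop` launcher is on the PATH of every task node.
It does nothing for locality by itself: the task is placed according to the split containing
the file names, so the file contents may still be read over the network.

```python
import subprocess
import sys


def read_hdfs_file(path, cat_cmd=("hadoop", "fs", "-cat")):
    """Return the bytes of an HDFS file by shelling out to `hadoop fs -cat`."""
    # Assumption: the `hadoop` launcher is available on every task node.
    return subprocess.check_output(list(cat_cmd) + [path])


def map_names(lines, reader=read_hdfs_file):
    """Yield (path, byte_count) for each HDFS path found in the input lines."""
    for line in lines:
        path = line.strip()
        if path:
            yield path, len(reader(path))


def main():
    # Emit tab-separated key/value pairs, as streaming expects.
    for path, size in map_names(sys.stdin):
        sys.stdout.write("%s\t%d\n" % (path, size))
```

A real mapper script would end with a call to main(). The `reader` parameter exists only so
the loop can be exercised without a cluster.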


Keith Wiley     kwiley@keithwiley.com     keithwiley.com    music.keithwiley.com

"And what if we picked the wrong religion?  Every week, we're just making God
madder and madder!"
                                           --  Homer Simpson
