hadoop-common-user mailing list archives

From John Armstrong <john.armstr...@ccri.com>
Subject Re: DistributedCache
Date Tue, 07 Jun 2011 13:52:24 GMT
On Tue, 7 Jun 2011 09:41:21 -0300, "Juan P." <gordoslocos@gmail.com> wrote:
> Not 100% clear on what you meant. Are you saying I should put the file in
> my HDFS cluster, or should I use DistributedCache? If you suggest the
> latter, could you address my original question?

I mean that you can certainly get away with putting information into a
known place on HDFS and loading it in each mapper or reducer, but that may
become very inefficient as your problem scales up, since every task then
reads the file from HDFS itself.  Mostly I was responding to Shi Yu's
question about why the DistributedCache (DC) is even worth using at all.

As to your question, here's how I do it, which I think I basically lifted
from an example in Hadoop: The Definitive Guide.  There may be better ways,
though.

In my setup, I put files into the DC by getting Path objects (which should
be able to reference either HDFS or local filesystem files, though I always
have my files on HDFS to start) and using

  DistributedCache.addCacheFile(path.toUri(), conf);

Then within my mapper or reducer I retrieve all the cached files with

  Path[] cacheFiles = DistributedCache.getLocalCacheFiles(conf);

IIRC, this is what you were doing.  The catch is that this returns all the
cached files, not just the one I want, although they are now in a working
directory on the local filesystem.  Luckily, I know the filename of the
file I want, so I iterate over the array and return the match:

  for (Path cachePath : cacheFiles) {
    if (cachePath.getName().equals(cachedFilename)) {
      return cachePath;
    }
  }

Then I've got the path to the local filesystem copy of the file I want in
hand and I can do whatever I want with it.
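
As a minimal sketch, the lookup described above can be wrapped in a small
helper. To keep it runnable without a Hadoop installation, this version uses
java.nio.file.Path as a stand-in for org.apache.hadoop.fs.Path; the class name
CacheLookup, the method findCachedFile, and the sample file names are all
hypothetical. With Hadoop's Path, the name comparison would use getName() as
in the loop above. The null check is an assumption worth verifying against
your Hadoop version: it appears that getLocalCacheFiles can return null when
nothing has been cached.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Optional;

// Hypothetical helper; java.nio.file.Path stands in for Hadoop's Path here.
class CacheLookup {

    // Returns the first path whose final component equals cachedFilename,
    // or an empty Optional if the array is null or holds no such file.
    static Optional<Path> findCachedFile(Path[] cacheFiles, String cachedFilename) {
        if (cacheFiles == null) {
            // Assumed case: nothing was added to the cache.
            return Optional.empty();
        }
        for (Path cachePath : cacheFiles) {
            Path name = cachePath.getFileName(); // null for a root path
            if (name != null && name.toString().equals(cachedFilename)) {
                return Optional.of(cachePath);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Sample local paths, standing in for what the cache would return.
        Path[] cacheFiles = {
            Paths.get("/tmp/work/stopwords.txt"),
            Paths.get("/tmp/work/lookup.dat")
        };
        System.out.println(findCachedFile(cacheFiles, "lookup.dat").isPresent());
        System.out.println(findCachedFile(cacheFiles, "missing.dat").isPresent());
    }
}
```

Returning an Optional rather than a bare path makes the "file was never
cached" case explicit, so the caller has to decide what to do about it
instead of hitting a NullPointerException later.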

