hadoop-mapreduce-user mailing list archives

From "Guttadauro, Jeff" <jeff.guttada...@here.com>
Subject RE: Accessing files in Hadoop 2.7.2 Distributed Cache
Date Tue, 07 Jun 2016 22:20:03 GMT
Hi, Siddharth.

I was also a bit frustrated at what I found to be scant documentation on how to use the distributed
cache in Hadoop 2.  The DistributedCache class itself was deprecated in Hadoop 2, but there
don’t appear to be very clear instructions on the alternative.  I think it’s actually
much simpler to work with files on the distributed cache in Hadoop 2.  The new way is to add
files to the cache (or cacheArchive) via the Job object:
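(The snippet appears to have been dropped from the archive. A minimal sketch of that driver-side setup follows; the job name and paths are placeholders, and the Hadoop calls are shown as comments so the snippet stands alone on the standard library. The "#" fragment trick is plain java.net.URI behavior:)

```java
import java.net.URI;

public class CacheUriDemo {
    public static void main(String[] args) throws Exception {
        // A cache URI whose "#" fragment names the symlink Hadoop creates
        // in each task's working directory.
        URI cacheUri = new URI("hdfs://localhost:9000/bloomfilter#bloomfilter");
        System.out.println(cacheUri.getFragment());  // prints: bloomfilter

        // In the driver, with the Hadoop 2 org.apache.hadoop.mapreduce API:
        //   Job job = Job.getInstance(conf, "my-job");           // job name is a placeholder
        //   job.addCacheFile(cacheUri);                          // replaces DistributedCache.addCacheFile
        //   job.addCacheArchive(new URI("hdfs://.../a.zip#dir")); // the archive variant
    }
}
```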


The cool part is that, if you set up your URI so that it has a “#yourFileReference” at
the end, then Hadoop will set up a symbolic link named “yourFileReference” in your job’s
working directory, which you can use to get at the file or archive.  So, it’s as if the
file or archive were in the working directory.  That obviates the need to even work with the
DistributedCache class in your Mapper or Reducer, since you can just open the file directly (or
via a java.nio.file.Path).
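For example, the task-side read then becomes ordinary file I/O. A sketch (the class and helper names are illustrative, and the symlink Hadoop would create is simulated here with a local file so the snippet runs on its own):

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public class CachedFileRead {
    // What a Mapper's setup() could do: open the cached file through the
    // symlink name given by the URI fragment (e.g. "#bloomfilter").
    static String readFirstLine(String symlinkName) throws IOException {
        try (BufferedReader in = new BufferedReader(new FileReader(symlinkName))) {
            return in.readLine();
        }
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for the symlink Hadoop creates in the task working directory.
        Files.write(Paths.get("bloomfilter"), "filter-bytes\n".getBytes());
        System.out.println(readFirstLine("bloomfilter"));  // prints: filter-bytes
        Files.delete(Paths.get("bloomfilter"));
    }
}
```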

Hope that helps.
From: Siddharth Dawar [mailto:siddharthdawar17@gmail.com]
Sent: Tuesday, June 07, 2016 4:06 AM
To: user@hadoop.apache.org
Subject: Accessing files in Hadoop 2.7.2 Distributed Cache

I want to use the distributed cache to allow my mappers to access data in Hadoop 2.7.2. In
my main method, I'm using the following code:

String hdfs_path="hdfs://localhost:9000/bloomfilter";

InputStream in = new BufferedInputStream(new FileInputStream("/home/siddharth/Desktop/data/bloom_filter"));

Configuration conf = new Configuration();

FileSystem fs = FileSystem.get(java.net.URI.create(hdfs_path), conf);

OutputStream out = fs.create(new Path(hdfs_path));

//Copy file from local to HDFS

IOUtils.copyBytes(in, out, 4096, true);

System.out.println(hdfs_path + " copied to HDFS");

DistributedCache.addCacheFile(new Path(hdfs_path).toUri(), conf2);

The above code adds a file present on my local file system to HDFS and adds it to the distributed
cache.

However, in my mapper code, when I try to access the file stored in the distributed cache, the
Path[] p variable gets a null value.

public void configure(JobConf conf) {

       this.conf = conf;

       try {

              Path[] p = DistributedCache.getLocalCacheFiles(conf);

       } catch (IOException e) {

              e.printStackTrace();

       }
}

Even when I tried to access the distributed cache from the following code

in my mapper, I got an error saying that the bloomfilter file doesn't exist:

strm = new DataInputStream(new FileInputStream("bloomfilter"));

// Read into our Bloom filter.



However, I read somewhere that if we add a file to the distributed cache, we can access it
directly from its name.

Can you please help me out ?
