hadoop-common-user mailing list archives

From Deniz Demir <denizde...@me.com>
Subject Re: Working with MapFiles
Date Thu, 29 Mar 2012 15:43:41 GMT
Not sure if this helps in your use case, but you can put all the output files into the distributed
cache and then access them in the subsequent MapReduce job (in the driver code):

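	// Output directory of the previous MR job.
	String pstr = "hdfs://<output_path>/";
	// Assumes fs is the FileSystem for that path and job is the Job being configured.
	FileStatus[] files = fs.listStatus(new Path(pstr));
	for (FileStatus f : files) {
		// Skip subdirectories; add each plain file to the distributed cache.
		if (!f.isDir()) {
			DistributedCache.addCacheFile(f.getPath().toUri(), job.getConfiguration());
		}
	}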

I think you can also copy these files to a different location in DFS and then put them into the
distributed cache.


On Mar 29, 2012, at 8:05 AM, Ondřej Klimpera wrote:

> Hello,
> I have a MapFile as a product of a MapReduce job, and what I need to do is:
> 1. If MapReduce produced more splits as output, merge them into a single file.
> 2. Copy this merged MapFile to another HDFS location and use it as a distributed cache
> file for another MapReduce job.
> I'm wondering if it is even possible to merge MapFiles, given their nature, and
> use them as a distributed cache file.
> What I'm trying to achieve is repeated fast search in this file during another MapReduce job.
> If my idea is absolutely wrong, can you give me any tips on how to do it?
> The file is supposed to be 20MB large.
> I'm using Hadoop 0.20.203.
> Thanks for your reply:)
> Ondrej Klimpera
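
On point 1 of the quoted question: a MapFile keeps its entries sorted by key, so merging several
parts into one comes down to a k-way merge of already-sorted runs. Below is a minimal sketch of
that idea in plain Java. It is not Hadoop code: in-memory lists stand in for the MapFile parts,
keys and values are assumed to be Strings, and the names SortedPartMerger and merge are
hypothetical. On a real cluster the parts would be read with MapFile.Reader and the result
written with MapFile.Writer, which rejects keys appended out of order.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

// Hypothetical helper: merges several key-sorted parts into one key-sorted list.
class SortedPartMerger {

    // Tracks the read position inside one sorted part.
    private static final class Cursor {
        final List<Map.Entry<String, String>> part;
        int pos = 0;

        Cursor(List<Map.Entry<String, String>> part) {
            this.part = part;
        }

        Map.Entry<String, String> head() {
            return part.get(pos);
        }
    }

    // Assumes every input part is already sorted by key, as MapFile parts are.
    static List<Map.Entry<String, String>> merge(List<List<Map.Entry<String, String>>> parts) {
        // The heap always exposes the cursor whose current key is smallest.
        PriorityQueue<Cursor> heap = new PriorityQueue<>(
                (a, b) -> a.head().getKey().compareTo(b.head().getKey()));
        for (List<Map.Entry<String, String>> part : parts) {
            if (!part.isEmpty()) {
                heap.add(new Cursor(part));
            }
        }

        List<Map.Entry<String, String>> merged = new ArrayList<>();
        while (!heap.isEmpty()) {
            Cursor c = heap.poll();
            merged.add(c.head());
            c.pos++;
            // Re-insert the cursor only while its part still has entries left.
            if (c.pos < c.part.size()) {
                heap.add(c);
            }
        }
        return merged;
    }
}
```

Because the output comes out in key order, it could be appended entry by entry to a single
writer. For a file of about 20MB the merge could also run in one process outside MapReduce.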
