hadoop-common-user mailing list archives

From Ioan Eugen Stan <stan.ieu...@gmail.com>
Subject Re: Working with MapFiles
Date Fri, 30 Mar 2012 10:49:39 GMT
Hello Ondrej,


On 29.03.2012 18:05, Ondřej Klimpera wrote:
> Hello,
>
> I have a MapFile as a product of MapReduce job, and what I need to do is:
>
> 1. If MapReduce produced multiple splits as output, merge them into a single file.
>
> 2. Copy this merged MapFile to another HDFS location and use it as a
> Distributed cache file for another MapReduce job.
> I'm wondering if it is even possible to merge MapFiles, given their
> nature, and then use the result as a distributed cache file.

A MapFile is actually two files [1]: a SequenceFile with sorted keys, 
and a small index for that file. To look up a key, the reader 
binary-searches the index for the closest indexed key at or before the 
target, then seek()s to that byte offset in the data file and scans 
forward.
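As a toy illustration (plain Java, not the Hadoop API): the index can be modeled as a sorted map from sampled keys to byte offsets, and a lookup finds the closest indexed key at or before the target, which is where the forward scan of the data file would begin. The keys and offsets below are made up for the example:

```java
import java.util.TreeMap;

// Toy model of a MapFile index: every N-th key of the sorted data file
// is recorded together with its byte offset. A lookup binary-searches
// this small index for the closest key at or before the target, seeks
// to that offset in the data file, and scans forward from there.
public class MapFileIndexToy {
    static final TreeMap<String, Long> index = new TreeMap<String, Long>();
    static {
        index.put("apple", 0L);     // sampled keys -> byte offsets
        index.put("mango", 4096L);
        index.put("zebra", 8192L);
    }

    // Returns the offset to start scanning from, or null if the target
    // sorts before every indexed key.
    static Long seekOffset(String target) {
        java.util.Map.Entry<String, Long> e = index.floorEntry(target);
        return e == null ? null : e.getValue();
    }

    public static void main(String[] args) {
        System.out.println(seekOffset("banana")); // 0    (scan starts at "apple")
        System.out.println(seekOffset("mango"));  // 4096 (exact indexed hit)
        System.out.println(seekOffset("aaa"));    // null (before first key)
    }
}
```

Because the index is small, it fits in memory, and the expensive part of a lookup is a single seek plus a short scan in the data file.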

> What I'm trying to achieve is repeatedly fast search in this file during
> another MapReduce job.
> If my idea is absolute wrong, can you give me any tip how to do it?
>
> The file is supposed to be 20MB large.
> I'm using Hadoop 0.20.203.

If the file is that small you could load it entirely into memory to 
avoid network IO. Do that in the mapper's setup() method.
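A minimal sketch of that approach, assuming Text keys and values and a hypothetical HDFS path ("/cache/merged.map"); this uses the new-API Mapper, so adjust accordingly if you are on the old mapred API in 0.20:

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Sketch only: load a small MapFile entirely into memory once per task,
// then answer lookups from the in-memory map. Key/value types and the
// HDFS path are assumptions for the example.
public class LookupMapper
    extends Mapper<LongWritable, Text, Text, Text> {

  private final Map<String, String> lookup = new HashMap<String, String>();

  @Override
  protected void setup(Context context) throws IOException {
    Configuration conf = context.getConfiguration();
    FileSystem fs = FileSystem.get(conf);
    // Hypothetical path to the merged MapFile directory.
    MapFile.Reader reader =
        new MapFile.Reader(fs, "/cache/merged.map", conf);
    Text key = new Text();
    Text value = new Text();
    while (reader.next(key, value)) {   // stream every entry once
      lookup.put(key.toString(), value.toString());
    }
    reader.close();
  }

  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    // Example use: emit only input lines that appear in the MapFile.
    String hit = lookup.get(line.toString());
    if (hit != null) {
      context.write(line, new Text(hit));
    }
  }
}
```

At ~20MB the whole file fits comfortably in a task's heap, and every lookup after setup() is a local HashMap get rather than a trip to HDFS.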

The distributed cache also uses HDFS under the hood [2], so I don't 
think it will give you any benefit here.

> Thanks for your reply:)
>
> Ondrej Klimpera

[1] 
http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/io/MapFile.html
[2] 
http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/filecache/DistributedCache.html
-- 
Ioan Eugen Stan
http://ieugen.blogspot.com
