hadoop-common-user mailing list archives

From Todd Lipcon <t...@cloudera.com>
Subject Re: providing auxiliary data to Map
Date Thu, 28 May 2009 04:26:44 GMT
Hi Rares,
Check out the Distributed Cache: http://wiki.apache.org/hadoop/FAQ#8
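
In case it helps, here is a minimal sketch of the task-side half. With the old API, the driver registers the HDFS file via DistributedCache.addCacheFile(uri, conf), and each task finds its node-local copy via DistributedCache.getLocalCacheFiles(conf). The file is copied to each node once per job, so the map tasks read from local disk instead of all hitting the same datanode. The code below only shows loading such a local copy into memory, which is what you would do in your "setup" function. It assumes the WordCount output is the default text format (one "word<TAB>count" per line). The class name WordFrequencyTable and the method load are made up for illustration; they are not part of Hadoop.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper (not a Hadoop class): parses WordCount-style output
// into an in-memory table. In a real job the reader would wrap a FileReader
// on the local path returned by DistributedCache.getLocalCacheFiles(conf).
public class WordFrequencyTable {

    // Reads "word<TAB>count" lines; lines without a tab are skipped.
    public static Map<String, Long> load(BufferedReader reader) throws IOException {
        Map<String, Long> table = new HashMap<String, Long>();
        String line;
        while ((line = reader.readLine()) != null) {
            int tab = line.indexOf('\t');
            if (tab < 0) {
                continue; // not a key/value line
            }
            String word = line.substring(0, tab);
            long count = Long.parseLong(line.substring(tab + 1).trim());
            table.put(word, count);
        }
        return table;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the cached file's contents (assumed sample data).
        String sample = "hadoop\t3\nmap\t7\n";
        Map<String, Long> table = load(new BufferedReader(new StringReader(sample)));
        System.out.println(table.get("map"));
    }
}
```

Calling load once per task and keeping the map in a field means each map() call is just an in-memory lookup.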


On Wed, May 27, 2009 at 9:24 PM, Rares Vernica <rvernica@gmail.com> wrote:

> Dear Hadoop Users,
> I am a newcomer to the Map-Reduce world. Please excuse my ignorance.
> I have two Map-Reduce phases. The first phase is the WordCount
> example. In the second phase, besides the regular input data, the Map
> function also needs the word-frequency table produced by the first
> phase.
> Obviously, the word-frequency table is small enough to fit into
> memory. Moreover, the first phase uses only one reduce, so that all
> the data is in one file in HDFS.
> My question is, what options do I have to efficiently get the
> word-frequency table to the map function of the second phase?
> One option is to access the HDFS from the map function and read the
> file produced by the first Map-Reduce phase. More exactly, I would
> read the file in the "setup" function. With this option, the machine
> that stores this file would become a bottleneck, because when the second
> phase starts, all the map instances will access that machine to get the
> file. Is there any way to overcome this bottleneck?
> Are there any other options?
> Thank you,
> Rares
