hadoop-common-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Aaron Kimball" <aa...@cloudera.com>
Subject Re: Shared thread safe variables?
Date Wed, 24 Dec 2008 09:44:55 GMT
(Addendum to my own post -- an identity mapper is probably not what you
want. You'd actually want an inverting mapper that transforms (k, v) --> (v,
k), to take advantage of the key-based sorting.)

- Aaron

On Wed, Dec 24, 2008 at 4:43 AM, Aaron Kimball <aaron@cloudera.com> wrote:

> Hi Jim,
> The ability to perform locking of shared mutable state is a distinct
> anti-goal of the MapReduce paradigm. One of the major benefits of writing
> MapReduce programs is knowing that you don't have to worry about deadlock in
> your code. If mappers could lock objects, then the failure and restart
> semantics of individual tasks would be vastly more complicated. (What
> happens if a map task crashes after it obtains a lock? Does it automatically
> release the lock? Does some rollback mechanism undo everything that happened
> after the lock was acquired? How would that work if--by definition--the
> mapper node is no longer available?)
> A word frequency histogram function can certainly be written in MapReduce
> without such state. You've got the right intuition, but a serial program is
> not necessarily the best answer. Take the existing word count program. This
> converts bags of words into (word, count) pairs. Then pass this through a
> second pass, via an identity mapper to a set of combiners that each emit the
> 100 most frequent words, to a single reducer that emits the 100 most
> frequent words obtained by the combiners.
> Many other more complicated problems which seem to require shared state, in
> truth, only require a second (or n+1'th) MapReduce pass. Adding multiple
> passes is a very valid technique for building more complex dataflows.
> Cheers,
> - Aaron
> On Wed, Dec 24, 2008 at 3:28 AM, Jim Twensky <jim.twensky@gmail.com>wrote:
>> Hello,
>> I was wondering if Hadoop provides thread safe shared variables that can
>> be
>> accessed from individual mappers/reducers along with a proper locking
>> mechanism. To clarify things, let's say that in the word count example, I
>> want to know the word that has the highest frequency and how many times it
>> occured. I believe that the latter can be done using the counters that
>> come
>> with the Hadoop framework but I don't know how to get the word itself as a
>> String. Of course, the problem can be more complicated like the top 100
>> words or so.
>> I thought of writing a serial program which can go over the final output
>> of
>> the word count but this wouldn't be a good idea if the output file gets
>> too
>> large. However, if there is a way to define and use shared variables, this
>> would be really easy to do on the fly during the word count's reduce
>> phase.
>> Thanks,
>> Jim

  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message