From: "Jim Twensky" <jim.twensky@gmail.com>
To: core-user@hadoop.apache.org
Subject: Re: Shared thread safe variables?
Date: Wed, 24 Dec 2008 12:19:14 -0600

Hi Aaron,

Thanks for the advice. I actually thought of using multiple combiners and a
single reducer, but I was worried that the key sorting phase would be wasted
effort for my purpose. If the input is just a bunch of (word, count) pairs
on the order of terabytes, wouldn't sorting be overkill? That's why I
thought a single serial program might perform better. I'm not sure how long
it would take to sort the keys in such a case, though, so this is nothing
beyond speculation and I should probably just give it a try and see how well
it performs.

Secondly, I didn't quite understand how I can take advantage of the sorted
keys if I use an inverting mapper that transforms (k, v) --> (v, k) pairs.
In both cases, the combiners and the single reducer will still have to
iterate over all the (v, k) pairs to find the top 100, right? Or is there a
way to say something like "give me the last 100 keys" at each
reducer/combiner?
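To make the combiner-side pruning concrete, here is a minimal sketch in plain Java of the bounded selection each combiner could apply. It uses no Hadoop APIs; the class and method names are hypothetical. The point is that a combiner only ever holds N candidates in a min-heap, so the single reducer receives at most N records per combiner rather than the full (word, count) stream:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;
import java.util.TreeMap;

// Hypothetical sketch (not Hadoop code): each combiner keeps only a
// bounded min-heap of the N most frequent words it sees locally.
public class TopNCombiner {
    // One (count, word) candidate. The heap orders by count ascending,
    // so the least frequent of the current top N is always at the head.
    static class Entry {
        final long count;
        final String word;
        Entry(long count, String word) { this.count = count; this.word = word; }
    }

    public static List<Entry> topN(Map<String, Long> localCounts, int n) {
        PriorityQueue<Entry> heap =
            new PriorityQueue<>((a, b) -> Long.compare(a.count, b.count));
        for (Map.Entry<String, Long> e : localCounts.entrySet()) {
            heap.offer(new Entry(e.getValue(), e.getKey()));
            if (heap.size() > n) {
                heap.poll(); // evict the least frequent candidate
            }
        }
        return new ArrayList<>(heap); // at most n entries, in no particular order
    }

    public static void main(String[] args) {
        Map<String, Long> counts = new TreeMap<>();
        counts.put("the", 120L);
        counts.put("of", 95L);
        counts.put("hadoop", 40L);
        counts.put("mapreduce", 17L);
        for (Entry e : topN(counts, 2)) {
            System.out.println(e.word + "\t" + e.count);
        }
    }
}
```

With this shape, the expensive part is bounded by N per task, regardless of how many distinct words each combiner sees.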
Thanks in advance,
Jim

On Wed, Dec 24, 2008 at 3:44 AM, Aaron Kimball wrote:

> (Addendum to my own post -- an identity mapper is probably not what you
> want. You'd actually want an inverting mapper that transforms (k, v) -->
> (v, k), to take advantage of the key-based sorting.)
>
> - Aaron
>
> On Wed, Dec 24, 2008 at 4:43 AM, Aaron Kimball wrote:
>
> > Hi Jim,
> >
> > The ability to perform locking of shared mutable state is a distinct
> > anti-goal of the MapReduce paradigm. One of the major benefits of
> > writing MapReduce programs is knowing that you don't have to worry
> > about deadlock in your code. If mappers could lock objects, then the
> > failure and restart semantics of individual tasks would be vastly more
> > complicated. (What happens if a map task crashes after it obtains a
> > lock? Does it automatically release the lock? Does some rollback
> > mechanism undo everything that happened after the lock was acquired?
> > How would that work if--by definition--the mapper node is no longer
> > available?)
> >
> > A word frequency histogram function can certainly be written in
> > MapReduce without such state. You've got the right intuition, but a
> > serial program is not necessarily the best answer. Take the existing
> > word count program. This converts bags of words into (word, count)
> > pairs. Then pass this through a second pass, via an identity mapper to
> > a set of combiners that each emit the 100 most frequent words, to a
> > single reducer that emits the 100 most frequent words obtained by the
> > combiners.
> >
> > Many other more complicated problems which seem to require shared
> > state, in truth, only require a second (or n+1'th) MapReduce pass.
> > Adding multiple passes is a very valid technique for building more
> > complex dataflows.
> >
> > Cheers,
> > - Aaron
> >
> > On Wed, Dec 24, 2008 at 3:28 AM, Jim Twensky wrote:
> >
> >> Hello,
> >>
> >> I was wondering if Hadoop provides thread-safe shared variables that
> >> can be accessed from individual mappers/reducers along with a proper
> >> locking mechanism. To clarify things, let's say that in the word count
> >> example, I want to know the word that has the highest frequency and
> >> how many times it occurred. I believe that the latter can be done
> >> using the counters that come with the Hadoop framework, but I don't
> >> know how to get the word itself as a String. Of course, the problem
> >> can be more complicated, like the top 100 words or so.
> >>
> >> I thought of writing a serial program which can go over the final
> >> output of the word count, but this wouldn't be a good idea if the
> >> output file gets too large. However, if there is a way to define and
> >> use shared variables, this would be really easy to do on the fly
> >> during the word count's reduce phase.
> >>
> >> Thanks,
> >> Jim
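The second pass Aaron describes can be sketched without any Hadoop machinery: invert each (word, count) into (count, word), let the sort order the inverted keys descending, and have the reducer stop after the first N records it sees. A minimal plain-Java simulation of that dataflow (class and method names are illustrative, not Hadoop APIs):

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch (not Hadoop code): simulates the inverting second
// pass, where sorting by count lets the reducer take the top N directly.
public class InvertAndTake {
    public static List<String> topWords(Map<String, Long> wordCounts, int n) {
        // "Map" step: invert (word, count) -> (count, word).
        List<SimpleEntry<Long, String>> inverted = new ArrayList<>();
        for (Map.Entry<String, Long> e : wordCounts.entrySet()) {
            inverted.add(new SimpleEntry<>(e.getValue(), e.getKey()));
        }
        // "Shuffle" step: the framework sorts by key; a descending
        // comparator puts the highest counts first.
        inverted.sort((a, b) -> Long.compare(b.getKey(), a.getKey()));
        // "Reduce" step: emit the first n records and stop early. This is
        // the "give me the last 100 keys" effect -- no shared state needed.
        List<String> top = new ArrayList<>();
        for (SimpleEntry<Long, String> e : inverted) {
            if (top.size() == n) break;
            top.add(e.getValue());
        }
        return top;
    }

    public static void main(String[] args) {
        Map<String, Long> counts = new LinkedHashMap<>();
        counts.put("the", 120L);
        counts.put("of", 95L);
        counts.put("hadoop", 40L);
        counts.put("mapreduce", 17L);
        System.out.println(topWords(counts, 2)); // prints [the, of]
    }
}
```

In real Hadoop this descending order would come from a custom key comparator on the inverted job, and the early stop from the reducer simply ignoring input after its first N records.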