From: "Jim Twensky" <jim.twensky@gmail.com>
To: core-user@hadoop.apache.org
Subject: Re: Shared thread safe variables?
Date: Wed, 24 Dec 2008 12:19:14 -0600

Hi Aaron,

Thanks for the advice. I actually thought of using multiple combiners and a
single reducer, but I was worried that the key sorting phase would be wasted
effort for my purpose. If the input is just a bunch of (word, count) pairs
on the order of terabytes, wouldn't sorting be overkill? That's why I
thought a single serial program might perform better. I'm not sure how long
it would take to sort the keys in such a case, though, so this is nothing
beyond speculation and I should probably just give it a try and see how well
it performs.

Secondly, I didn't quite understand how I can take advantage of the sorted
keys if I use an inverting mapper that transforms (k, v) --> (v, k) pairs.
In both cases, the combiners and the single reducer will still have to
iterate over all the (v, k) pairs to find the top 100, right? Or is there a
way to say something like "give me the last 100 keys" at each
reducer/combiner?
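To make the combiner-side pruning concrete, here is a minimal sketch in plain Java of the bounded selection each combiner could apply. It uses no Hadoop APIs; the class and method names are hypothetical. The point is that a combiner only ever holds N candidates in a min-heap, so the single reducer receives at most N records per combiner rather than the full (word, count) stream:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;
import java.util.TreeMap;

// Hypothetical sketch (not Hadoop code): each combiner keeps only a
// bounded min-heap of the N most frequent words it sees locally.
public class TopNCombiner {
    // One (count, word) candidate. The heap orders by count ascending,
    // so the least frequent of the current top N is always at the head.
    static class Entry {
        final long count;
        final String word;
        Entry(long count, String word) { this.count = count; this.word = word; }
    }

    public static List<Entry> topN(Map<String, Long> localCounts, int n) {
        PriorityQueue<Entry> heap =
            new PriorityQueue<>((a, b) -> Long.compare(a.count, b.count));
        for (Map.Entry<String, Long> e : localCounts.entrySet()) {
            heap.offer(new Entry(e.getValue(), e.getKey()));
            if (heap.size() > n) {
                heap.poll(); // evict the least frequent candidate
            }
        }
        return new ArrayList<>(heap); // at most n entries, in no particular order
    }

    public static void main(String[] args) {
        Map<String, Long> counts = new TreeMap<>();
        counts.put("the", 120L);
        counts.put("of", 95L);
        counts.put("hadoop", 40L);
        counts.put("mapreduce", 17L);
        for (Entry e : topN(counts, 2)) {
            System.out.println(e.word + "\t" + e.count);
        }
    }
}
```

With this shape, the expensive part is bounded by N per task, regardless of how many distinct words each combiner sees.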
Thanks in advance,
Jim

On Wed, Dec 24, 2008 at 3:44 AM, Aaron Kimball wrote:

> (Addendum to my own post -- an identity mapper is probably not what you
> want. You'd actually want an inverting mapper that transforms (k, v) -->
> (v, k), to take advantage of the key-based sorting.)
>
> - Aaron
>
> On Wed, Dec 24, 2008 at 4:43 AM, Aaron Kimball wrote:
>
> > Hi Jim,
> >
> > The ability to perform locking of shared mutable state is a distinct
> > anti-goal of the MapReduce paradigm. One of the major benefits of
> > writing MapReduce programs is knowing that you don't have to worry
> > about deadlock in your code. If mappers could lock objects, then the
> > failure and restart semantics of individual tasks would be vastly more
> > complicated. (What happens if a map task crashes after it obtains a
> > lock? Does it automatically release the lock? Does some rollback
> > mechanism undo everything that happened after the lock was acquired?
> > How would that work if--by definition--the mapper node is no longer
> > available?)
> >
> > A word frequency histogram function can certainly be written in
> > MapReduce without such state. You've got the right intuition, but a
> > serial program is not necessarily the best answer. Take the existing
> > word count program. This converts bags of words into (word, count)
> > pairs. Then pass this through a second pass, via an identity mapper to
> > a set of combiners that each emit the 100 most frequent words, to a
> > single reducer that emits the 100 most frequent words obtained by the
> > combiners.
> >
> > Many other more complicated problems which seem to require shared
> > state, in truth, only require a second (or n+1'th) MapReduce pass.
> > Adding multiple passes is a very valid technique for building more
> > complex dataflows.
> >
> > Cheers,
> > - Aaron
> >
> > On Wed, Dec 24, 2008 at 3:28 AM, Jim Twensky wrote:
> >
> >> Hello,
> >>
> >> I was wondering if Hadoop provides thread-safe shared variables that
> >> can be accessed from individual mappers/reducers along with a proper
> >> locking mechanism. To clarify things, let's say that in the word count
> >> example, I want to know the word that has the highest frequency and
> >> how many times it occurred. I believe that the latter can be done
> >> using the counters that come with the Hadoop framework, but I don't
> >> know how to get the word itself as a String. Of course, the problem
> >> can be more complicated, like the top 100 words or so.
> >>
> >> I thought of writing a serial program which can go over the final
> >> output of the word count, but this wouldn't be a good idea if the
> >> output file gets too large. However, if there is a way to define and
> >> use shared variables, this would be really easy to do on the fly
> >> during the word count's reduce phase.
> >>
> >> Thanks,
> >> Jim
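The second pass Aaron describes can be sketched without any Hadoop machinery: invert each (word, count) into (count, word), let the sort order the inverted keys descending, and have the reducer stop after the first N records it sees. A minimal plain-Java simulation of that dataflow (class and method names are illustrative, not Hadoop APIs):

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch (not Hadoop code): simulates the inverting second
// pass, where sorting by count lets the reducer take the top N directly.
public class InvertAndTake {
    public static List<String> topWords(Map<String, Long> wordCounts, int n) {
        // "Map" step: invert (word, count) -> (count, word).
        List<SimpleEntry<Long, String>> inverted = new ArrayList<>();
        for (Map.Entry<String, Long> e : wordCounts.entrySet()) {
            inverted.add(new SimpleEntry<>(e.getValue(), e.getKey()));
        }
        // "Shuffle" step: the framework sorts by key; a descending
        // comparator puts the highest counts first.
        inverted.sort((a, b) -> Long.compare(b.getKey(), a.getKey()));
        // "Reduce" step: emit the first n records and stop early. This is
        // the "give me the last 100 keys" effect -- no shared state needed.
        List<String> top = new ArrayList<>();
        for (SimpleEntry<Long, String> e : inverted) {
            if (top.size() == n) break;
            top.add(e.getValue());
        }
        return top;
    }

    public static void main(String[] args) {
        Map<String, Long> counts = new LinkedHashMap<>();
        counts.put("the", 120L);
        counts.put("of", 95L);
        counts.put("hadoop", 40L);
        counts.put("mapreduce", 17L);
        System.out.println(topWords(counts, 2)); // prints [the, of]
    }
}
```

In real Hadoop this descending order would come from a custom key comparator on the inverted job, and the early stop from the reducer simply ignoring input after its first N records.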