From: Ted Dunning
Date: Wed, 19 Sep 2007 14:24:54 -0700
Subject: Re: docs on combining output from multiple map reduces please?
Reply-To: hadoop-user@lucene.apache.org

You need to have a tag on the different counts. To make your example more
specific, suppose the input contains lines of text from different genres.

One option is to tag the data when you do the count, so that your input
would be:

    line of text*

The map output would be

    (genre, word), 1

and the reduce output would be

    (genre, word), cnt

Then your final comparison step would use word as the key, and the reduce
would see (genre, count) pairs as values for each word. A simple statistical
comparison (see http://citeseer.ist.psu.edu/29096.html for one method) would
give you the degree of shift for each word.

Tagging with genre at the original time of counting might be inconvenient.
A reasonable alternative is to store the counts in directories that indicate
genre, and then use the input file name in the final map to add the genre
key.

You might also want to process all of your documents at once and use a table
that determines what genre each input file is. To do this, you would use a
variant of TextInputFormat that gives you the file name as the key, and in
the map function you would look up the genre for each record.

On 9/19/07 1:17 PM, "kate rhodes" wrote:

> I'm sure there's a doc on this somewhere and if someone can point me
> to it I'd be quite grateful:
>
> what I'm looking to do is analyze the output of n prior MR runs and
> then see where the same thing showed up in all of them.
>
> For example: do a word count run on sci-fi books, and then at some
> point later, a run on romance novels then, at some later point in the
> future, go back and find all the statistically significant words that
> appeared in both. These are three totally separate MR runs.
>
> I'm sure this is a common, and easy to handle situation, I just don't
> have my head around Hadoop enough yet to know what I need to be
> searching for to get the right docs.
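
To make the data flow in the reply concrete, here is a rough sketch in plain Python rather than Hadoop code. It simulates the tagged count, the re-keying on word, and a per-word comparison. Everything named here is hypothetical: the function names, the sample records, and the counts/<genre>/part-00000 directory layout are made up for illustration. The comparison uses a log-likelihood ratio (G^2) on a 2x2 table, which is an assumption about the kind of statistic meant; the reply only links to one method without naming it.

```python
import math
from collections import Counter, defaultdict
from pathlib import PurePosixPath

# Hypothetical sample input: each record is already tagged with its genre.
RECORDS = [
    ("scifi", "the robot saw the star"),
    ("scifi", "the star ship"),
    ("romance", "the heart saw the star"),
]

def map_count(genre, line):
    """First-pass map: emit ((genre, word), 1) for every word in the line."""
    for word in line.lower().split():
        yield (genre, word), 1

def reduce_count(pairs):
    """First-pass reduce: sum the ones for each (genre, word) key."""
    counts = Counter()
    for key, one in pairs:
        counts[key] += one
    return counts

def regroup_by_word(counts):
    """Final-pass map and shuffle: re-key on word, collecting genre -> count."""
    by_word = defaultdict(dict)
    for (genre, word), cnt in counts.items():
        by_word[word][genre] = cnt
    return by_word

def g2(k1, n1, k2, n2):
    """G^2 for a word seen k1 times in n1 tokens versus k2 times in n2 tokens."""
    def term(observed, expected):
        # A zero observed cell contributes nothing to the sum.
        return observed * math.log(observed / expected) if observed > 0 else 0.0
    p = (k1 + k2) / (n1 + n2)  # pooled rate under "no difference between genres"
    return 2.0 * (term(k1, n1 * p) + term(n1 - k1, n1 * (1 - p))
                  + term(k2, n2 * p) + term(n2 - k2, n2 * (1 - p)))

def compare(records, genre_a, genre_b):
    """Run both passes in memory and score every word for the two genres."""
    counts = reduce_count(kv for genre, line in records
                          for kv in map_count(genre, line))
    totals = Counter()
    for (genre, _word), cnt in counts.items():
        totals[genre] += cnt
    return {
        word: g2(per_genre.get(genre_a, 0), totals[genre_a],
                 per_genre.get(genre_b, 0), totals[genre_b])
        for word, per_genre in regroup_by_word(counts).items()
    }

def genre_from_path(path):
    """Directory-based tagging: assume the parent directory name is the genre."""
    return PurePosixPath(path).parent.name

if __name__ == "__main__":
    for word, score in sorted(compare(RECORDS, "scifi", "romance").items()):
        print(word, round(score, 3))
    print(genre_from_path("counts/scifi/part-00000"))
```

In a real job the first three functions would be separate map and reduce steps, and genre_from_path stands in for reading the input file name inside the final map. A larger score means the word's rate differs more between the two genres; a score near zero means the rates are about the same.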