Date: Wed, 19 Sep 2007 14:24:54 -0700
Subject: Re: docs on combining output from multiple map reduces please?
From: Ted Dunning <tdunning@veoh.com>
To: <hadoop-user@lucene.apache.org>


You need to tag the counts with the genre they came from.  To make your
example more specific, suppose the input contains lines of text from
different genres.  One option is to tag the data when you do the count, so
that your input would be:

  line of text*

The map output would be

  (genre, word), 1

The reduce output would be

  (genre, word), cnt
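
As a rough illustration of the counting step above, here is a minimal sketch
in plain Python rather than actual Hadoop code.  It assumes each input record
already carries its genre tag; the names map_line and reduce_counts and the
sample records are hypothetical.

```python
from collections import defaultdict

def map_line(genre, line):
    """Emit ((genre, word), 1) for every word in one line of text."""
    for word in line.lower().split():
        yield (genre, word), 1

def reduce_counts(pairs):
    """Sum the values for each (genre, word) key, as the reduce would."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

# Hypothetical tagged input records: (genre, line of text).
records = [
    ("scifi", "the ship left the station"),
    ("romance", "the letter never arrived"),
]

# Run the map over every record, then reduce the emitted pairs.
pairs = [kv for genre, line in records for kv in map_line(genre, line)]
counts = reduce_counts(pairs)
```

The result maps each (genre, word) key to its count, which is the shape the
final comparison step expects.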

Then your final comparison step would use the word as the key, and the
reduce would see (genre, count) pairs as the values for each word.  A simple
statistical comparison (see http://citeseer.ist.psu.edu/29096.html for one
method) would give you the degree of shift for each word.
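
The comparison step could be sketched as follows, again in plain Python.  The
scoring function is a log-likelihood ratio (G^2) test on a 2x2 table, which is
one common choice for this kind of comparison; using it here is an assumption
and it is not necessarily the method described at the link above.  The
function names and the sample counts are hypothetical.

```python
import math
from collections import defaultdict

def regroup_by_word(counts):
    """Turn {(genre, word): cnt} into {word: {genre: cnt}}."""
    by_word = defaultdict(dict)
    for (genre, word), cnt in counts.items():
        by_word[word][genre] = cnt
    return dict(by_word)

def _xlogx(x):
    return x * math.log(x) if x > 0 else 0.0

def llr(k11, k12, k21, k22):
    """G^2 statistic for a 2x2 contingency table of counts."""
    n = k11 + k12 + k21 + k22
    rows = _xlogx(k11 + k12) + _xlogx(k21 + k22)
    cols = _xlogx(k11 + k21) + _xlogx(k12 + k22)
    cells = _xlogx(k11) + _xlogx(k12) + _xlogx(k21) + _xlogx(k22)
    return max(0.0, 2.0 * (cells + _xlogx(n) - rows - cols))

def shift_scores(counts, genre_a, genre_b):
    """Score every word by how differently it is used in the two genres."""
    total_a = sum(c for (g, _), c in counts.items() if g == genre_a)
    total_b = sum(c for (g, _), c in counts.items() if g == genre_b)
    scores = {}
    for word, per_genre in regroup_by_word(counts).items():
        a = per_genre.get(genre_a, 0)
        b = per_genre.get(genre_b, 0)
        # Table: this word vs. all other words, in genre A vs. genre B.
        scores[word] = llr(a, b, total_a - a, total_b - b)
    return scores

# Hypothetical reduce output from two earlier runs.
counts = {
    ("scifi", "ship"): 10, ("scifi", "the"): 10,
    ("romance", "love"): 10, ("romance", "the"): 10,
}
scores = shift_scores(counts, "scifi", "romance")
```

A word used at the same rate in both genres scores near zero, and a word that
shows up mostly in one genre scores higher.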

Tagging with genre at the time of the original count might be inconvenient.
A reasonable alternative is to store each run's counts in a directory whose
name indicates the genre.  The final map would then use the input file name
to add the genre key.
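
A small sketch of that final map, in plain Python.  It assumes a hypothetical
layout such as /counts/<genre>/part-00000 and assumes each line of the count
files is "word<TAB>count"; neither detail comes from this thread, and how the
file name reaches the map function is left out.

```python
from pathlib import PurePosixPath

def genre_from_path(path):
    """Return the name of the directory that directly contains the file."""
    return PurePosixPath(path).parent.name

def final_map(path, line):
    """Re-key one 'word<TAB>count' line as word -> (genre, count)."""
    word, cnt = line.rstrip("\n").split("\t")
    return word, (genre_from_path(path), int(cnt))

# Hypothetical input: one line from the sci-fi run's output directory.
key, value = final_map("/counts/scifi/part-00000", "ship\t10\n")
```

Here key is the word and value is the (genre, count) pair the reduce would
receive.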

You might also want to process all of your documents at once and use a table
that maps each input file to its genre.  To do this, you would use a variant
of TextInputFormat that gives you the file name as the key, and in the map
function you would look up the genre for each record.
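
The table lookup might look roughly like this in plain Python.  The table
contents, the file names, and the map_record name are all hypothetical, and
the sketch simply assumes the input format has already handed the map the
file name as its key.

```python
# Hypothetical table mapping each input file to its genre.
GENRE_BY_FILE = {
    "book_001.txt": "scifi",
    "book_002.txt": "romance",
}

def map_record(file_name, line, genre_by_file=GENRE_BY_FILE):
    """Look up the genre for this file and emit ((genre, word), 1) pairs."""
    genre = genre_by_file[file_name]
    for word in line.lower().split():
        yield (genre, word), 1

# One record as the map would see it: file name as key, line as value.
emitted = list(map_record("book_001.txt", "The ship"))
```

A file missing from the table raises a KeyError here; a real job would need
to decide whether to skip such records or fail.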


On 9/19/07 1:17 PM, "kate rhodes" <masukomi@gmail.com> wrote:

> I'm sure there's a doc on this somewhere and if someone can point me
> to it I'd be quite grateful:
> 
> what I'm looking to do is analyze the output of n prior MR runs and
> then see where the same thing showed up in all of them.
> 
> For example: do a word count run on sci-fi books, and then at some
> point later, a run on romance novels then, at some later point in the
> future, go back and find all the statistically significant words that
> appeared in both. These are three totally separate MR runs.
> 
> I'm sure this is a common, and easy to handle situation, I just don't
> have my head around Hadoop enough yet to know what I need to be
> searching for to get the right docs.
>