hadoop-mapreduce-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Geoffry Roberts <geoffry.robe...@gmail.com>
Subject Re: Number of Reducers Set to One
Date Thu, 12 May 2011 20:11:21 GMT

Thanks for such a thoughtful response.

I have a data set that represents all the people that pass through Las Vegas
over a course of time, say five years, which comes to about 175 - 200
million people.   Each record is a person, and it contains fields for where
they came from, left to; times of arrival and departure, and infectious
state for a given disease, let's say its influenza.

Playing a what if game, I need to compute the probable disease state of each
person upon departure.  To do this, one thing -- one on a number of things
-- I must do keep a running total of how many people are present in Las
Vegas at any moment in time.

I was thinking that running everything through a single reducer would be the
best way of accomplishing this.  Your side effect file: Each reducer write
out its accumulation to its own file then you merge these together into one
big accumulation. Right?

On 12 May 2011 11:51, Robert Evans <evans@yahoo-inc.com> wrote:

>  Geoffry,
> That really depends on how much data you are processing, and the algorithm
> you need to use to process the data.  I did something similar a while ago
> with a medium amount of data and we saw significant speed up by first
> assigning each record a new key based off of the expected range of keys and
> just dividing them up equally into N buckets (our data was very evenly
> distributed and we know the range of the data, but terrasort does something
> very similar, and has tools to help if your data is not distributed evenly).
>  Each bucket was then sent to a different reducer, and as we output the now
> sorted data, we also accumulated some metrics about the data as we saw it go
> by.  We output these metrics to a side effect file, one per bucket.  Because
> the number and size of these side effect files was rather small we were then
> able to pull all of them into all map processes as a cache archive to do a
> second pass over the data and then calculate the accumulated values for this
> particular record based off of all buckets before the current one, and all
> records in this bucket before the current record (had to make sure the map
> did not split the sorted file that is the output of the previous reduce).
>  It does not follow a pure map/reduce paradigm, and it took a little bit of
> math to figure out exactly what values we had to accumulate and convince
> ourselves that it would produce the same results, but it worked for us.  I
> am not familiar with what you are trying to compute but it could work for
> you too.
> --Bobby Evans
> On 5/12/11 12:44 PM, "Geoffry Roberts" <geoffry.roberts@gmail.com> wrote:
> All,
> I am mostly seeking confirmation as to my thinking on this matter.
> I have an MR job that I believe will force me into using a single reducer.
> The nature of the process is one where calculations performed on a given
> record rely on certain accumulated values whose calculation depends on
> rolling values from all prior records.  An ultra simple example of this
> would be a balance forward situation.  (I'm not doing accounting I'm doing
> epidemiology, but the concept is the same.)
> Is a single reducer the best way to go in this?
> Thanks

Geoffry Roberts

View raw message