hadoop-mapreduce-user mailing list archives

From "Berry, Matt" <mwbe...@amazon.com>
Subject RE: Map Reduce Theory Question, getting OutOfMemoryError while reducing
Date Thu, 28 Jun 2012 23:20:26 GMT
My end goal is to have all the records sorted chronologically, regardless of the source file.
To present it formally:

Let there be X servers.
Let each server produce one chronological log file that records who operated on the server
and when.
Let there be Y users.
Assume a given user can operate on any number of servers simultaneously.
Assume a given user can perform any number of operations a second.

My goal is to have Y output files, one per user, each containing only that user's records,
sorted chronologically.
So, working backwards from the output:

In order for records to be written chronologically to the file:
- All records for a given user must arrive at the same reducer (or the file IO will mess with
the order)
- All records arriving at a given reducer must be chronological with respect to a given user

In order for records to arrive at a reducer in chronological order with respect to a given user:
- The sorter must be set to sort by time and operate over all records for a user

In order for the sorter to operate over all records for a user:
- The grouper must be set to group by user, or not group at all (each record is a group)

In order for all records for a given user to arrive at the same reducer:
- The partitioner must be set to partition by user (i.e., user number mod number of partitions)
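
A minimal sketch of such a partitioner, assuming the map output key is the user number as a
LongWritable and the value is the raw record as Text (both assumptions, not from the actual job):

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Routes every record for a given user to the same reduce partition,
// i.e. user number mod number of partitions, as described above.
public class UserPartitioner extends Partitioner<LongWritable, Text> {
    @Override
    public int getPartition(LongWritable userId, Text record, int numPartitions) {
        // Mask the sign bit so any user number maps to a valid partition index.
        return (int) ((userId.get() & Long.MAX_VALUE) % numPartitions);
    }
}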

From this vantage point I see two possible ways to do this.
1. Set the Key to be the user number, set the grouper to group by key. This results in all
records for a user being aggregated (very large)
2. Set the Key to be {user number, time}, set the grouper to group by key. This results in
each record being emitted to the reducer one at a time (lots of overhead)

Neither of those seems very favorable. Is anyone aware of a different means to achieve that
goal?
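
(For reference, a rough sketch of the {user number, time} composite key from option 2; the class
and field names are illustrative only, and a matching partitioner and grouping comparator that
look at the user number alone would still be needed.)

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

// A {user number, time} key: partition and group on userId, sort on (userId, timestamp).
public class UserTimeKey implements WritableComparable<UserTimeKey> {
    private long userId;
    private long timestamp;   // epoch millis of the log record

    public UserTimeKey() {}

    public UserTimeKey(long userId, long timestamp) {
        this.userId = userId;
        this.timestamp = timestamp;
    }

    public long getUserId() { return userId; }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeLong(userId);
        out.writeLong(timestamp);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        userId = in.readLong();
        timestamp = in.readLong();
    }

    @Override
    public int compareTo(UserTimeKey other) {
        // User first, then chronological order within a user.
        if (userId != other.userId) {
            return userId < other.userId ? -1 : 1;
        }
        if (timestamp != other.timestamp) {
            return timestamp < other.timestamp ? -1 : 1;
        }
        return 0;
    }
}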


From: Steve Lewis [mailto:lordjoe2000@gmail.com] 
Sent: Thursday, June 28, 2012 3:43 PM
To: mapreduce-user@hadoop.apache.org
Subject: Re: Map Reduce Theory Question, getting OutOfMemoryError while reducing

It is NEVER a good idea to hold items in memory - after all, this is big data and you want it to scale.
I do not see what stops you from reading one record, processing it, and writing it out without retaining it.
It is OK to keep statistics while iterating through a key and output them at the end, but holding
all values for a key is almost never a good idea unless you can guarantee limits on those values.
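
(For reference, a minimal sketch of that streaming pattern; the class name and the key/value
types below are assumptions, not from the original job.)

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Streams each value straight to output; only a constant-size statistic is retained.
public class StreamingLogReducer extends Reducer<LongWritable, Text, LongWritable, Text> {
    @Override
    protected void reduce(LongWritable userId, Iterable<Text> records, Context context)
            throws IOException, InterruptedException {
        long count = 0;
        for (Text record : records) {
            context.write(userId, record);   // emit immediately, never buffer
            count++;
        }
        // Per-key statistics can be reported without holding any values in memory.
        context.getCounter("logs", "recordsPerUser").increment(count);
    }
}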
On Thu, Jun 28, 2012 at 2:37 PM, Berry, Matt <mwberry@amazon.com> wrote:
I have a MapReduce job that reads in several gigs of log files and separates the records based
on who generated them. My MapReduce job looks like this:
-- 
Steven M. Lewis PhD
4221 105th Ave NE
Kirkland, WA 98033
206-384-1340 (cell)
Skype lordjoe_com

