hadoop-common-user mailing list archives

From "Patterson, Josh" <jpatters...@tva.gov>
Subject RE: RecordReader design heuristic
Date Tue, 17 Mar 2009 21:38:02 GMT
So if I'm hearing you right, it's "good" to send one point of data (10
bytes here) to a single mapper? This mindset increases the number of
mappers, but keeps their logic scaled down to simply "look at this
record and emit/don't emit" --- which is considered more favorable? I'm
still getting the hang of the MR design tradeoffs, thanks for your input.
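
The "emit/don't emit" decision described above can be sketched in plain Java. The 10-byte layout used here (4-byte timestamp, 4-byte float value, 2-byte quality flags) is purely an assumption for illustration, not the actual legacy format; in a real job this logic would live inside the Mapper's map() method:

```java
import java.nio.ByteBuffer;

// Sketch: decode one hypothetical 10-byte point and decide whether to emit it.
public class PointFilterSketch {
    static final int RECORD_BYTES = 10;

    // The mapper's whole job under the one-point-per-call design:
    // look at this record, emit or don't emit.
    static boolean shouldEmit(byte[] record, float threshold) {
        ByteBuffer buf = ByteBuffer.wrap(record);
        int timestamp = buf.getInt();   // assumed: seconds since epoch
        float value = buf.getFloat();   // assumed: the measured value
        short flags = buf.getShort();   // assumed: nonzero means bad quality
        return flags == 0 && value >= threshold;
    }

    // Helper to build a record with the assumed layout (for testing the sketch).
    static byte[] encode(int ts, float value, short flags) {
        return ByteBuffer.allocate(RECORD_BYTES)
                .putInt(ts).putFloat(value).putShort(flags).array();
    }
}
```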

Josh Patterson

-----Original Message-----
From: Jeff Eastman [mailto:jdog@windwardsolutions.com] 
Sent: Tuesday, March 17, 2009 5:11 PM
To: core-user@hadoop.apache.org
Subject: Re: RecordReader design heuristic

If you send a single point to the mapper, your mapper logic will be 
clean and simple. Otherwise you will need to loop over your block of 
points in the mapper. In Mahout clustering, I send the mapper individual 
points because the input file is point-per-line. In either case, the 
record reader will be iterating over a block of data to provide mapper 
inputs. IIRC, splits will generally be an HDFS block or less, so if you 
have files smaller than that you will get one mapper per file. For larger 
files you can get up to one mapper per split block.
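
The record-reader iteration described above can be sketched as a fixed-length loop over the split: each pass corresponds to one next() call handing one record to the mapper. The 10-byte record size follows the thread; everything else is assumed:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the RecordReader loop: keep producing fixed-size records
// until the split is exhausted, as next() would until it returns false.
public class FixedRecordReaderSketch {
    static final int RECORD_BYTES = 10;

    static List<byte[]> readAll(byte[] split) {
        List<byte[]> records = new ArrayList<>();
        int pos = 0;
        // Stop when no complete record remains before the split end.
        while (pos + RECORD_BYTES <= split.length) {
            byte[] rec = new byte[RECORD_BYTES];
            System.arraycopy(split, pos, rec, 0, RECORD_BYTES);
            records.add(rec);
            pos += RECORD_BYTES;
        }
        return records;
    }
}
```

A real implementation would track pos against the FileSplit's start and length rather than a byte array, but the iteration shape is the same.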


Patterson, Josh wrote:
> I am currently working on a RecordReader to read a custom time series
> data binary file format and was wondering about ways to be most
> efficient in designing the InputFormat/RecordReader process. Reading
> through:
> http://wiki.apache.org/hadoop/HadoopMapReduce
> gave me a lot of hints about how the various classes work together in
> order to read any type of file. I was looking at how the TextInputFormat
> uses the LineRecordReader in order to send individual lines to each
> mapper. My question is, what is a good heuristic in how to choose how
> much data to send to each mapper? With the stock LineRecordReader each
> mapper only gets to work with a single line which leads me to believe
> that we want to give each mapper very little work. Currently I'm looking
> at either sending each mapper a single point of data (10 bytes), which
> seems small, or sending a single mapper a block of data (around 819
> points, at 10 bytes each, ---> 8190 bytes). I'm leaning towards sending
> the block to the mapper.
> These factors are based around dealing with a legacy file format (for
> now) so I'm just trying to make the best tradeoff possible for the short
> term until I get some basic stuff rolling, at which point I can design
> a better storage format, or just start converting the groups of stored
> points into a format more fitting for the platform. I understand that
> the InputFormat is not really trying to make much meaning out of the
> data, other than to help assist in getting the correct data out of the
> file based on the file split variables. Another question I have is, on
> a pretty much stock install, generally how big is each FileSplit?
> Josh Patterson
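
On the FileSplit-size question above: in the Hadoop of this era, FileInputFormat clamps the split size against the HDFS block size (64 MB by default), which is why a stock install usually yields one split per block. A sketch of that rule (signature simplified for illustration, not the actual Hadoop method):

```java
// Sketch of FileInputFormat's split sizing: splits default to one HDFS
// block unless a larger minimum or a smaller goal size is configured.
// goalSize is roughly totalInputSize / requested number of maps.
public class SplitSizeSketch {
    static long splitSize(long goalSize, long minSize, long blockSize) {
        return Math.max(minSize, Math.min(goalSize, blockSize));
    }
}
```

With the defaults (minSize of 1 byte, a large goal size, 64 MB blocks) this comes out to exactly one block per split.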
