hadoop-common-user mailing list archives

From Andrew Nguyen <andrew-lists-had...@ucsfcti.org>
Subject Splitting input for mapper and contiguous data
Date Fri, 16 Apr 2010 19:01:33 GMT
As I may have mentioned, my main goal currently is the processing of physiologic data using
Hadoop and MapReduce.  The steps are:

1. Convert ADC units to physical units (input is <sample num, raw value>, output is <sample
   num, physical value>).
2. Perform a peak detection to detect the systolic blood pressure (input is <sample num, physical
   value>, output is <sample num, physical value>, but the output is only a subset of
   the input).
3. Calculate the central tendency measure using a sliding window (mapper input is <sample
   num, physical value>, mapper output is <window ID, (sample num, physical value)>,
   reducer output is <window ID, central tendency measurement at different radii>).
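
To make the first two steps concrete, here is a minimal Python sketch of what each map-only
pass might do on a handful of records. It is not the actual job code: the linear gain/offset
calibration, the strict-local-maximum peak criterion, and the names adc_to_physical and
detect_peaks are all assumptions made for illustration.

```python
def adc_to_physical(sample_num, raw, gain=0.25, offset=-10.0):
    """Step 1: <sample num, raw value> -> <sample num, physical value>.

    gain and offset are made-up calibration constants (assumption).
    """
    return sample_num, raw * gain + offset


def detect_peaks(samples):
    """Step 2: keep only samples that are strict local maxima.

    This simple criterion is an assumption; the output is a subset of the input.
    """
    peaks = []
    for prev, cur, nxt in zip(samples, samples[1:], samples[2:]):
        if cur[1] > prev[1] and cur[1] > nxt[1]:
            peaks.append(cur)
    return peaks


raw = [(0, 400), (1, 520), (2, 480), (3, 450), (4, 530), (5, 500)]
physical = [adc_to_physical(n, v) for n, v in raw]
peaks = detect_peaks(physical)
print(peaks)
```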

Each of the above steps builds upon the result of the previous one.  So, for the first two steps,
I have been doing everything in the mapper and specified 0 reduce tasks.  In the last step, I
perform calculations on a sliding window of N points, skipping forward M points for
the next window.  N is >> M.  To implement this, I have a mapper that outputs all
of the x,y points (the value) for a particular key (the window ID).  The reducer then performs
the calculations on each window's data.  Everything works pretty well, except that I noticed
the splitting of the input across different mappers changes the final output.  Due to the
nature of the calculations, the difference in the end result is small.
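
The windowing step described above can be sketched roughly as follows, in plain Python rather
than the Hadoop API. The function names are hypothetical, and the central tendency measure is
assumed here to be the fraction of second-order-difference points that fall within a given
radius of the origin; the real calculation may differ.

```python
import math
from collections import defaultdict


def window_mapper(points, n, m):
    """Emit (window_id, point) for each window of n contiguous points,
    starting a new window every m points. The window ID is the sample
    num of the first point in the window."""
    for start in range(0, len(points) - n + 1, m):
        window_id = points[start][0]
        for point in points[start:start + n]:
            yield window_id, point


def ctm(values, radius):
    """Assumed definition: share of consecutive-difference pairs
    (b - a, c - b) lying within `radius` of the origin."""
    inside = 0
    for a, b, c in zip(values, values[1:], values[2:]):
        if math.hypot(c - b, b - a) < radius:
            inside += 1
    return inside / (len(values) - 2)


def window_reducer(window_id, points, radii):
    """Compute the measure for one window at each radius."""
    values = [v for _, v in sorted(points)]
    return window_id, {r: ctm(values, r) for r in radii}


points = [(10, 120.0), (25, 121.0), (41, 119.0),
          (55, 125.0), (70, 124.0), (86, 123.0)]

# Stand-in for the shuffle: group mapper output by window ID.
grouped = defaultdict(list)
for wid, pt in window_mapper(points, n=4, m=2):
    grouped[wid].append(pt)

results = dict(window_reducer(w, pts, radii=[2.0, 5.0])
               for w, pts in grouped.items())
print(results)
```

Because N is >> M, each point is emitted once for every window it belongs to, so the mapper
output is roughly N/M times the size of its input.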

However, I'm trying to make sure I understand everything properly, and I want to see if there
is a better/proper way of implementing something like this.  I'm guessing the problem comes
from the fact that I'm trying to use contiguous data points to create a window of N points.
The window ID is just the first sample num encountered for the window.  As a result, in every
map task but the first, the first sample num encountered differs from what a serial execution
would see.
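
A small simulation of this effect, assuming each map task sees only its own split and starts
counting windows from the first record it reads. The sample numbers, the split point, and the
values of N and M are made up for illustration.

```python
def window_ids(sample_nums, n, m):
    """Window IDs a single task would produce: the first sample num of
    each window of n points, advancing m points per window."""
    return [sample_nums[s] for s in range(0, len(sample_nums) - n + 1, m)]


sample_nums = list(range(100, 120))  # 20 contiguous samples
n, m = 6, 2

# Serial execution: one task sees all of the input.
serial = window_ids(sample_nums, n, m)

# Two map tasks: the split boundary is not a multiple of m.
split_a, split_b = sample_nums[:9], sample_nums[9:]
split = window_ids(split_a, n, m) + window_ids(split_b, n, m)

print("serial:", serial)
print("split: ", split)
```

In this sketch the second task's windows start at 109, 111, 113 rather than 110, 112, 114, and
the windows that would span the split boundary are never formed at all.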

