hadoop-mapreduce-user mailing list archives

From Steve Lewis <lordjoe2...@gmail.com>
Subject Re: Is it wrong to bypass HDFS?
Date Sun, 09 Nov 2014 20:48:22 GMT
You should consider writing a custom InputFormat that reads directly from
the database. While FileInputFormat is the most common InputFormat
implementation, nothing in the InputFormat specification, or in its
critical method getSplits, requires HDFS.
A custom InputFormat can return database entries as splits, paired with a
custom RecordReader that returns the values.
For your problem, a RecordReader that reads only a single record from the
database might be very reasonable.
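
To make the split idea concrete, here is a minimal sketch in plain Java. It
is only a model of the concept, not Hadoop code: KeySplitSketch,
KeyRangeSplit, and this getSplits signature are hypothetical names invented
for illustration, and the assumption that work items are numbered 1..400 is
mine (400 comes from the figures in the thread, 100MB per job at 0.25MB per
item). A real implementation would extend Hadoop's InputFormat and
InputSplit classes and pair them with a RecordReader that fetches each
record from the database.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java model of the idea; these are NOT Hadoop classes.
class KeySplitSketch {

    // Stands in for an InputSplit: a range of database keys, no file involved.
    static final class KeyRangeSplit {
        final long firstKey; // inclusive
        final long lastKey;  // inclusive

        KeyRangeSplit(long firstKey, long lastKey) {
            this.firstKey = firstKey;
            this.lastKey = lastKey;
        }

        // Number of keys covered by this split.
        long length() {
            return lastKey - firstKey + 1;
        }
    }

    // Stands in for getSplits: partitions the keys [minKey, maxKey] into
    // consecutive ranges holding at most keysPerSplit keys each.
    static List<KeyRangeSplit> getSplits(long minKey, long maxKey, long keysPerSplit) {
        if (keysPerSplit < 1) {
            throw new IllegalArgumentException("keysPerSplit must be at least 1");
        }
        List<KeyRangeSplit> splits = new ArrayList<>();
        for (long start = minKey; start <= maxKey; start += keysPerSplit) {
            long end = Math.min(start + keysPerSplit - 1, maxKey);
            splits.add(new KeyRangeSplit(start, end));
        }
        return splits;
    }

    public static void main(String[] args) {
        // Assumed key range 1..400, one record per split as suggested above.
        List<KeyRangeSplit> splits = getSplits(1, 400, 1);
        System.out.println(splits.size() + " splits");
    }
}
```

With one key per split, each task would receive exactly one work item.
Raising keysPerSplit trades fewer tasks for more work per task.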

On Sun, Nov 9, 2014 at 11:52 AM, Dieter De Witte <drdwitte@gmail.com> wrote:

> 100MB is very small, so the overhead of putting the data in HDFS is also
> very small. Does it even make sense to optimize this? (Reading/writing will
> only take a second or so.) If you don't want to stream data to HDFS and you
> have very little data, then you should look into alternative
> high-performance paradigms such as OpenMP or MPI, I think.
> Regards, D
> 2014-11-09 18:16 GMT+01:00 Trevor Harmon <trevor@vocaro.com>:
>> Hi,
>> I’m trying to model an "embarrassingly parallel" problem as a map-reduce
>> job. The amount of data is small -- about 100MB per job, and about 0.25MB
>> per work item -- but the reduce phase is very CPU-intensive, requiring
>> about 30 seconds to reduce each mapper's output to a single value. The goal
>> is to speed up the computation by distributing the tasks across many
>> machines.
>> I am not sure how the mappers would work in this scenario. My initial
>> thought was that there would be one mapper per reducer, and each mapper
>> would fetch its input directly from the source database, using an input key
>> provided by Hadoop. (Remember it’s only about 0.25MB per work item.) It
>> would then do some necessary fix-up and massaging of the data to prepare it
>> for the reduction phase.
>> However, none of the tutorials and example code I’ve seen do it this way.
>> They always copy the data from the source database to HDFS first. For my
>> use case, this seems wasteful. The data per task is very small and can fit
>> entirely in the mapper’s and reducer’s main memory, so I don’t need “big
>> data” redundant storage. Also, the data is read only once per task, so
>> there’s nothing to be gained by the data locality optimizations of HDFS.
>> Having to copy the data to an intermediate data store seems unnecessary and
>> just adds overhead in this case.
>> Is it okay to bypass HDFS for certain types of problems, such as this
>> one? Or is there some reason mappers should never perform external I/O? I
>> am very new to Hadoop so I don’t have much experience to go on here. Thank
>> you,
>> Trevor

Steven M. Lewis PhD
4221 105th Ave NE
Kirkland, WA 98033
206-384-1340 (cell)
Skype lordjoe_com
