hadoop-common-user mailing list archives

From mfc <mikefconn...@verizon.net>
Subject Re: Using Map/Reduce without HDFS?
Date Mon, 27 Aug 2007 05:46:55 GMT


I can see a benefit to this approach if it replaces random
access to a local file system with sequential access to
large files in HDFS. We are talking about physical disks, and
seek time is expensive.
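
To put rough numbers on that, here is a minimal sketch of a seek-plus-transfer cost model. The model itself (one seek per file for random access, a single seek for a sequential read) and every number in it are illustrative assumptions, not measurements from any real cluster.

```python
def read_time_seconds(num_files, bytes_per_file, seek_s, throughput_bps, sequential):
    """Estimate read time under a simple seek-plus-transfer model (an assumption)."""
    transfer = num_files * bytes_per_file / throughput_bps
    # Random access pays one seek per small file; a sequential read pays one in total.
    seeks = 1 if sequential else num_files
    return seeks * seek_s + transfer


# Assumed, illustrative inputs: 1,000,000 files of 10 KB, 10 ms seek, 50 MB/s disk.
random_s = read_time_seconds(1_000_000, 10_000, 0.010, 50e6, sequential=False)
sequential_s = read_time_seconds(1_000_000, 10_000, 0.010, 50e6, sequential=True)
print(f"random: {random_s:.0f} s, sequential: {sequential_s:.0f} s")
```

Under these assumptions the seek term dominates the random-access case, which is the effect being discussed; real disks, caches and file system layouts will shift the numbers.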

But the random access of the local file system still happens;
it just gets moved to the pre-processing step.

How about walking through the relative cost of this pre-processing step
(which still must do random access), and some approaches to how
this could be done? You mentioned cat | gzip (presumably with parallel
instances of it); is that what you do?
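
For concreteness, one way such a pre-processing step might look is sketched below. It is only a guess at the shape of a parallel "cat | gzip", not a description of what anyone on this list actually runs. The function names, the `part-NNNNN.gz` naming, and the `batch_size` and `workers` parameters are all hypothetical.

```python
import gzip
import shutil
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def concat_gzip(small_files, out_path):
    """Append each small file to one gzip output, like `cat files | gzip`."""
    with gzip.open(out_path, "wb") as out:
        for path in small_files:
            # Each small file is still opened individually (the random-access cost).
            with open(path, "rb") as src:
                shutil.copyfileobj(src, out)
    return out_path


def preprocess(input_dir, output_dir, batch_size=1000, workers=4):
    """Pack batches of small files into larger .gz files, several batches at once."""
    files = sorted(p for p in Path(input_dir).iterdir() if p.is_file())
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    batches = [files[i:i + batch_size] for i in range(0, len(files), batch_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [
            pool.submit(concat_gzip, batch, Path(output_dir) / f"part-{n:05d}.gz")
            for n, batch in enumerate(batches)
        ]
        return [f.result() for f in futures]
```

The resulting large files would then be copied into HDFS as a separate step. Note that the sketch still reads every small file once from the local file system, so it moves the random access into this step rather than removing it.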


Ted Dunning-3 wrote:
> Yes.  I am recommending a pre-processing step before the map-reduce
> program.
>
> And yes.  They do get split up again.  They also get copied to multiple
> nodes so that the reads can proceed in parallel.  The most important
> effects of concatenation and importing into HDFS are the parallelism and
> the reading of sequential disk blocks in processing.
>
> How many replicas, how many large files and how small the splits are
> determine the number of map functions that you can run in parallel
> without getting IO bound.
>
> If you are working on a small problem, then running Hadoop on a single
> node works just fine and accessing the local file system works just fine,
> but if you can do that, you might as well just write a sequential program
> in the first place.  If you have a large problem that requires
> parallelism, then reading from a local file system is likely to be a
> serious bottleneck.  This is particularly true if you are processing your
> data repeatedly, as is relatively common when, say, doing log processing
> of various kinds at multiple time scales.
>
> On 8/26/07 5:45 PM, "mfc" <mikefconnell@verizon.net> wrote:
>> [concatenation .. Compression]...but then the map/reduce job in HADOOP
>> breaks the large files back down into small chunks. This is what
>> prompted the question in the first place about running Map/Reduce
>> directly on the small files in the local file system.
>>
>> I'm wondering if doing the conversion to large files and copy into HDFS
>> would introduce a lot of overhead that would not be necessary if
>> map/reduce could be run directly on the local file system on the small
>> files.

View this message in context: http://www.nabble.com/Using-Map-Reduce-without-HDFS--tf4331338.html#a12341811
Sent from the Hadoop Users mailing list archive at Nabble.com.
