hadoop-mapreduce-user mailing list archives

From David Rosenstrauch <dar...@darose.net>
Subject Re: I need advice on whether my starting data needs to be in HDFS
Date Mon, 19 May 2014 13:30:44 GMT
The reason you want to copy to HDFS first is that HDFS splits the
data into blocks and distributes them across the nodes in the cluster.
So if your input data is large, you'll get much better efficiency/speed
by processing it in a distributed manner (i.e., multiple machines each
processing a piece of it: multiple mappers). I'd think that keeping the
data in NFS would be quite slow, since every mapper would then be
reading through the same file server.
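
To put rough numbers on "multiple mappers", here is a minimal sketch of how many input splits (and so map tasks) a data set would produce once it is in HDFS. It assumes one split per HDFS block and a 128 MB block size (the Hadoop 2.x default). The 2 TiB data set size is purely an assumption, since the original message only says "terabytes range"; the 200 concurrent tasks figure is taken from that message. The function name is made up for illustration.

```python
def estimate_map_tasks(dataset_bytes, block_bytes=128 * 1024**2, concurrent_tasks=200):
    """Return (input_splits, waves) for a data set stored in HDFS.

    Assumes one input split per HDFS block, which is the usual default.
    """
    # Integer ceiling division: a partially filled last block still needs a mapper.
    splits = -(-dataset_bytes // block_bytes)
    # A "wave" is one round of tasks filling every concurrent slot.
    waves = -(-splits // concurrent_tasks)
    return splits, waves


# Assumed 2 TiB data set (hypothetical size), 128 MB blocks, 200 concurrent tasks.
splits, waves = estimate_map_tasks(2 * 1024**4)
print(splits, waves)  # 16384 82
```

Under those assumptions each of the 40 nodes ends up working on its own share of roughly 16,000 blocks, instead of all of them pulling from one server.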



On 05/15/2014 04:45 PM, Steve Lewis wrote:
> I have a medium-sized data set in the terabytes range that currently lives
> on the NFS file server of a medium-sized institution. Every few months we
> want to run a chain of five Hadoop jobs on this data.
>     The cluster is medium sized: 40 nodes, about 200 simultaneous jobs. The
> book says copy the data to HDFS and run the job. If I consider the copy to
> HDFS and the first mapper as a single task, I wonder if it is not as easy to
> have a custom reader reading from the NFS file system as a local file and
> skip the step of copying to Hadoop.
>     While the read to the mapper may be slower, dropping the copy to HDFS
> could well make up the difference. Assume that after the job runs the data
> will be deleted from HDFS - the NFS system is the primary source and that
> cannot change. Also, the job is not I/O limited - there is significant
> computation at each step.
>      My questions are:
>    1) Are my assumptions correct, and might not copying the data save time?
>    2) Would 200 Hadoop jobs overwhelm a medium-sized NFS system?
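
For question 2, a back-of-envelope sketch of what 200 simultaneous readers do to a single NFS server may help. Every number here is a hypothetical assumption, not a measurement: the 1000 MB/s aggregate server throughput and the 2 TB data set size are placeholders to be replaced with real figures. Note that the full data set has to cross the NFS server once whether it is copied to HDFS first or read directly by the mappers, so the server's aggregate throughput bounds both approaches.

```python
def per_reader_mb_per_s(server_mb_per_s, readers):
    """Throughput each reader sees if the server's bandwidth is shared evenly."""
    return server_mb_per_s / readers


def full_read_hours(dataset_tb, server_mb_per_s):
    """Hours needed to stream the whole data set off the server once."""
    total_mb = dataset_tb * 1024 * 1024
    return total_mb / server_mb_per_s / 3600


# Hypothetical figures: 1000 MB/s aggregate NFS throughput, 200 readers, 2 TB of data.
print(per_reader_mb_per_s(1000, 200))      # 5.0 MB/s per mapper
print(round(full_read_hours(2, 1000), 2))  # 0.58 hours to read everything once
```

If the per-mapper figure comes out well below what each map task can consume, the mappers will sit waiting on the file server rather than computing.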
