hadoop-general mailing list archives

From Kevin Fox <Kevin....@pnl.gov>
Subject Re: What is the fastest way to get a large amount of data into the Hadoop HDFS file system (or Hbase)?
Date Tue, 28 Dec 2010 23:29:20 GMT
As I understand it (and please correct me if I'm wrong), a MapReduce job
has a FileSystem instance on each side: one that the data is read from on
the map side, and one that the data is written into on the reduce side.

Couldn't you run the MapReduce job on the storage cluster that holds the
archival data, feeding the map side with
http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/fs/RawLocalFileSystem.html
from the mounted POSIX parallel file system,
and writing into the Hadoop cluster on the reduce side? That way the
network in the middle would only see the reduced data set cross the
wire, and you could parallelize the data reduction as close to the
archive as possible.
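
To illustrate the shape of what I mean in plain Python (this is not
Hadoop code, and the record format is made up), the point is that the
reduce output is the only thing that needs to travel:

```python
# Toy map/reduce: the "map" runs where the archive data is local,
# and only the small reduced summary would be shipped into HDFS.
from collections import defaultdict

def map_phase(records):
    # Hypothetical map: emit (word, 1) pairs from each local record.
    for rec in records:
        for word in rec.split():
            yield word, 1

def reduce_phase(pairs):
    # Sum counts per key; this summary is what crosses the wire.
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

records = ["a b a", "b c"]          # stands in for archive records
summary = reduce_phase(map_phase(records))
```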

Thanks,
Kevin

On Tue, 2010-12-28 at 14:04 -0800, Taylor, Ronald C wrote:
> Folks,
> 
> We plan on uploading large amounts of data on a regular basis onto a Hadoop cluster,
> with HBase operating on top of Hadoop. Figure eventually on the order of multiple
> terabytes per week. So we are concerned about doing the uploads themselves as fast as
> possible from our native Linux file system into HDFS. Figure files will be in, roughly,
> the 1 to 300 GB range.
> 
> Off the top of my head, I'm thinking that doing this in parallel using a Java MapReduce
> program would work fastest. So my idea would be to have a file listing all the data files
> (full paths) to be uploaded, one per line, and then use that listing file as input to a
> MapReduce program.
> 
> Each Mapper would then upload one of the data files (using "hadoop fs -copyFromLocal
> <source> <dest>") in parallel with all the other Mappers, with the Mappers operating
> on all the nodes of the cluster, spreading the file upload out across the nodes.
> 
> Does that sound like a wise way to approach this? Are there better methods? Anything
> else out there for doing automated upload in parallel? We would very much appreciate
> advice in this area, since we believe upload speed might become a bottleneck.
> 
>   - Ron Taylor
> 
> ___________________________________________
> Ronald Taylor, Ph.D.
> Computational Biology & Bioinformatics Group
> 
> Pacific Northwest National Laboratory
> 902 Battelle Boulevard
> P.O. Box 999, Mail Stop J4-33
> Richland, WA  99352 USA
> Office:  509-372-6568
> Email: ronald.taylor@pnl.gov
> 
> 
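
P.S. Ron, your listing-file idea can be sketched outside Hadoop, purely
to show the shape of it. In a real job each "mapper" would shell out to
"hadoop fs -copyFromLocal <source> <dest>"; here shutil.copy stands in
so the logic runs anywhere, and all paths are hypothetical:

```python
# Sketch of a listing-file-driven parallel upload. shutil.copy is a
# stand-in for "hadoop fs -copyFromLocal"; swap in a subprocess call
# on a real cluster.
import shutil
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

def upload_one(src, dest_dir, copy=shutil.copy):
    # One "mapper": copy a single source file into the destination.
    dest = Path(dest_dir) / Path(src).name
    copy(src, dest)
    return str(dest)

def parallel_upload(listing_file, dest_dir, workers=8):
    # Read the listing file (one source path per line) and copy the
    # files concurrently, like mappers running in parallel.
    with open(listing_file) as f:
        sources = [line.strip() for line in f if line.strip()]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda s: upload_one(s, dest_dir), sources))
```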


