hadoop-general mailing list archives

From Patrick Angeles <patr...@cloudera.com>
Subject Re: What is the fastest way to get a large amount of data into the Hadoop HDFS file system (or Hbase)?
Date Tue, 28 Dec 2010 22:26:54 GMT

While MapReduce can help to parallelize the load effort, your likely
bottleneck is the source system (where the files come from). If the files
are coming from a single server, then parallelizing the load won't gain you
much past a certain point. You have to figure in how fast you can read the
file(s) off disk(s) and push the bits through your network and finally onto

The best scenario is if you can parallelize the reads and have a fat network
pipe (10GbE or more) going into your Hadoop cluster.
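
To put rough numbers on the bottleneck argument, here is a back-of-the-envelope sketch. It only encodes the idea that the slowest stage (disk read, network, or cluster write) bounds the whole load. The function names and every rate except the 10GbE line speed (10 Gbit/s, about 1250 MB/s) are made-up assumptions for illustration, not measurements from any real system.

```python
def effective_rate_mb_s(disk_mb_s, network_mb_s, cluster_write_mb_s):
    """The slowest stage bounds the throughput of the whole pipeline."""
    return min(disk_mb_s, network_mb_s, cluster_write_mb_s)


def hours_to_load(terabytes, rate_mb_s):
    """Hours needed to move `terabytes` at a sustained `rate_mb_s`."""
    total_mb = terabytes * 1_000_000  # decimal units: 1 TB = 1,000,000 MB
    return total_mb / rate_mb_s / 3600


# Hypothetical figures: one source disk at 100 MB/s, a 10GbE link
# (10 Gbit/s / 8 = 1250 MB/s), and a cluster assumed to absorb 2000 MB/s.
single_reader = effective_rate_mb_s(100, 1250, 2000)

# Same link and cluster, but reads spread over enough disks (assumed
# 1500 MB/s aggregate) that the network becomes the limit instead.
parallel_readers = effective_rate_mb_s(1500, 1250, 2000)

print(round(hours_to_load(5, single_reader), 1))     # 13.9
print(round(hours_to_load(5, parallel_readers), 1))  # 1.1
```

With these assumed rates, adding more parallel writers on the Hadoop side changes nothing in the first case, because the single source disk is already the limit.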


- Patrick

On Tue, Dec 28, 2010 at 5:04 PM, Taylor, Ronald C <ronald.taylor@pnl.gov> wrote:

> Folks,
> We plan on uploading large amounts of data on a regular basis onto a Hadoop
> cluster, with Hbase operating on top of Hadoop. Figure eventually on the
> order of multiple terabytes per week. So - we are concerned about doing the
> uploads themselves as fast as possible from our native Linux file system
> into HDFS. Figure files will be in, roughly, the 1 to 300 GB range.
> Off the top of my head, I'm thinking that doing this in parallel using a
> Java MapReduce program would work fastest. So my idea would be to have a
> file listing all the data files (full paths) to be uploaded, one per line,
> and then use that listing file as input to a MapReduce program.
> Each Mapper would then upload one of the data files (using "hadoop fs
> -copyFromLocal <source> <dest>") in parallel with all the other Mappers,
> with the Mappers operating on all the nodes of the cluster, spreading out
> the file upload across the nodes.
> Does that sound like a wise way to approach this? Are there better methods?
> Anything else out there for doing automated upload in parallel? We would
> very much appreciate advice in this area, since we believe upload speed
> might become a bottleneck.
>  - Ron Taylor
> ___________________________________________
> Ronald Taylor, Ph.D.
> Computational Biology & Bioinformatics Group
> Pacific Northwest National Laboratory
> 902 Battelle Boulevard
> P.O. Box 999, Mail Stop J4-33
> Richland, WA  99352 USA
> Office:  509-372-6568
> Email: ronald.taylor@pnl.gov
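
For reference, the listing-file idea in the quoted message can be tried on a small scale without writing a MapReduce job. The sketch below is not the MapReduce program Ron describes: it swaps in a plain thread pool on the machine that holds the files, and assumes the `hadoop` command is on the PATH there. It reads a listing with one full path per line and runs the quoted "hadoop fs -copyFromLocal <source> <dest>" command for several files at once. The function names, the `workers` default, and the injectable `runner` are hypothetical choices made for illustration.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def read_listing(text):
    """Return the non-empty, stripped paths from a one-path-per-line listing."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def build_command(source, dest):
    """Build the copyFromLocal command from the quoted message for one file."""
    return ["hadoop", "fs", "-copyFromLocal", source, dest]


def run_command(cmd):
    """Run one upload and return its exit code (0 means success)."""
    return subprocess.run(cmd).returncode


def upload_all(paths, dest, workers=4, runner=run_command):
    """Upload every path with at most `workers` copies in flight at once.

    Returns a list of (path, exit_code) pairs in the same order as `paths`.
    `runner` can be replaced for testing without a Hadoop install.
    """
    commands = [build_command(p, dest) for p in paths]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        codes = list(pool.map(runner, commands))
    return list(zip(paths, codes))
```

A call would look like `upload_all(read_listing(open("files.txt").read()), "/incoming")`, where `files.txt` and `/incoming` are placeholder names. Any pair with a nonzero exit code marks a file worth retrying.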
