hadoop-general mailing list archives

From: "Taylor, Ronald C" <ronald.tay...@pnl.gov>
Subject: RE: What is the fastest way to get a large amount of data into the Hadoop HDFS file system (or Hbase)?
Date: Mon, 03 Jan 2011 20:43:55 GMT

Hi Dave,

Thanks for the suggestions. Glad to hear from a fellow DOE national lab person! 

We are just starting to explore all this here at Pacific Northwest Nat Lab, and what will
be going into Hbase and what will be left as files in HDFS is an open question, to be empirically
determined over the coming year. It will depend upon what instrument data gets put in,
how the users want to analyze the data, what turns out to be practical for future growth and
maintenance, etc. My lab colleagues Kevin Fox and David Brown have a lot more experience handling
massive amounts of data - they are already handling hundreds of TBs in the archive cluster
for EMSL, our national user facility (lots of mass spec, NMR, microscopy, and next gen sequencing
machines for biology and chemistry, as you may already know). And they have a much better grip
on the hardware and OS side of things. So I imagine you & the list will be hearing directly
from them fairly often as questions arise.


-----Original Message-----
From: Buttler, David [mailto:buttler1@llnl.gov] 
Sent: Monday, January 03, 2011 12:21 PM
To: user@hbase.apache.org; 'general@hadoop.apache.org'
Cc: Fox, Kevin M; Brown, David M JR
Subject: RE: What is the fastest way to get a large amount of data into the Hadoop HDFS file
system (or Hbase)?

Hi Ron,
Loading into HDFS and HBase are two different issues.  

HDFS: if you have a large number of files to load from your NFS file system into HDFS, it is
not clear that parallelizing the load will help.  You have two potential bottlenecks: the
NFS file system and the HDFS file system.  In your parallel example, you will likely saturate
your NFS file system first.  If they are actually local files, then loading them via M/R is
a non-starter, as you have no control over which machine will get a map task, unless all of
the machines have files in the same directory and you are just going to look in that directory
to upload.  In that case, it sounds like more of a job for a parallel shell command and less of a
map/reduce command.
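
A rough sketch of the "parallel shell command" approach, assuming the hadoop client is on
the PATH of the machine doing the upload. The function names, the worker count, and the
injectable `run` parameter are illustrative assumptions, not part of any Hadoop API. Only
the `hadoop fs -copyFromLocal <source> <dest>` command comes from this thread.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def build_copy_command(source, dest_dir):
    """Return the argv for copying one local file into an HDFS directory."""
    return ["hadoop", "fs", "-copyFromLocal", source, dest_dir]


def upload_all(sources, dest_dir, workers=4, run=subprocess.call):
    """Copy every source file, with at most `workers` copies in flight.

    `run` is any callable that takes an argv list and returns an exit code;
    it defaults to subprocess.call and can be replaced for a dry run.
    Returns a dict mapping each source path to its exit code.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        codes = list(
            pool.map(lambda src: run(build_copy_command(src, dest_dir)), sources)
        )
    return dict(zip(sources, codes))
```

Keeping `workers` small at first and raising it while watching throughput is one way to
find the point at which the NFS side saturates.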

HBase: So far my strategy has been to get the files into HDFS first, and then write a Map
job to load them into HBase.  You can try to do this and see if direct inserts into HBase
are fast enough for your use case.  But if you are going to TBs/week, then you will likely
want to investigate the bulk load features.  I haven't yet incorporated that into my workflow,
so I can't offer much advice there. Just be sure your cluster is sized appropriately.  E.g.,
with your compression turned on in HBase, see how much a 1 GB input file expands to inside
HBase / HDFS.  That should give you a feel for how much space you will need for your expected
data load.
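
The sizing check above comes down to simple arithmetic, sketched here. The function names
and the example figures are hypothetical. The sketch assumes the stored size is measured
for a single replica, so the HDFS replication factor (3 by default) is applied separately.

```python
def expansion_ratio(input_bytes, stored_bytes):
    """Ratio of on-disk size (one replica) to the size of the raw input."""
    return stored_bytes / input_bytes


def required_capacity_tb(weekly_input_tb, weeks, ratio, replication=3):
    """Estimate raw disk needed, in TB, for `weeks` of ingest.

    ratio:       expansion ratio measured on a sample file
    replication: number of HDFS replicas kept for each block
    """
    return weekly_input_tb * weeks * ratio * replication


# Hypothetical figures: a 1 GB sample that occupies 1.5 GB once loaded,
# and 2 TB of new input per week for a year.
ratio = expansion_ratio(1_000_000_000, 1_500_000_000)
print(required_capacity_tb(2, 52, ratio))
```

The estimate leaves out headroom for compactions, temporary job output, and growth in
the ingest rate, so treat it as a lower bound.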


-----Original Message-----
From: Taylor, Ronald C [mailto:ronald.taylor@pnl.gov] 
Sent: Tuesday, December 28, 2010 2:05 PM
To: 'user@hbase.apache.org'; 'general@hadoop.apache.org'
Cc: Taylor, Ronald C; Fox, Kevin M; Brown, David M JR
Subject: What is the fastest way to get a large amount of data into the Hadoop HDFS file system
(or Hbase)?


We plan on uploading large amounts of data on a regular basis onto a Hadoop cluster, with
Hbase operating on top of Hadoop. Figure eventually on the order of multiple terabytes per
week. So - we are concerned about doing the uploads themselves as fast as possible from our
native Linux file system into HDFS. Figure files will be in, roughly, the 1 to 300 GB range.

Off the top of my head, I'm thinking that doing this in parallel using a Java MapReduce program
would work fastest. So my idea would be to have a file listing all the data files (full paths)
to be uploaded, one per line, and then use that listing file as input to a MapReduce program.

Each Mapper would then upload one of the data files (using "hadoop fs -copyFromLocal <source>
<dest>") in parallel with all the other Mappers, with the Mappers operating on all the
nodes of the cluster, spreading out the file upload across the nodes.
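
One way to sketch the per-Mapper step is as a script that reads paths from the listing
file, one per line, and runs the copy command for each. This uses Python rather than the
Java program described above, and it assumes the script is fed its share of the listing
on standard input (as a streaming-style mapper would be) and that every listed path is
readable from whichever node runs it, e.g. over a shared mount. The function names and
the injectable `run` parameter are illustrative.

```python
import subprocess
import sys


def parse_listing(lines):
    """Yield the paths from a listing with one full path per line, skipping blanks."""
    for line in lines:
        path = line.strip()
        if path:
            yield path


def map_upload(lines, dest_dir, run=subprocess.call):
    """Upload each listed file and yield 'path<TAB>exit_code' records.

    `run` takes an argv list and returns an exit code; it defaults to
    subprocess.call and can be replaced for testing.
    """
    for path in parse_listing(lines):
        code = run(["hadoop", "fs", "-copyFromLocal", path, dest_dir])
        yield "%s\t%d" % (path, code)


def main(argv, stdin=sys.stdin):
    """Entry point: argv[1] is the HDFS destination directory."""
    for record in map_upload(stdin, argv[1]):
        print(record)
```

Emitting the exit code per file gives the job output a record of which uploads failed
and need to be retried.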

Does that sound like a wise way to approach this? Are there better methods? Anything else
out there for doing automated upload in parallel? We would very much appreciate advice in
this area, since we believe upload speed might become a bottleneck.

  - Ron Taylor

Ronald Taylor, Ph.D.
Computational Biology & Bioinformatics Group

Pacific Northwest National Laboratory
902 Battelle Boulevard
P.O. Box 999, Mail Stop J4-33
Richland, WA  99352 USA
Office:  509-372-6568
Email: ronald.taylor@pnl.gov
