From: "Buttler, David" <buttler1@llnl.gov>
To: user@hbase.apache.org, general@hadoop.apache.org
CC: "Fox, Kevin M", "Brown, David M JR"
Date: Mon, 3 Jan 2011 12:20:53 -0800
Subject: RE: What is the fastest way to get a large amount of data into the Hadoop HDFS file system (or Hbase)?

Hi Ron,

Loading into HDFS and loading into HBase are two different issues.

HDFS: if you have a large number of files to load from your NFS file system into HDFS, it is not clear that parallelizing the load will help. You have two potential bottlenecks: the NFS file system and the HDFS file system. In your parallel example, you will likely saturate the NFS file system first. If the files are actually local to individual machines, then loading them via MapReduce is a non-starter, since you have no control over which machine gets a given map task. The exception is if every machine has files in the same local directory and each task just uploads whatever it finds there; in that case it sounds more like a job for a parallel shell command than for MapReduce.

HBase: so far my strategy has been to get the files into HDFS first, and then write a map job to load them into HBase. You can try that and see whether direct inserts into HBase are fast enough for your use case. But if you are going to load terabytes per week, you will likely want to investigate the bulk load features. I haven't yet incorporated those into my workflow, so I can't offer much advice there. Just be sure your cluster is sized appropriately: e.g., with compression turned on in HBase, see how much a 1 GB input file expands to inside HBase/HDFS. That should give you a feel for how much space you will need for your expected data load.

Dave
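To make the two alternatives above concrete, here is a minimal sketch of the non-MapReduce path: a single loader process that reads a listing file of local (or NFS-mounted) paths and copies each file into HDFS with a small thread pool. The class name, pool size, and paths are illustrative assumptions, not anyone's production workflow.

// A minimal sketch: copy a list of local or NFS-mounted files into HDFS
// from a single loader machine with a small thread pool, instead of
// running a MapReduce job.
// Usage (illustrative): java ParallelHdfsLoader file-list.txt hdfs://namenode:8020/data/incoming
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ParallelHdfsLoader {
    public static void main(String[] args) throws Exception {
        List<String> localPaths = Files.readAllLines(Paths.get(args[0])); // one source path per line
        Path hdfsDir = new Path(args[1]);                                 // target HDFS directory

        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(hdfsDir.toUri(), conf);

        // A handful of concurrent copies is usually enough to saturate the
        // NFS mount or the loader's NIC; more threads rarely help.
        ExecutorService pool = Executors.newFixedThreadPool(4);
        for (String local : localPaths) {
            pool.submit(() -> {
                try {
                    fs.copyFromLocalFile(new Path(local), hdfsDir);
                } catch (Exception e) {
                    System.err.println("failed to copy " + local + ": " + e);
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.DAYS);
        fs.close();
    }
}

And a similarly minimal sketch of the "map job that loads HDFS files into HBase" step, doing direct Puts through TableOutputFormat. The table name, column family, and record layout are placeholders for whatever the real data looks like, and it assumes hbase-site.xml is on the job's classpath.

// A minimal sketch of a map-only job that reads tab-separated lines from
// HDFS and inserts them into an existing HBase table with direct Puts.
// The table name, column family "d", and record layout are assumptions.
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableOutputFormat;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class HdfsToHBaseLoad {

    // Each input line becomes one Put: the first field is the row key, the
    // rest is stored as a single cell. A real loader would parse real records.
    public static class PutMapper
            extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split("\t", 2);
            Put put = new Put(Bytes.toBytes(fields[0]));
            put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("raw"),
                    Bytes.toBytes(fields.length > 1 ? fields[1] : ""));
            context.write(new ImmutableBytesWritable(put.getRow()), put);
        }
    }

    public static void main(String[] args) throws Exception {
        // Assumes hbase-site.xml (ZooKeeper quorum, etc.) is on the classpath.
        Configuration conf = HBaseConfiguration.create();
        conf.set(TableOutputFormat.OUTPUT_TABLE, args[1]); // target table name

        Job job = Job.getInstance(conf, "hdfs to hbase load");
        job.setJarByClass(HdfsToHBaseLoad.class);
        job.setMapperClass(PutMapper.class);
        job.setNumReduceTasks(0);
        job.setOutputFormatClass(TableOutputFormat.class);
        job.setOutputKeyClass(ImmutableBytesWritable.class);
        job.setOutputValueClass(Put.class);
        FileInputFormat.addInputPath(job, new Path(args[0])); // input already in HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

At TB/week volumes the main design choice is between this Put-based path, which funnels everything through the write-ahead log and memstores, and the bulk load route, which writes HFiles directly and then hands them to the region servers.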
-----Original Message-----
From: Taylor, Ronald C [mailto:ronald.taylor@pnl.gov]
Sent: Tuesday, December 28, 2010 2:05 PM
To: 'user@hbase.apache.org'; 'general@hadoop.apache.org'
Cc: Taylor, Ronald C; Fox, Kevin M; Brown, David M JR
Subject: What is the fastest way to get a large amount of data into the Hadoop HDFS file system (or Hbase)?

Folks,

We plan on uploading large amounts of data on a regular basis onto a Hadoop cluster, with HBase operating on top of Hadoop. Figure eventually on the order of multiple terabytes per week. So we are concerned about doing the uploads themselves as fast as possible from our native Linux file system into HDFS. Figure files will be roughly in the 1 to 300 GB range.

Off the top of my head, I'm thinking that doing this in parallel using a Java MapReduce program would be fastest. My idea would be to have a file listing all the data files (full paths) to be uploaded, one per line, and then use that listing file as input to a MapReduce program.

Each mapper would then upload one of the data files (using "hadoop fs -copyFromLocal <local file> <HDFS dir>") in parallel with all the other mappers, with the mappers operating on all the nodes of the cluster, spreading the file uploads across the nodes.

Does that sound like a wise way to approach this? Are there better methods? Anything else out there for doing automated upload in parallel? We would very much appreciate advice in this area, since we believe upload speed might become a bottleneck.

 - Ron Taylor

___________________________________________
Ronald Taylor, Ph.D.
Computational Biology & Bioinformatics Group
Pacific Northwest National Laboratory
902 Battelle Boulevard
P.O. Box 999, Mail Stop J4-33
Richland, WA 99352 USA
Office: 509-372-6568
Email: ronald.taylor@pnl.gov
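A minimal sketch of the mapper-per-file upload Ron describes above, assuming the source paths in the listing file are readable on every worker node (for example via a shared NFS mount), which is exactly the caveat Dave raises in his reply. It uses the FileSystem API inside the map task rather than shelling out to "hadoop fs -copyFromLocal"; the class name and the "upload.target.dir" property are illustrative.

// A minimal sketch of the mapper-per-file upload idea: the job input is a
// listing file (one source path per line, stored in HDFS), and each map
// task copies one file into HDFS. This only works if the source path is
// readable on whichever node runs the task, e.g. via a shared NFS mount.
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

public class DistributedUploadJob {

    public static class UploadMapper
            extends Mapper<LongWritable, Text, NullWritable, NullWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            Configuration conf = context.getConfiguration();
            Path src = new Path(value.toString().trim());        // local/NFS source file
            Path dst = new Path(conf.get("upload.target.dir"));  // HDFS target directory
            FileSystem fs = dst.getFileSystem(conf);
            fs.copyFromLocalFile(src, dst);                      // the copy is the whole "map"
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("upload.target.dir", args[1]);
        conf.setInt(NLineInputFormat.LINES_PER_MAP, 1); // one file per map task

        Job job = Job.getInstance(conf, "distributed hdfs upload");
        job.setJarByClass(DistributedUploadJob.class);
        job.setMapperClass(UploadMapper.class);
        job.setNumReduceTasks(0);
        job.setInputFormatClass(NLineInputFormat.class);
        job.setOutputFormatClass(NullOutputFormat.class);
        NLineInputFormat.addInputPath(job, new Path(args[0])); // the listing file, in HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

With NLineInputFormat set to one line per split, each source file becomes its own map task, so the copies spread across the cluster; the listing file itself must already live in HDFS so the job can read it.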