From: "Hiller, Dean (Contractor)" <dean.hiller@broadridge.com>
Date: Wed, 29 Dec 2010 14:16:22 -0700
Subject: RE: What is the fastest way to get a large amount of data into the Hadoop HDFS file system (or Hbase)?

I wonder if having Linux mount HDFS would help here, so that as people put files into your Linux /hdfs directory they are actually being written to HDFS and not to the local Linux disk ;) (Yeah, you still have that one-machine bottleneck as the files come in, unless that part can be clustered somehow too.) Just google mounting HDFS from Linux... something that sounds pretty cool that we may be using later.

Later,
Dean
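For anyone curious what that mount is doing under the covers, here is a minimal sketch of writing a local file straight into HDFS through the Java FileSystem API, which is roughly what the FUSE mount does transparently for every write under /hdfs. The NameNode address and both paths are made-up examples, not anything from an actual setup:

    import java.io.InputStream;
    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class DirectHdfsWrite {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Assumed NameNode URI -- substitute your own cluster address.
            FileSystem hdfs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);

            Path src = new Path("file:///data/incoming/sample.dat");  // hypothetical local file
            Path dst = new Path("/ingest/sample.dat");                // hypothetical HDFS target

            // Stream the bytes straight to the DataNodes; 'true' closes both streams.
            InputStream in = src.getFileSystem(conf).open(src);
            IOUtils.copyBytes(in, hdfs.create(dst), 4096, true);
        }
    }

The point either way is the same: the bytes cross the wire to HDFS once, instead of landing on a local disk first and being copied a second time.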
-----Original Message-----
From: Taylor, Ronald C [mailto:ronald.taylor@pnl.gov]
Sent: Tuesday, December 28, 2010 5:05 PM
To: Fox, Kevin M; Patrick Angeles
Cc: general@hadoop.apache.org; user@hbase.apache.org; Brown, David M JR; Taylor, Ronald C
Subject: RE: What is the fastest way to get a large amount of data into the Hadoop HDFS file system (or Hbase)?

Hi Kevin,

So - from what Patrick and Ted are saying, it sounds like we want the best
way to parallelize a source-based push, rather than doing a parallelized
pull through a MapReduce program. And I see that what you ask about below
is on parallelizing a push, so we are on the same page.

Ron

-----Original Message-----
From: Fox, Kevin M
Sent: Tuesday, December 28, 2010 3:39 PM
To: Patrick Angeles
Cc: general@hadoop.apache.org; user@hbase.apache.org; Taylor, Ronald C; Brown, David M JR
Subject: Re: What is the fastest way to get a large amount of data into the Hadoop HDFS file system (or Hbase)?

On Tue, 2010-12-28 at 14:26 -0800, Patrick Angeles wrote:
> Ron,
>
> While MapReduce can help to parallelize the load effort, your likely
> bottleneck is the source system (where the files come from). If the
> files are coming from a single server, then parallelizing the load
> won't gain you much past a certain point. You have to figure in how
> fast you can read the file(s) off disk(s) and push the bits through
> your network and finally onto HDFS.
>
> The best scenario is if you can parallelize the reads and have a fat
> network pipe (10GbE or more) going into your Hadoop cluster.

We have a way to parallelize a push from the archive storage cluster to
the hadoop storage cluster. Is there a way to target a particular storage
node with a push into the hadoop file system?

The hadoop cluster nodes are 1gig attached to its core switch and we have
a 10 gig uplink to the core from the storage archive. Say, we have 4 nodes
in each storage cluster (we have more, just a simplified example):

a0 --\                                /-- h0
a1 --+                                +-- h1
a2 --+ (A switch) -10gige- (h switch) +-- h2
a3 --/                                \-- h3

I want to be able to have a0 talk to h0 and not have h0 decide the data
belongs on h3, slowing down a3's ability to write data into h3, greatly
reducing bandwidth.

Thanks,
Kevin

>
> Regards,
>
> - Patrick
>
> On Tue, Dec 28, 2010 at 5:04 PM, Taylor, Ronald C wrote:
>
> Folks,
>
> We plan on uploading large amounts of data on a regular basis
> onto a Hadoop cluster, with Hbase operating on top of Hadoop.
> Figure eventually on the order of multiple terabytes per week.
> So - we are concerned about doing the uploads themselves as
> fast as possible from our native Linux file system into HDFS.
> Figure files will be in, roughly, the 1 to 300 GB range.
>
> Off the top of my head, I'm thinking that doing this in
> parallel using a Java MapReduce program would work fastest. So
> my idea would be to have a file listing all the data files
> (full paths) to be uploaded, one per line, and then use that
> listing file as input to a MapReduce program.
>
> Each Mapper would then upload one of the data files (using
> "hadoop fs -copyFromLocal <src> <dst>") in parallel with
> all the other Mappers, with the Mappers operating on all the
> nodes of the cluster, spreading out the file upload across the
> nodes.
>
> Does that sound like a wise way to approach this? Are there
> better methods? Anything else out there for doing automated
> upload in parallel?
> We would very much appreciate advice in
> this area, since we believe upload speed might become a
> bottleneck.
>
> - Ron Taylor
>
> ___________________________________________
> Ronald Taylor, Ph.D.
> Computational Biology & Bioinformatics Group
>
> Pacific Northwest National Laboratory
> 902 Battelle Boulevard
> P.O. Box 999, Mail Stop J4-33
> Richland, WA 99352 USA
> Office: 509-372-6568
> Email: ronald.taylor@pnl.gov
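For the listing-file idea Ron describes above, here is a rough sketch of what each mapper could look like: a map-only job whose input is the file of source paths, where each map() call copies one file into HDFS through the FileSystem API instead of shelling out to "hadoop fs -copyFromLocal". It assumes the source paths are visible from the task nodes (for example over an NFS mount); the /ingest target directory and the class name are made up:

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.FileUtil;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Map-only job: the job input is the listing file (one source path per line),
    // and each map() call copies that one file into HDFS.
    public class CopyFileMapper
            extends Mapper<LongWritable, Text, Text, NullWritable> {

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            Configuration conf = context.getConfiguration();

            Path src = new Path(line.toString().trim());    // e.g. file:///archive/run42/big.dat (made up)
            Path dst = new Path("/ingest", src.getName());  // hypothetical HDFS target directory

            FileSystem srcFs = src.getFileSystem(conf);
            FileSystem dstFs = dst.getFileSystem(conf);

            // Same effect as "hadoop fs -copyFromLocal <src> <dst>" for this one file.
            FileUtil.copy(srcFs, src, dstFs, dst, false /* keep the source */, conf);

            context.write(new Text(dst.toString()), NullWritable.get());
        }
    }

A side note that also touches Kevin's question: because each map task runs on a DataNode, the default block placement policy puts the first replica of whatever that task writes on the task's own node, so the writes spread across the cluster along with the tasks. Patrick's point still stands, though, that the source system and the network into the cluster are the likely bottleneck.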