From: "Buttler, David" <buttler1@llnl.gov>
To: user@hbase.apache.org, general@hadoop.apache.org
CC: "Fox, Kevin M", "Brown, David M JR"
Date: Mon, 3 Jan 2011 12:20:53 -0800
Subject: RE: What is the fastest way to get a large amount of data into the Hadoop HDFS file system (or Hbase)?

Hi Ron,

Loading into HDFS and loading into HBase are two different issues.

HDFS: if you have a large number of files to load from your NFS file system into HDFS, it is not clear that parallelizing the load will help. You have two potential bottlenecks: the NFS file system and the HDFS file system. In your parallel example, you will likely saturate the NFS file system first. If the files are actually local to individual machines, then loading them via MapReduce is a non-starter, since you have no control over which machine gets a given map task. The exception is if every machine has files in the same local directory and each task just uploads whatever it finds there; in that case it sounds more like a job for a parallel shell command than for MapReduce.

HBase: so far my strategy has been to get the files into HDFS first, and then write a map job to load them into HBase. You can try that and see whether direct inserts into HBase are fast enough for your use case. But if you are going to load terabytes per week, you will likely want to investigate the bulk load features. I haven't yet incorporated those into my workflow, so I can't offer much advice there. Just be sure your cluster is sized appropriately: e.g., with compression turned on in HBase, see how much a 1 GB input file expands to inside HBase/HDFS. That should give you a feel for how much space you will need for your expected data load.

Dave
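To make the two alternatives above concrete, here is a minimal sketch of the non-MapReduce path: a single loader process that reads a listing file of local (or NFS-mounted) paths and copies each file into HDFS with a small thread pool. The class name, pool size, and paths are illustrative assumptions, not anyone's production workflow.

// A minimal sketch: copy a list of local or NFS-mounted files into HDFS
// from a single loader machine with a small thread pool, instead of
// running a MapReduce job.
// Usage (illustrative): java ParallelHdfsLoader file-list.txt hdfs://namenode:8020/data/incoming
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ParallelHdfsLoader {
    public static void main(String[] args) throws Exception {
        List<String> localPaths = Files.readAllLines(Paths.get(args[0])); // one source path per line
        Path hdfsDir = new Path(args[1]);                                 // target HDFS directory

        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(hdfsDir.toUri(), conf);

        // A handful of concurrent copies is usually enough to saturate the
        // NFS mount or the loader's NIC; more threads rarely help.
        ExecutorService pool = Executors.newFixedThreadPool(4);
        for (String local : localPaths) {
            pool.submit(() -> {
                try {
                    fs.copyFromLocalFile(new Path(local), hdfsDir);
                } catch (Exception e) {
                    System.err.println("failed to copy " + local + ": " + e);
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.DAYS);
        fs.close();
    }
}

And a similarly minimal sketch of the "map job that loads HDFS files into HBase" step, doing direct Puts through TableOutputFormat. The table name, column family, and record layout are placeholders for whatever the real data looks like, and it assumes hbase-site.xml is on the job's classpath.

// A minimal sketch of a map-only job that reads tab-separated lines from
// HDFS and inserts them into an existing HBase table with direct Puts.
// The table name, column family "d", and record layout are assumptions.
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableOutputFormat;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class HdfsToHBaseLoad {

    // Each input line becomes one Put: the first field is the row key, the
    // rest is stored as a single cell. A real loader would parse real records.
    public static class PutMapper
            extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split("\t", 2);
            Put put = new Put(Bytes.toBytes(fields[0]));
            put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("raw"),
                    Bytes.toBytes(fields.length > 1 ? fields[1] : ""));
            context.write(new ImmutableBytesWritable(put.getRow()), put);
        }
    }

    public static void main(String[] args) throws Exception {
        // Assumes hbase-site.xml (ZooKeeper quorum, etc.) is on the classpath.
        Configuration conf = HBaseConfiguration.create();
        conf.set(TableOutputFormat.OUTPUT_TABLE, args[1]); // target table name

        Job job = Job.getInstance(conf, "hdfs to hbase load");
        job.setJarByClass(HdfsToHBaseLoad.class);
        job.setMapperClass(PutMapper.class);
        job.setNumReduceTasks(0);
        job.setOutputFormatClass(TableOutputFormat.class);
        job.setOutputKeyClass(ImmutableBytesWritable.class);
        job.setOutputValueClass(Put.class);
        FileInputFormat.addInputPath(job, new Path(args[0])); // input already in HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

At TB/week volumes the main design choice is between this Put-based path, which funnels everything through the write-ahead log and memstores, and the bulk load route, which writes HFiles directly and then hands them to the region servers.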
-----Original Message-----
From: Taylor, Ronald C [mailto:ronald.taylor@pnl.gov]
Sent: Tuesday, December 28, 2010 2:05 PM
To: 'user@hbase.apache.org'; 'general@hadoop.apache.org'
Cc: Taylor, Ronald C; Fox, Kevin M; Brown, David M JR
Subject: What is the fastest way to get a large amount of data into the Hadoop HDFS file system (or Hbase)?

Folks,

We plan on uploading large amounts of data on a regular basis onto a Hadoop cluster, with HBase operating on top of Hadoop. Figure eventually on the order of multiple terabytes per week. So we are concerned about doing the uploads themselves as fast as possible from our native Linux file system into HDFS. Figure files will be roughly in the 1 to 300 GB range.

Off the top of my head, I'm thinking that doing this in parallel using a Java MapReduce program would be fastest. My idea would be to have a file listing all the data files (full paths) to be uploaded, one per line, and then use that listing file as input to a MapReduce program.

Each mapper would then upload one of the data files (using "hadoop fs -copyFromLocal <local file> <HDFS dir>") in parallel with all the other mappers, with the mappers operating on all the nodes of the cluster, spreading the file uploads across the nodes.

Does that sound like a wise way to approach this? Are there better methods? Anything else out there for doing automated upload in parallel? We would very much appreciate advice in this area, since we believe upload speed might become a bottleneck.

 - Ron Taylor

___________________________________________
Ronald Taylor, Ph.D.
Computational Biology & Bioinformatics Group
Pacific Northwest National Laboratory
902 Battelle Boulevard
P.O. Box 999, Mail Stop J4-33
Richland, WA 99352 USA
Office: 509-372-6568
Email: ronald.taylor@pnl.gov
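A minimal sketch of the mapper-per-file upload Ron describes above, assuming the source paths in the listing file are readable on every worker node (for example via a shared NFS mount), which is exactly the caveat Dave raises in his reply. It uses the FileSystem API inside the map task rather than shelling out to "hadoop fs -copyFromLocal"; the class name and the "upload.target.dir" property are illustrative.

// A minimal sketch of the mapper-per-file upload idea: the job input is a
// listing file (one source path per line, stored in HDFS), and each map
// task copies one file into HDFS. This only works if the source path is
// readable on whichever node runs the task, e.g. via a shared NFS mount.
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

public class DistributedUploadJob {

    public static class UploadMapper
            extends Mapper<LongWritable, Text, NullWritable, NullWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            Configuration conf = context.getConfiguration();
            Path src = new Path(value.toString().trim());        // local/NFS source file
            Path dst = new Path(conf.get("upload.target.dir"));  // HDFS target directory
            FileSystem fs = dst.getFileSystem(conf);
            fs.copyFromLocalFile(src, dst);                      // the copy is the whole "map"
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("upload.target.dir", args[1]);
        conf.setInt(NLineInputFormat.LINES_PER_MAP, 1); // one file per map task

        Job job = Job.getInstance(conf, "distributed hdfs upload");
        job.setJarByClass(DistributedUploadJob.class);
        job.setMapperClass(UploadMapper.class);
        job.setNumReduceTasks(0);
        job.setInputFormatClass(NLineInputFormat.class);
        job.setOutputFormatClass(NullOutputFormat.class);
        NLineInputFormat.addInputPath(job, new Path(args[0])); // the listing file, in HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

With NLineInputFormat set to one line per split, each source file becomes its own map task, so the copies spread across the cluster; the listing file itself must already live in HDFS so the job can read it.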