Subject: Re: Faster Bulkload from Oracle to HBase
From: Tim Robertson <timrobertson100@gmail.com>
To: user@hbase.apache.org
Date: Tue, 31 Jan 2012 09:19:01 +0100

Hi Laxman,

We use both #1 and #3 from MySQL, which also has high-speed exports. For
our 300G and 340M rows, #1 takes us around 3 hours; with Sqoop it is
closer to 8 hours into our 3-node cluster. We are having issues with
delimiters though (since we have \r, \t and \n in the database), and are
now using Avro to overcome these. A colleague of mine is currently working
on two patches for Sqoop to fix some bugs in this area.

For the majority of our processing we don't use HBase but go straight to
HDFS, and we use Sqoop wrapped up in Oozie workflows, which is working
very nicely. We do a lot of ETL processing from one DB to another through
an Oozie workflow combining Sqoop and Hive.

HTH,
Tim

On Tue, Jan 31, 2012 at 7:21 AM, Jonathan Hsieh wrote:
> Hi Laxman,
>
> I'm an Apache HBase and Sqoop committer. I haven't run the comparison
> you've suggested, but my first thoughts are to consider #1 and #3.
>
> Case 1 will natively export data, which should be the fastest way to get
> data out of Oracle. You may need to do some reformatting to use HBase's
> importtsv.
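As a rough illustration of that last step, an importtsv run that writes
HFiles and then completes the bulk load might look like the two commands
below; the table name, column mapping, and paths are invented for
illustration, and <version> stands in for whatever HBase jar is installed:

  hadoop jar $HBASE_HOME/hbase-<version>.jar importtsv \
    -Dimporttsv.columns=HBASE_ROW_KEY,cf:col1,cf:col2 \
    -Dimporttsv.bulk.output=/tmp/hfiles \
    my_table /user/me/oracle_dump

  hadoop jar $HBASE_HOME/hbase-<version>.jar completebulkload \
    /tmp/hfiles my_table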
>
> Case 2, writing your own DBInputFormat, is essentially duplicating what
> Sqoop does in the generic case. Here it is essentially executing SQL
> queries against the RDBMS to get data out, which will likely be a bit
> slower than a database's native bulk export feature.
>
> Case 3, consider using Apache Sqoop in conjunction with Quest's data
> connector for Oracle and Hadoop. It is a free-as-in-beer plugin to Sqoop!
> This is probably the fastest from a getting-started point of view (no dev
> time) but may not be as performant as #1. I'd give it a try.
> http://www.quest.com/data-connector-for-oracle-and-hadoop/
>
> If you have more questions about Sqoop, feel free to cross-post to
> sqoop-user@incubator.apache.org!
>
> Jon.
>
> On Mon, Jan 30, 2012 at 5:48 AM, Laxman wrote:
>
>> Hi,
>>
>> We have the following use case.
>>
>> We have data in a relational database (Oracle).
>> We need to export this data to HBase and perform analysis on it.
>> We need to perform this export/import of 500G periodically, say every
>> month.
>>
>> Following are the different approaches I can see, as per my knowledge.
>> Before testing and finding out the best way by myself, I wanted to
>> listen to the experts here.
>>
>> Approach #1
>> ===========
>> 1) Export from Oracle to a raw text file (using the Oracle export
>> utility - faster, involves no transactional overhead)
>>
>> 2) Upload the text file to HDFS
>>
>> 3) Run the bulk load job (HFileOutputFormat.configureIncrementalLoad())
>>
>> Approach #2
>> ===========
>> 1) Write a custom job using DBInputFormat to read directly from the
>> database.
>>        - Just a thought, to avoid the multiple hops (Oracle to local FS,
>> local FS to HDFS, HDFS to HBase) involved in approach #1.
>>
>> 2) Use the HBase bulk load tool to load this data into HBase
>> (HFileOutputFormat.configureIncrementalLoad()).
>>
>> Approach #3
>> ===========
>> 1) Use Apache Sqoop (currently under incubation) to achieve my
>> requirement.
>>        - I'm not aware of the stability of this.
>>
>> Also, please suggest if there is a better approach than the above.
>
> --
> // Jonathan Hsieh (shay)
> // Software Engineer, Cloudera
> // jon@cloudera.com
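For concreteness, the bulk-load job that approaches #1 and #2 both end in
(HFileOutputFormat.configureIncrementalLoad()) could be wired up roughly
as below. This is only a minimal sketch against the 0.92-era HBase API:
the table name ("my_table"), the column family ("cf"), and the
tab-separated line format are invented placeholders, and the target table
is assumed to already exist with its region boundaries in place.

  import java.io.IOException;

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.client.Put;
  import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
  import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
  import org.apache.hadoop.hbase.util.Bytes;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.Mapper;
  import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
  import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
  import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

  public class OracleDumpBulkLoad {

    // Parses one line of a (hypothetical) tab-separated dump:
    // rowkey <TAB> value. A real export needs real parsing here.
    static class DumpLineMapper
        extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
      @Override
      protected void map(LongWritable offset, Text line, Context ctx)
          throws IOException, InterruptedException {
        String[] fields = line.toString().split("\t", 2);
        byte[] row = Bytes.toBytes(fields[0]);
        Put put = new Put(row);
        put.add(Bytes.toBytes("cf"), Bytes.toBytes("v"),
            Bytes.toBytes(fields[1]));
        ctx.write(new ImmutableBytesWritable(row), put);
      }
    }

    public static void main(String[] args) throws Exception {
      Configuration conf = HBaseConfiguration.create();
      Job job = new Job(conf, "oracle-dump-bulkload");
      job.setJarByClass(OracleDumpBulkLoad.class);

      job.setInputFormatClass(TextInputFormat.class);
      FileInputFormat.addInputPath(job, new Path(args[0])); // dump on HDFS
      job.setMapperClass(DumpLineMapper.class);
      job.setMapOutputKeyClass(ImmutableBytesWritable.class);
      job.setMapOutputValueClass(Put.class);

      // Sets up total-order partitioning and the sorting reducer so the
      // HFiles written to the staging dir line up with the table's regions.
      HTable table = new HTable(conf, "my_table");
      HFileOutputFormat.configureIncrementalLoad(job, table);
      FileOutputFormat.setOutputPath(job, new Path(args[1])); // HFile staging

      System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
  }

The HFiles staged under args[1] would then be handed to the
completebulkload tool (LoadIncrementalHFiles) so the region servers adopt
them directly, which is what makes this path so much cheaper than
row-by-row puts.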