hbase-user mailing list archives

From Laxman <lakshman...@huawei.com>
Subject RE: Faster Bulkload from Oracle to HBase
Date Wed, 01 Feb 2012 09:49:49 GMT
Jon & Tim, thanks a lot for your responses. That was really helpful.

We will do a comparison and publish our findings.


>> For our 300G and 340M rows, #1 takes us around 3 hours; with Sqoop it is
closer to 8 hrs on our 3-node cluster.


@Tim, can you please let me know whether you used Sqoop in conjunction with
Quest's data connector?

Quest claims 5x improvement with their connector.


@Jon, do you feel Approach #3 with Quest's connector may out-perform #1?



From: Jonathan Hsieh [mailto:jon@cloudera.com] 
Sent: Tuesday, January 31, 2012 11:51 AM
To: user@hbase.apache.org; lakshman.ch@huawei.com
Cc: Arvind Prabhakar
Subject: Re: Faster Bulkload from Oracle to HBase


Hi Laxman,


I'm an Apache HBase and Sqoop committer.  I haven't run the comparison
you've suggested but my first thoughts are to consider #1 and #3.  


Case 1 natively exports the data, which should be the fastest way to get data
out of Oracle. You may need to do some reformatting to fit HBase's bulk load
format.
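As a rough illustration, that reformatting step could be a mapper along the
following lines. This is an untested sketch: the comma delimiter, the column
layout, and the family/qualifier names ("cf", "col1", "col2") are assumptions
about the export format, not anything from this thread.

import java.io.IOException;

import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Parses one delimited line of the Oracle export and emits a Put keyed by
// row key, which is the (key, value) shape HFileOutputFormat expects.
public class OracleExportMapper
    extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {

  private static final byte[] CF = Bytes.toBytes("cf"); // assumed column family

  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    // Assumes a comma-delimited export with the primary key in field 0.
    String[] fields = line.toString().split(",");
    byte[] rowKey = Bytes.toBytes(fields[0]);
    Put put = new Put(rowKey);
    put.add(CF, Bytes.toBytes("col1"), Bytes.toBytes(fields[1]));
    put.add(CF, Bytes.toBytes("col2"), Bytes.toBytes(fields[2]));
    context.write(new ImmutableBytesWritable(rowKey), put);
  }
}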


Case 2, writing your own DBInputFormat job, essentially duplicates what Sqoop
does in the generic case. It executes SQL queries against the RDBMS to get
data out, which will likely be a bit slower than a database's native bulk
export feature.
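For concreteness, a hand-rolled DBInputFormat job looks roughly like the
sketch below (untested; the JDBC URL, credentials, table/column names, and
output path are placeholders, and MyRecord is a made-up record type):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.db.DBConfiguration;
import org.apache.hadoop.mapreduce.lib.db.DBInputFormat;
import org.apache.hadoop.mapreduce.lib.db.DBWritable;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class OracleImportJob {

  // Record type read from the database; must implement Writable + DBWritable.
  public static class MyRecord implements Writable, DBWritable {
    long id;
    String col1;

    public void readFields(ResultSet rs) throws SQLException {
      id = rs.getLong(1);
      col1 = rs.getString(2);
    }
    public void write(PreparedStatement ps) throws SQLException {
      ps.setLong(1, id);
      ps.setString(2, col1);
    }
    public void readFields(DataInput in) throws IOException {
      id = in.readLong();
      col1 = in.readUTF();
    }
    public void write(DataOutput out) throws IOException {
      out.writeLong(id);
      out.writeUTF(col1);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    DBConfiguration.configureDB(conf,
        "oracle.jdbc.driver.OracleDriver",      // JDBC driver class
        "jdbc:oracle:thin:@dbhost:1521:ORCL",   // placeholder connection URL
        "user", "password");

    Job job = new Job(conf, "oracle-import");
    job.setJarByClass(OracleImportJob.class);
    job.setInputFormatClass(DBInputFormat.class);
    // Splits the table into ID ranges and issues a SELECT per split.
    DBInputFormat.setInput(job, MyRecord.class,
        "SOURCE_TABLE", null /* conditions */, "ID" /* orderBy */,
        "ID", "COL1");

    // Default (identity) mapper: dump (offset, record) pairs to sequence files.
    job.setOutputKeyClass(LongWritable.class);
    job.setOutputValueClass(MyRecord.class);
    job.setOutputFormatClass(SequenceFileOutputFormat.class);
    SequenceFileOutputFormat.setOutputPath(job, new Path("/tmp/oracle-dump"));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}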


Case 3: consider using Apache Sqoop in conjunction with Quest's data
connector for Oracle and Hadoop. It is a free-as-in-beer plugin to Sqoop!
This is probably the fastest from a getting-started point of view (no dev
time) but may not be as performant as #1. I'd give it a try.


If you have more questions about Sqoop, feel free to cross-post to the Sqoop
user mailing list.



On Mon, Jan 30, 2012 at 5:48 AM, Laxman <lakshman.ch@huawei.com> wrote:


We have the following use-case.

We have data in relational database (Oracle).
We need to export this data to HBase and perform analysis on this data.
We need to perform this export-import 500G periodically, say every month.

Following are the different approaches I can see, to the best of my
knowledge. Before testing and finding out the best way by myself, I wanted to
hear from the experts here.

Approach #1
1) Export from Oracle to a raw text file (using the Oracle export utility -
faster, as it involves no transactional overhead)

2) Upload text file to HDFS

3) Run the bulk load job (HFileOutputFormat.configureIncrementalLoad())
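A minimal driver for step 3 might look like this (an untested sketch; it
reuses a mapper shaped like the OracleExportMapper sketched earlier in the
thread, and the table name and paths are placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class BulkLoadDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "oracle-to-hbase-bulkload");
    job.setJarByClass(BulkLoadDriver.class);

    // Mapper that turns exported text lines into Puts (see earlier sketch).
    job.setMapperClass(OracleExportMapper.class);
    job.setMapOutputKeyClass(ImmutableBytesWritable.class);
    job.setMapOutputValueClass(Put.class);

    FileInputFormat.addInputPath(job, new Path("/data/oracle-export"));
    Path hfileDir = new Path("/tmp/hfiles");
    FileOutputFormat.setOutputPath(job, hfileDir);

    // Wires in the TotalOrderPartitioner, PutSortReducer, and one reducer
    // per region so the generated HFiles line up with the table's regions.
    HTable table = new HTable(conf, "mytable");
    HFileOutputFormat.configureIncrementalLoad(job, table);

    if (job.waitForCompletion(true)) {
      // Move the generated HFiles into the live table's regions.
      new LoadIncrementalHFiles(conf).doBulkLoad(hfileDir, table);
    }
  }
}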

Approach #2
1) Write a custom Job using DBInputFormat to directly read from database.
       - Just a thought to avoid multiple hops (Oracle to local FS, local FS
to HDFS, HDFS to HBase) involved in approach #1.

2) Use the HBase bulk load tool to load this data into HBase.

Approach #3
1) Use Apache Sqoop (currently under incubation) to achieve my requirement.
       - I'm not aware of the stability of this.

Also, please let me know if there is a better approach than the above.


// Jonathan Hsieh (shay)
// Software Engineer, Cloudera

// jon@cloudera.com


