Subject: Re: Faster Bulkload from Oracle to HBase
From: Tim Robertson <timrobertson100@gmail.com>
To: user@hbase.apache.org
Date: Tue, 31 Jan 2012 09:19:01 +0100

Hi Laxman,

We use both #1 and #3 from MySQL, which also has high-speed exports. For
our 300G and 340M rows, #1 takes us around 3 hours; with Sqoop it is
closer to 8 hours into our 3-node cluster. We are having issues with
delimiters though (since we have \r, \t and \n in the database), and are
now using Avro to overcome these. A colleague of mine is currently working
on two patches for Sqoop to fix some bugs in this area.

For the majority of our processing we don't use HBase but go straight to
HDFS, and we use Sqoop wrapped up in Oozie workflows, which is working
very nicely. We do a lot of ETL processing from one DB to another through
an Oozie workflow combining Sqoop and Hive.

HTH,
Tim

On Tue, Jan 31, 2012 at 7:21 AM, Jonathan Hsieh wrote:
> Hi Laxman,
>
> I'm an Apache HBase and Sqoop committer. I haven't run the comparison
> you've suggested, but my first thoughts are to consider #1 and #3.
>
> Case 1 will natively export data, which should be the fastest way to get
> data out of Oracle. You may need to do some reformatting to use HBase's
> importtsv.
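As a rough illustration of that last step, an importtsv run that writes
HFiles and then completes the bulk load might look like the two commands
below; the table name, column mapping, and paths are invented for
illustration, and <version> stands in for whatever HBase jar is installed:

  hadoop jar $HBASE_HOME/hbase-<version>.jar importtsv \
    -Dimporttsv.columns=HBASE_ROW_KEY,cf:col1,cf:col2 \
    -Dimporttsv.bulk.output=/tmp/hfiles \
    my_table /user/me/oracle_dump

  hadoop jar $HBASE_HOME/hbase-<version>.jar completebulkload \
    /tmp/hfiles my_table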
>
> Case 2, writing your own DBInputFormat, is essentially duplicating what
> Sqoop does in the generic case. Here it is essentially executing SQL
> queries against the RDBMS to get data out, which will likely be a bit
> slower than a database's native bulk export feature.
>
> Case 3, consider using Apache Sqoop in conjunction with Quest's data
> connector for Oracle and Hadoop. It is a free-as-in-beer plugin to Sqoop!
> This is probably the fastest from a getting-started point of view (no dev
> time) but may not be as performant as #1. I'd give it a try.
> http://www.quest.com/data-connector-for-oracle-and-hadoop/
>
> If you have more questions about Sqoop, feel free to cross-post to
> sqoop-user@incubator.apache.org!
>
> Jon.
>
> On Mon, Jan 30, 2012 at 5:48 AM, Laxman wrote:
>
>> Hi,
>>
>> We have the following use case.
>>
>> We have data in a relational database (Oracle).
>> We need to export this data to HBase and perform analysis on it.
>> We need to perform this export/import of 500G periodically, say every
>> month.
>>
>> Following are the different approaches I can see, as per my knowledge.
>> Before testing and finding out the best way by myself, I wanted to
>> listen to the experts here.
>>
>> Approach #1
>> ===========
>> 1) Export from Oracle to a raw text file (using the Oracle export
>> utility - faster, involves no transactional overhead)
>>
>> 2) Upload the text file to HDFS
>>
>> 3) Run the bulk load job (HFileOutputFormat.configureIncrementalLoad())
>>
>> Approach #2
>> ===========
>> 1) Write a custom job using DBInputFormat to read directly from the
>> database.
>>        - Just a thought, to avoid the multiple hops (Oracle to local FS,
>> local FS to HDFS, HDFS to HBase) involved in approach #1.
>>
>> 2) Use the HBase bulk load tool to load this data into HBase
>> (HFileOutputFormat.configureIncrementalLoad()).
>>
>> Approach #3
>> ===========
>> 1) Use Apache Sqoop (currently under incubation) to achieve my
>> requirement.
>>        - I'm not aware of the stability of this.
>>
>> Also, please suggest if there is a better approach than the above.
>
> --
> // Jonathan Hsieh (shay)
> // Software Engineer, Cloudera
> // jon@cloudera.com
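For concreteness, the bulk-load job that approaches #1 and #2 both end in
(HFileOutputFormat.configureIncrementalLoad()) could be wired up roughly
as below. This is only a minimal sketch against the 0.92-era HBase API:
the table name ("my_table"), the column family ("cf"), and the
tab-separated line format are invented placeholders, and the target table
is assumed to already exist with its region boundaries in place.

  import java.io.IOException;

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.client.Put;
  import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
  import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
  import org.apache.hadoop.hbase.util.Bytes;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.Mapper;
  import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
  import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
  import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

  public class OracleDumpBulkLoad {

    // Parses one line of a (hypothetical) tab-separated dump:
    // rowkey <TAB> value. A real export needs real parsing here.
    static class DumpLineMapper
        extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
      @Override
      protected void map(LongWritable offset, Text line, Context ctx)
          throws IOException, InterruptedException {
        String[] fields = line.toString().split("\t", 2);
        byte[] row = Bytes.toBytes(fields[0]);
        Put put = new Put(row);
        put.add(Bytes.toBytes("cf"), Bytes.toBytes("v"),
            Bytes.toBytes(fields[1]));
        ctx.write(new ImmutableBytesWritable(row), put);
      }
    }

    public static void main(String[] args) throws Exception {
      Configuration conf = HBaseConfiguration.create();
      Job job = new Job(conf, "oracle-dump-bulkload");
      job.setJarByClass(OracleDumpBulkLoad.class);

      job.setInputFormatClass(TextInputFormat.class);
      FileInputFormat.addInputPath(job, new Path(args[0])); // dump on HDFS
      job.setMapperClass(DumpLineMapper.class);
      job.setMapOutputKeyClass(ImmutableBytesWritable.class);
      job.setMapOutputValueClass(Put.class);

      // Sets up total-order partitioning and the sorting reducer so the
      // HFiles written to the staging dir line up with the table's regions.
      HTable table = new HTable(conf, "my_table");
      HFileOutputFormat.configureIncrementalLoad(job, table);
      FileOutputFormat.setOutputPath(job, new Path(args[1])); // HFile staging

      System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
  }

The HFiles staged under args[1] would then be handed to the
completebulkload tool (LoadIncrementalHFiles) so the region servers adopt
them directly, which is what makes this path so much cheaper than
row-by-row puts.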