hbase-user mailing list archives

From Mich Talebzadeh <mich.talebzadeh@gmail.com>
Subject Re: ETL HBase HFile+HLog to ORC(or Parquet) file?
Date Fri, 21 Oct 2016 21:41:57 GMT
Hi Demai,

As I understand it, you want to use HBase as the real-time layer and the Hive
data warehouse as the batch layer for analytics.

In other words: ingest data in real time from the source into HBase, and push
that data into Hive on a recurring basis.

If you partition your target ORC table by DtStamp and INSERT OVERWRITE into
this table using Spark as the execution engine for Hive (as opposed to
map-reduce), it should be pretty fast. A sketch follows below.
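
As a rough sketch, assuming Hive on Spark is already configured on the
cluster, and reusing the table names from the DDL further down the thread:

-- run the load through Spark rather than map-reduce for this session
SET hive.execution.engine=spark;

INSERT OVERWRITE TABLE marketData PARTITION (DateStamp = '2016-10-21')
SELECT key, ticker, timecreated, price
FROM marketDataHbase;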

Hive is going to get an in-memory database in the next release or so, which
makes it a perfect choice.


HTH




Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 21 October 2016 at 22:28, Demai Ni <nidmgg@gmail.com> wrote:

> Mich,
>
> thanks for the detail instructions.
>
> While I am aware of the Hive method, I have a few questions/concerns:
> 1) the Hive method is an "INSERT FROM SELECT", which usually does not
> perform as well as a bulk load, though I am not familiar with the actual
> implementation
> 2) I have another SQL-on-Hadoop engine working well with ORC files. So if
> possible, I'd like to avoid a system dependency on Hive (one fewer
> component to maintain).
> 3) HBase has well-running back-end processes for Replication (HBASE-1295)
> and Backup (HBASE-7912), so I am wondering whether anything can be
> piggy-backed on them to handle the day-to-day work
>
> The goal is to have HBase as the OLTP front end (to receive data), and the
> ORC files (with a SQL engine) as the OLAP end for reporting/analytics. The
> ORC files will also serve as my backup in the case of DR.
>
> Demai
>
>
> On Fri, Oct 21, 2016 at 1:57 PM, Mich Talebzadeh <mich.talebzadeh@gmail.com>
> wrote:
>
> > Create an external table in Hive on the HBase table. Pretty straightforward:
> >
> > hive> create external table marketDataHbase (key STRING, ticker STRING,
> > timecreated STRING, price STRING)
> >     STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
> >     WITH SERDEPROPERTIES ("hbase.columns.mapping" =
> > ":key,price_info:ticker,price_info:timecreated,price_info:price")
> >     TBLPROPERTIES ("hbase.table.name" = "marketDataHbase");
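> >
> > (Each entry in hbase.columns.mapping pairs positionally with a Hive
> > column: ":key" maps to the HBase row key and the rest are
> > column-family:qualifier pairs. A quick sanity check once the table
> > exists:)
> >
> > hive> select * from marketDataHbase limit 10;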
> >
> >
> >
> > Then create a normal Hive table stored as ORC:
> >
> >
> > CREATE TABLE IF NOT EXISTS marketData (
> >      KEY string
> >    , TICKER string
> >    , TIMECREATED string
> >    , PRICE float
> > )
> > PARTITIONED BY (DateStamp  string)
> > STORED AS ORC
> > TBLPROPERTIES (
> > "orc.create.index"="true",
> > "orc.bloom.filter.columns"="KEY",
> > "orc.bloom.filter.fpp"="0.05",
> > "orc.compress"="SNAPPY",
> > "orc.stripe.size"="16777216",
> > "orc.row.index.stride"="10000" )
> > ;
> > --show create table marketData;
> > --Populate target table
> > INSERT OVERWRITE TABLE marketData PARTITION (DateStamp = "${TODAY}")
> > SELECT
> >       KEY
> >     , TICKER
> >     , TIMECREATED
> >     , PRICE
> > FROM marketDataHbase;
> >
> >
> > Run this job via cron every so often, for example:
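> >
> > One way to drive the ${TODAY} substitution from cron (the script path
> > and schedule here are only illustrative):
> >
> > # crontab entry: run the load daily at 01:00 (% must be escaped in crontab)
> > 0 1 * * * hive --hivevar TODAY=$(date +\%Y-\%m-\%d) -f /opt/etl/load_marketdata.hql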
> >
> >
> > HTH
> >
> >
> >
> > Dr Mich Talebzadeh
> >
> >
> >
> > LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> >
> >
> >
> > http://talebzadehmich.wordpress.com
> >
> >
> > *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> > loss, damage or destruction of data or any other property which may arise
> > from relying on this email's technical content is explicitly disclaimed.
> > The author will in no case be liable for any monetary damages arising
> from
> > such loss, damage or destruction.
> >
> >
> >
> > On 21 October 2016 at 21:48, Demai Ni <nidmgg@gmail.com> wrote:
> >
> > > hi,
> > >
> > > I am wondering whether there are existing methods to ETL HBase data
> > > into ORC (or another open-source columnar) file format?
> > >
> > > I understand that in Hive, "INSERT INTO Hive_ORC_Table SELECT * FROM
> > > HIVE_HBase_Table" can probably get the job done. Is this the common way
> > > to do so? Is the performance acceptable, and can it handle the delta
> > > update in case the HBase table changes?
> > >
> > > I did a bit of googling, and found this:
> > > https://community.hortonworks.com/questions/2632/loading-hbase-from-hive-orc-tables.html
> > > which goes the other way around.
> > >
> > > Will it perform better (compared to the Hive statement above) to use
> > > either the replication logic or a snapshot backup to generate ORC files
> > > from HBase tables, with incremental update ability? (Sketch below.)
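> > >
> > > (For illustration, one snapshot-based starting point; the snapshot name
> > > is made up. A batch job could then read the snapshot offline through
> > > TableSnapshotInputFormat (HBASE-8369) instead of scanning through the
> > > region servers:)
> > >
> > > # take a named snapshot; cheap, as it only records references to HFiles
> > > echo "snapshot 'marketDataHbase', 'marketData_snap'" | hbase shell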
> > >
> > > I hope to have as few dependencies as possible. In the example of ORC,
> > > the job would depend only on Apache ORC's API, and not on Hive;
> > > something like the sketch below.
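> > >
> > > (Sketching what that could look like with the ORC core writer API; the
> > > class name and output path are made up, and the rows would really come
> > > from an HBase scan or snapshot read:)
> > >
> > > import org.apache.hadoop.conf.Configuration;
> > > import org.apache.hadoop.fs.Path;
> > > import org.apache.hadoop.hive.ql.exec.vector.BytesColumnVector;
> > > import org.apache.hadoop.hive.ql.exec.vector.DoubleColumnVector;
> > > import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
> > > import org.apache.orc.OrcFile;
> > > import org.apache.orc.TypeDescription;
> > > import org.apache.orc.Writer;
> > >
> > > public class HBaseRowsToOrc {
> > >   public static void main(String[] args) throws Exception {
> > >     // schema mirrors the marketData table from earlier in the thread
> > >     TypeDescription schema = TypeDescription.fromString(
> > >         "struct<key:string,ticker:string,timecreated:string,price:double>");
> > >     Writer writer = OrcFile.createWriter(new Path("marketdata.orc"),
> > >         OrcFile.writerOptions(new Configuration()).setSchema(schema));
> > >     VectorizedRowBatch batch = schema.createRowBatch();
> > >     // stand-in data; a real job would iterate over an HBase scan here
> > >     String[][] rows = {{"r1", "IBM", "2016-10-21 10:00:00", "154.2"}};
> > >     for (String[] row : rows) {
> > >       int n = batch.size++;
> > >       ((BytesColumnVector) batch.cols[0]).setVal(n, row[0].getBytes("UTF-8"));
> > >       ((BytesColumnVector) batch.cols[1]).setVal(n, row[1].getBytes("UTF-8"));
> > >       ((BytesColumnVector) batch.cols[2]).setVal(n, row[2].getBytes("UTF-8"));
> > >       ((DoubleColumnVector) batch.cols[3]).vector[n] = Double.parseDouble(row[3]);
> > >       if (batch.size == batch.getMaxSize()) {
> > >         writer.addRowBatch(batch);  // flush a full batch and reuse it
> > >         batch.reset();
> > >       }
> > >     }
> > >     if (batch.size > 0) {
> > >       writer.addRowBatch(batch);  // flush the partial final batch
> > >     }
> > >     writer.close();  // writes the footer; file is readable only after this
> > >   }
> > > }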
> > >
> > > Demai
> > >
> >
>
