hbase-user mailing list archives

From Jerry He <jerry...@gmail.com>
Subject Re: ETL HBase HFile+HLog to ORC(or Parquet) file?
Date Sat, 22 Oct 2016 19:02:02 GMT
Hi, Demai

If you think something helpful can be done within HBase, feel free to
propose on the JIRA.

Jerry

On Fri, Oct 21, 2016 at 2:41 PM, Mich Talebzadeh <mich.talebzadeh@gmail.com>
wrote:

> Hi Demai,
>
> As I understand it, you want to use HBase as the real-time layer and the
> Hive data warehouse as the batch layer for analytics.
>
> In other words, ingest data in real time from the source into HBase and
> push that data into Hive on a recurring basis.
>
> If you partition your target ORC table by DtStamp and INSERT OVERWRITE
> into it using Spark as the execution engine for Hive (as opposed to
> MapReduce), it should be pretty fast.
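>
> For example, something like this (assumes your Hive build has Spark
> support enabled; the tables are the ones defined further down this
> thread):
>
>    SET hive.execution.engine=spark;
>    INSERT OVERWRITE TABLE marketData PARTITION (DateStamp = "${TODAY}")
>    SELECT key, ticker, timecreated, price FROM marketDataHbase;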
>
> Hive is going to get an in-memory database in the next release or so, so
> it is a perfect choice.
>
>
> HTH
>
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 21 October 2016 at 22:28, Demai Ni <nidmgg@gmail.com> wrote:
>
> > Mich,
> >
> > thanks for the detailed instructions.
> >
> > While I am aware of the Hive method, I have a few questions/concerns:
> > 1) the Hive method is an "INSERT FROM SELECT", which usually does not
> > perform as well as a bulk load, though I am not familiar with the actual
> > implementation
> > 2) I have another SQL-on-Hadoop engine working well with ORC files. So if
> > possible, I'd like to avoid a system dependency on Hive (one fewer
> > component to maintain).
> > 3) HBase already has well-established background processes for
> > Replication (HBASE-1295) and Backup (HBASE-7912), so I am wondering
> > whether anything can be piggy-backed on them to handle the day-to-day
> > work; a rough sketch of what I have in mind follows below.
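> >
> > For 3), something along these lines is what I imagine (untested sketch;
> > the endpoint class name is hypothetical and the ORC write step is left
> > as a placeholder):
> >
> >   import java.util.UUID;
> >   import org.apache.hadoop.hbase.Cell;
> >   import org.apache.hadoop.hbase.replication.BaseReplicationEndpoint;
> >   import org.apache.hadoop.hbase.wal.WAL;
> >
> >   // Hypothetical replication "peer" that receives WAL edits and hands
> >   // each Cell to an ORC writer instead of shipping it to another cluster.
> >   public class OrcReplicationEndpoint extends BaseReplicationEndpoint {
> >
> >     private final UUID peerUuid = UUID.randomUUID(); // stable sink id
> >
> >     @Override
> >     public UUID getPeerUUID() {
> >       return peerUuid; // identifies this sink to the source cluster
> >     }
> >
> >     @Override
> >     public boolean replicate(ReplicateContext context) {
> >       for (WAL.Entry entry : context.getEntries()) {
> >         for (Cell cell : entry.getEdit().getCells()) {
> >           // TODO: append the cell to an ORC VectorizedRowBatch here
> >         }
> >       }
> >       return true; // ack so the source advances its WAL position
> >     }
> >
> >     @Override
> >     protected void doStart() { notifyStarted(); }
> >
> >     @Override
> >     protected void doStop() { notifyStopped(); }
> >   }
> >
> > which, if I read the docs right, could then be registered from the shell
> > with add_peer 'orc_sink', ENDPOINT_CLASSNAME =>
> > 'com.example.OrcReplicationEndpoint'.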
> >
> > The goal is to have HBase as the OLTP front end (to receive data), and
> > the ORC files (with a SQL engine) as the OLAP end for reporting and
> > analytics. The ORC files will also serve as my backup in a DR scenario.
> >
> > Demai
> >
> >
> > On Fri, Oct 21, 2016 at 1:57 PM, Mich Talebzadeh <
> > mich.talebzadeh@gmail.com>
> > wrote:
> >
> > > Create an external table in Hive on the HBase table. Pretty
> > > straightforward.
> > >
> > > hive> create external table marketDataHbase (key STRING, ticker STRING,
> > >       timecreated STRING, price STRING)
> > >       STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
> > >       WITH SERDEPROPERTIES ("hbase.columns.mapping" =
> > >         ":key,price_info:ticker,price_info:timecreated,price_info:price")
> > >       TBLPROPERTIES ("hbase.table.name" = "marketDataHbase");
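> > >
> > > (As a quick sanity check of the column mapping, something like
> > > "select * from marketDataHbase limit 10;" should come back with rows.)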
> > >
> > >
> > >
> > > Then create a normal Hive table stored as ORC:
> > >
> > >
> > > CREATE TABLE IF NOT EXISTS marketData (
> > >      KEY string
> > >    , TICKER string
> > >    , TIMECREATED string
> > >    , PRICE float
> > > )
> > > PARTITIONED BY (DateStamp  string)
> > > STORED AS ORC
> > > TBLPROPERTIES (
> > > "orc.create.index"="true",
> > > "orc.bloom.filter.columns"="KEY",
> > > "orc.bloom.filter.fpp"="0.05",
> > > "orc.compress"="SNAPPY",
> > > "orc.stripe.size"="16777216",
> > > "orc.row.index.stride"="10000" )
> > > ;
> > > --show create table marketData;
> > > --Populate target table
> > > INSERT OVERWRITE TABLE marketData PARTITION (DateStamp = "${TODAY}")
> > > SELECT
> > >       KEY
> > >     , TICKER
> > >     , TIMECREATED
> > >     , PRICE
> > > FROM marketDataHbase;
> > >
> > >
> > > Run this job from cron every so often (daily, say), passing in the
> > > current date as TODAY.
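> > >
> > > For example, a daily crontab entry might look like this (illustrative
> > > path and JDBC URL; beeline's --hivevar feeds the ${TODAY} variable
> > > above, and % must be escaped as \% inside crontab):
> > >
> > >   0 1 * * * beeline -u jdbc:hive2://localhost:10000 --hivevar TODAY=$(date +\%Y-\%m-\%d) -f /etl/load_marketdata.hql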
> > >
> > >
> > > HTH
> > >
> > >
> > >
> > > Dr Mich Talebzadeh
> > >
> > >
> > >
> > > LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> > >
> > >
> > >
> > > http://talebzadehmich.wordpress.com
> > >
> > >
> > > *Disclaimer:* Use it at your own risk. Any and all responsibility for
> > > any loss, damage or destruction of data or any other property which may
> > > arise from relying on this email's technical content is explicitly
> > > disclaimed. The author will in no case be liable for any monetary
> > > damages arising from such loss, damage or destruction.
> > >
> > >
> > >
> > > On 21 October 2016 at 21:48, Demai Ni <nidmgg@gmail.com> wrote:
> > >
> > > > hi,
> > > >
> > > > I am wondering whether there are existing methods to ETL HBase data
> > > > to ORC (or other open-source columnar) files?
> > > >
> > > > I understand that in Hive, "INSERT INTO Hive_ORC_Table SELECT * FROM
> > > > Hive_HBase_Table" can probably get the job done. Is this the common
> > > > way to do so? Is the performance acceptable, and can it handle delta
> > > > updates when the HBase table changes?
> > > >
> > > > I did a bit of googling and found this:
> > > > https://community.hortonworks.com/questions/2632/loading-hbase-from-hive-orc-tables.html
> > > > which goes the other way around.
> > > >
> > > > Would it perform better (compared to the above Hive statement) to use
> > > > either the replication logic or snapshot backups to generate ORC files
> > > > from HBase tables, with incremental update ability?
> > > >
> > > > I hope to have as few dependencies as possible. In the case of ORC,
> > > > that would mean depending only on Apache ORC's API, and not on Hive.
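> > > >
> > > > For example, something along these lines with the core ORC writer API
> > > > (untested sketch; the schema mirrors the marketData table above, and
> > > > "result" stands for an HBase scan Result):
> > > >
> > > >   import org.apache.hadoop.conf.Configuration;
> > > >   import org.apache.hadoop.fs.Path;
> > > >   import org.apache.hadoop.hive.ql.exec.vector.BytesColumnVector;
> > > >   import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
> > > >   import org.apache.orc.OrcFile;
> > > >   import org.apache.orc.TypeDescription;
> > > >   import org.apache.orc.Writer;
> > > >
> > > >   // Write HBase rows straight into an ORC file, no Hive involved.
> > > >   TypeDescription schema = TypeDescription.fromString(
> > > >       "struct<key:string,ticker:string,timecreated:string,price:string>");
> > > >   Writer writer = OrcFile.createWriter(new Path("/tmp/marketdata.orc"),
> > > >       OrcFile.writerOptions(new Configuration()).setSchema(schema));
> > > >   VectorizedRowBatch batch = schema.createRowBatch();
> > > >
> > > >   // per scanned HBase row: copy the rowkey and cells into the batch
> > > >   int row = batch.size++;
> > > >   ((BytesColumnVector) batch.cols[0]).setVal(row, result.getRow());
> > > >   // ... fill cols[1..3] from the price_info:* cells the same way ...
> > > >   if (batch.size == batch.getMaxSize()) { // flush a full batch
> > > >     writer.addRowBatch(batch);
> > > >     batch.reset();
> > > >   }
> > > >
> > > >   writer.addRowBatch(batch); // flush the tail
> > > >   writer.close();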
> > > >
> > > > Demai
> > > >
> > >
> >
>
