hbase-user mailing list archives

From Demai Ni <nid...@gmail.com>
Subject Re: ETL HBase HFile+HLog to ORC(or Parquet) file?
Date Sun, 23 Oct 2016 03:33:40 GMT
Jerry and Mich,

Thanks. I will look a bit more into this; it is probably an interesting and
useful feature to have.

Demai

On Sat, Oct 22, 2016 at 12:02 PM, Jerry He <jerryjch@gmail.com> wrote:

> Hi, Demai
>
> If you think something helpful can be done within HBase, feel free to
> propose on the JIRA.
>
> Jerry
>
> On Fri, Oct 21, 2016 at 2:41 PM, Mich Talebzadeh <
> mich.talebzadeh@gmail.com>
> wrote:
>
> > Hi Demai,
> >
> > As I understand it, you want to use HBase as the real-time layer and the
> > Hive Data Warehouse as the batch layer for analytics.
> >
> > In other words, ingest data in real time from the source into HBase and
> > push that data into Hive on a recurring basis.
> >
> > If you partition your target ORC table with DtStamp and INSERT/OVERWRITE
> > into this table using Spark as the execution engine for Hive (as opposed
> > to map-reduce), it should be pretty fast.
> >
> > Hive is going to get an in-memory database in the next release or so, so
> > it is a perfect choice.
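> > A minimal sketch of that recurring load, assuming Hive on Spark is
> > already configured; the table and column names follow the DDL quoted
> > later in this thread, and the DateStamp value is a placeholder:

```sql
-- Hypothetical recurring load into the partitioned ORC table (Hive on Spark).
SET hive.execution.engine=spark;
INSERT OVERWRITE TABLE marketData PARTITION (DateStamp = '2016-10-21')
SELECT key, ticker, timecreated, price
FROM marketDataHbase;
```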
> >
> >
> > HTH
> >
> >
> >
> >
> > Dr Mich Talebzadeh
> >
> >
> >
> > LinkedIn *https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw*
> >
> >
> >
> > http://talebzadehmich.wordpress.com
> >
> >
> > *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> > loss, damage or destruction of data or any other property which may arise
> > from relying on this email's technical content is explicitly disclaimed.
> > The author will in no case be liable for any monetary damages arising
> from
> > such loss, damage or destruction.
> >
> >
> >
> > On 21 October 2016 at 22:28, Demai Ni <nidmgg@gmail.com> wrote:
> >
> > > Mich,
> > >
> > > thanks for the detail instructions.
> > >
> > > While I am aware of the Hive method, I have a few questions/concerns:
> > > 1) the Hive method is an "INSERT FROM SELECT", which usually does not
> > > perform as well as a bulk load, though I am not familiar with the real
> > > implementation.
> > > 2) I have another SQL-on-Hadoop engine working well with ORC files, so
> > > if possible I'd like to avoid the system dependency on Hive (one fewer
> > > component to maintain).
> > > 3) HBase has well-established back-end processes for Replication
> > > (HBASE-1295) and Backup (HBASE-7912), so I am wondering whether
> > > anything can be piggy-backed on them to handle the day-to-day work.
> > >
> > > The goal is to have HBase as an OLTP front end (to receive data) and
> > > the ORC file (with a SQL engine) as the OLAP end for
> > > reporting/analytics. The ORC file will also serve as my backup in the
> > > DR case.
> > >
> > > Demai
> > >
> > >
> > > On Fri, Oct 21, 2016 at 1:57 PM, Mich Talebzadeh <
> > > mich.talebzadeh@gmail.com>
> > > wrote:
> > >
> > > > Create an external table in Hive on the HBase table. Pretty
> > > > straightforward:
> > > >
> > > > hive> CREATE EXTERNAL TABLE marketDataHbase (key STRING, ticker STRING,
> > > > timecreated STRING, price STRING)
> > > > STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
> > > > WITH SERDEPROPERTIES ("hbase.columns.mapping" =
> > > > ":key,price_info:ticker,price_info:timecreated,price_info:price")
> > > > TBLPROPERTIES ("hbase.table.name" = "marketDataHbase");
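> > > > A quick sanity check that the mapping works (a hypothetical query;
> > > > column names follow the DDL above):

```sql
-- Verify the HBase-backed external table returns rows before the real load.
SELECT key, ticker, timecreated, price
FROM marketDataHbase
LIMIT 10;
```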
> > > >
> > > >
> > > >
> > > > Then create a normal table in Hive stored as ORC:
> > > >
> > > >
> > > > CREATE TABLE IF NOT EXISTS marketData (
> > > >      KEY string
> > > >    , TICKER string
> > > >    , TIMECREATED string
> > > >    , PRICE float
> > > > )
> > > > PARTITIONED BY (DateStamp  string)
> > > > STORED AS ORC
> > > > TBLPROPERTIES (
> > > > "orc.create.index"="true",
> > > > "orc.bloom.filter.columns"="KEY",
> > > > "orc.bloom.filter.fpp"="0.05",
> > > > "orc.compress"="SNAPPY",
> > > > "orc.stripe.size"="16777216",
> > > > "orc.row.index.stride"="10000" )
> > > > ;
> > > > --show create table marketData;
> > > > --Populate target table
> > > > INSERT OVERWRITE TABLE marketData PARTITION (DateStamp = "${TODAY}")
> > > > SELECT
> > > >       KEY
> > > >     , TICKER
> > > >     , TIMECREATED
> > > >     , PRICE
> > > > FROM MarketDataHbase
> > > >
> > > >
> > > > Run this job via cron as often as needed.
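> > > > A hypothetical cron wrapper for that job (the script name, path, and
> > > > schedule are assumptions; it also assumes the hive CLI is on PATH):

```shell
#!/bin/sh
# load_marketdata.sh -- hypothetical daily loader for the ORC table.
# Example crontab entry: 30 0 * * * /opt/etl/load_marketdata.sh
TODAY=$(date +%Y-%m-%d)
hive --hivevar TODAY="$TODAY" -e "
  INSERT OVERWRITE TABLE marketData PARTITION (DateStamp = '\${TODAY}')
  SELECT key, ticker, timecreated, price FROM marketDataHbase;"
```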
> > > >
> > > >
> > > > HTH
> > > >
> > > >
> > > > On 21 October 2016 at 21:48, Demai Ni <nidmgg@gmail.com> wrote:
> > > >
> > > > > hi,
> > > > >
> > > > > I am wondering whether there are existing methods to ETL HBase
> > > > > data to ORC (or another open-source columnar) file format?
> > > > >
> > > > > I understand that in Hive, "INSERT INTO Hive_ORC_Table SELECT *
> > > > > FROM Hive_HBase_Table" can probably get the job done. Is this the
> > > > > common way to do so? Is the performance acceptable, and can it
> > > > > handle delta updates when the HBase table changes?
> > > > >
> > > > > I did a bit of googling and found this:
> > > > > https://community.hortonworks.com/questions/2632/loading-hbase-from-hive-orc-tables.html
> > > > >
> > > > > which goes the other way around.
> > > > >
> > > > > Would it perform better (compared to the Hive statement above) to
> > > > > use either the replication logic or the snapshot backup to generate
> > > > > ORC files from HBase tables, with incremental update ability?
> > > > >
> > > > > I hope to have as few dependencies as possible. In the ORC example,
> > > > > it would only depend on Apache ORC's API, and not on Hive.
> > > > >
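> > > > > For that Hive-free path, a rough sketch using only Apache ORC's
> > > > > core API (orc-core, which bundles the storage-api vector classes)
> > > > > plus the HBase client; the file name, schema, and scan details are
> > > > > assumptions, not a tested pipeline:

```java
// Hypothetical sketch: scan an HBase table and write its rows to an ORC
// file using the Apache ORC core API directly (no Hive dependency).
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.hive.ql.exec.vector.BytesColumnVector;
import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
import org.apache.orc.OrcFile;
import org.apache.orc.TypeDescription;
import org.apache.orc.Writer;

public class HBaseToOrc {
  public static void main(String[] args) throws Exception {
    // Schema mirrors the Hive DDL earlier in the thread.
    TypeDescription schema = TypeDescription.fromString(
        "struct<key:string,ticker:string,timecreated:string,price:string>");
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table table = conn.getTable(TableName.valueOf("marketDataHbase"));
         ResultScanner scanner = table.getScanner(new Scan())) {
      Writer writer = OrcFile.createWriter(new Path("marketdata.orc"),
          OrcFile.writerOptions(conf).setSchema(schema));
      VectorizedRowBatch batch = schema.createRowBatch();
      byte[] cf = Bytes.toBytes("price_info");
      for (Result r : scanner) {
        // Assumes every row carries all three columns; missing cells
        // would need null handling in a real pipeline.
        int row = batch.size++;
        ((BytesColumnVector) batch.cols[0]).setVal(row, r.getRow());
        ((BytesColumnVector) batch.cols[1]).setVal(row,
            r.getValue(cf, Bytes.toBytes("ticker")));
        ((BytesColumnVector) batch.cols[2]).setVal(row,
            r.getValue(cf, Bytes.toBytes("timecreated")));
        ((BytesColumnVector) batch.cols[3]).setVal(row,
            r.getValue(cf, Bytes.toBytes("price")));
        if (batch.size == batch.getMaxSize()) {
          writer.addRowBatch(batch);
          batch.reset();
        }
      }
      if (batch.size > 0) writer.addRowBatch(batch);
      writer.close();
    }
  }
}
```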
> > > > > Demai
> > > > >
> > > >
> > >
> >
>
