hive-user mailing list archives

From Furcy Pin <furcy....@flaminem.com>
Subject Re: Hive metadata on Hbase
Date Wed, 26 Oct 2016 06:52:07 GMT
Hi Mich,

No, I am not using HBase as a metastore now, but I am eager for it to
become production ready and released in CDH and HDP.

Concerning locks, I think HBase would do fine because it is ACID at the row
level. It only appends data on HDFS, but it buffers recent writes in memory
(the MemStore) and keeps a write-ahead log for failure recovery, so updates
to a row are atomic.
This gives ACID guarantees across elements that are stored in the same row.
Since HBase supports a great number of dynamic columns in each row
(wide-column store, like Cassandra), the
smart way to design your tables is quite different from an RDBMS.
I would expect that they will have something like an HBase table with one
row per Hive table, holding all the data associated with it. This would make
all modifications on a table atomic.
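
To make this concrete, here is a minimal sketch of the idea in Python. It is
a toy in-memory model, not the HBase client API and not the actual metastore
schema: ToyRowStore, mutate_row, get_row and the column names are all made up
for illustration. The only property it mirrors is that every change to one
row is applied together, under a per-row lock.

```python
import threading
from collections import defaultdict


class ToyRowStore:
    """Toy stand-in for a wide-column table with row-level atomicity."""

    def __init__(self):
        self._rows = defaultdict(dict)                 # row key -> {column: value}
        self._row_locks = defaultdict(threading.Lock)  # one lock per row key
        self._guard = threading.Lock()                 # protects the two dicts above

    def _lock_for(self, row_key):
        with self._guard:
            return self._row_locks[row_key]

    def mutate_row(self, row_key, puts=None, deletes=()):
        """Apply all puts and deletes to one row as a single atomic step."""
        with self._lock_for(row_key):
            with self._guard:
                row = self._rows[row_key]
            for column in deletes:
                row.pop(column, None)
            row.update(puts or {})

    def get_row(self, row_key):
        """Return a snapshot of one row, never a half-applied mutation."""
        with self._lock_for(row_key):
            with self._guard:
                return dict(self._rows.get(row_key, {}))


# Hypothetical usage: one row per Hive table, one column per piece of metadata.
store = ToyRowStore()
store.mutate_row("default.sales",
                 puts={"sd:location": "/warehouse/sales", "part:2016-10-25": "p1"})
# Swap one partition for another in a single step on the same row.
store.mutate_row("default.sales",
                 puts={"part:2016-10-26": "p2"}, deletes=["part:2016-10-25"])
print(store.get_row("default.sales"))
```

A reader of "default.sales" sees either the row before the partition swap or
the row after it, which is the guarantee I would hope to get from keeping
everything about a table in one row.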

Concerning locks, as they involve multiple tables, I guess they would have
to manually put a global lock on the "hbase lock table" before editing it.
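
One way such a global lock could work is a compare-and-set on a single
well-known cell, similar in spirit to HBase's check-and-put operations. Below
is a sketch under that assumption, again as a toy in-memory model:
ToyLockTable, try_acquire, release and force_release are hypothetical names,
and none of this is taken from the actual HBase metastore code.

```python
import threading


class ToyLockTable:
    """Toy model of a single 'global lock' cell updated by compare-and-set."""

    def __init__(self):
        self._owner = None                   # None means the lock is free
        self._cell_guard = threading.Lock()  # makes each compare-and-set atomic

    def _check_and_put(self, expected, new):
        """Set the owner to `new` only if it currently equals `expected`."""
        with self._cell_guard:
            if self._owner != expected:
                return False
            self._owner = new
            return True

    def try_acquire(self, client_id):
        """Take the lock if it is free; return False if someone else holds it."""
        return self._check_and_put(None, client_id)

    def release(self, client_id):
        """Free the lock, but only if `client_id` is the current owner."""
        return self._check_and_put(client_id, None)

    def force_release(self):
        """Manual cleanup of a stuck lock; returns the previous owner."""
        with self._cell_guard:
            previous, self._owner = self._owner, None
            return previous


# Hypothetical usage: two clients competing for the global lock.
lock = ToyLockTable()
print(lock.try_acquire("client-a"))
print(lock.try_acquire("client-b"))
print(lock.release("client-a"))
```

The force_release method stands in for the kind of manual repair an admin
does today directly on the RDBMS lock tables.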

I agree that you should not touch the system tables too much, but sometimes
you have to remove the deadlock or fix an inconsistency yourself. I guess
removing deadlocks in HBase should not be much harder using the
hbase-shell (though there is new syntax to learn).

It would be nice if Hive had some syntax to manually remove deadlocks when
they happen; then you would not have to worry about the metastore
implementation.



On Wed, Oct 26, 2016 at 12:58 AM, Mich Talebzadeh <mich.talebzadeh@gmail.com> wrote:

> Hi Furcy,
>
> Having used HBase for part of the batch layer in a Lambda Architecture, I have
> come to the conclusion that it is a very good product, even though its
> cryptic nature means it is not much loved or appreciated. However,
> it may be useful to have a Hive metastore skin on top of HBase tables so
> admins and others can interrogate HBase tables. There is definitely a need
> for some sort of interface to the Hive metastore on HBase, whether through Hive
> or Phoenix.
>
> Then we still have to handle locks and concurrency on metastore tables.
> An RDBMS is transactional and ACID compliant. I do not know enough about
> HBase. As far as I know, HBase appends data. Currently, when I have an issue
> with transactions and locks, I go to the metadata and do some plastic surgery on
> the TRXN and LOCKS tables, which resolves the issue. I am not sure how I am going
> to achieve that in HBase. Purists might argue that one should not touch
> these system tables, but things are not generally that simple.
>
> Are you using HBase as the Hive metastore now?
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 25 October 2016 at 13:44, Furcy Pin <furcy.pin@flaminem.com> wrote:
>
>> Hi Mich,
>>
>> I mostly agree with you, but I would comment on the part about using
>> HBase as a maintenance-free core product:
>> I would say that most medium-sized companies using Hadoop rely on Hortonworks or
>> Cloudera, both of which provide a pre-packaged HBase installation. It would
>> probably make sense for them to ship pre-installed versions of Hive relying
>> on HBase as the metastore.
>> And as Alan stated, it would also be a good way to improve the
>> integration between Hive and HBase.
>>
>> I am not well placed to give an opinion on this, but I agree that
>> maintaining integration with both HBase and regular RDBMSs might be a
>> real pain.
>> I am also worried that, if HBase does indeed give us the possibility of
>> having all nodes call the metastore, any optimization making use of this
>> will only work for a cluster with a Hive metastore on HBase.
>>
>> Anyway, I am still looking forward to this because, even though we are a small
>> company, our metastore sometimes seems to be a bottleneck, especially
>> when running more than 20 queries on tables with 10 000 partitions...
>> But perhaps migrating it to a bigger host would be enough for us...
>>
>>
>>
>> On Mon, Oct 24, 2016 at 10:21 PM, Mich Talebzadeh <
>> mich.talebzadeh@gmail.com> wrote:
>>
>>> Thanks, Alan, for the detailed explanation.
>>>
>>> Please bear in mind what any tool that works with a repository does
>>> (Oracle TimesTen IMDB has its metastore on Oracle classic, SAP
>>> Replication Server has its repository, the RSSD, on SAP ASE, and there
>>> are others): the first thing they do is cache those tables and keep them
>>> in the memory of the big brother database until they are shut down. I
>>> reverse engineered the Hive data model from the physical schema (on
>>> Oracle). There are around 194 tables in total, which can easily be cached.
>>>
>>> Small and medium enterprises (SMEs) don't really have much data, so
>>> anything will do, and they are the ones that use open source databases.
>>> Bigger companies already pay bucks for Oracle and the like, and they are
>>> the ones that would not touch an open source database (not talking about big
>>> data) because, in this new capital-sensitive, risk-averse world, they do
>>> not want to expose themselves to unnecessary risk. So I am not sure
>>> whether they will take something like HBase as a core product, unless it is
>>> going to be maintenance free.
>>>
>>> Going back to your point
>>>
>>> ".. but you have to pay for an expensive commercial license to make the
>>> metadata really work well is a non-starter"
>>>
>>> They already do, and will pay more if they have to. We will stick with Hive
>>> metadata on Oracle with the schema on SSD.
>>>
>>> HTH
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>> On 24 October 2016 at 20:14, Alan Gates <alanfgates@gmail.com> wrote:
>>>
>>>> Some thoughts on this:
>>>>
>>>> First, there’s no plan to remove the option to use an RDBMS such as
>>>> Oracle as your backend.  Hive’s RawStore interface is built such that
>>>> various implementations of the metadata storage can easily coexist.
>>>> Obviously different users will make different choices about what metadata
>>>> store makes sense for them.
>>>>
>>>> As to why HBase:
>>>> 1) We desperately need to get rid of the ORM layer.  It’s causing us
>>>> performance problems, as evidenced by things like it taking several minutes
>>>> to fetch all of the partition data for queries that span many partitions.
>>>> HBase is a way to achieve this, not the only way.  See in particular
>>>> Yahoo’s work on optimizing Oracle access:
>>>> https://issues.apache.org/jira/browse/HIVE-14870
>>>> The question around this is whether we can
>>>> optimize for Oracle, MySQL, Postgres, and SQLServer without creating a
>>>> maintenance and testing nightmare for ourselves.  I’m skeptical, but others
>>>> think it’s possible.  See comments on that JIRA.
>>>>
>>>> 2) We’d like to scale to much larger sizes, both in terms of data and
>>>> access from nodes.  Not that we’re worried about the amount of metadata,
>>>> but we’d like to be able to cache more stats, file splits, etc.  And we’d
>>>> like to allow nodes in the cluster to contact the metastore, which we do
>>>> not today since many RDBMSs don’t handle a thousand plus simultaneous
>>>> connections well.  Obviously both data and connection scale can be met with
>>>> high end commercial stores.  But saying that we have this great open source
>>>> database but you have to pay for an expensive commercial license to make
>>>> the metadata really work well is a non-starter.
>>>>
>>>> 3) By using tools within the Hadoop ecosystem like HBase we are helping
>>>> to drive improvement in the system.
>>>>
>>>> To explain the HBase work a little more, it doesn’t use Phoenix, but
>>>> works directly against HBase, with the help of a transaction manager
>>>> (Omid).  In performance tests we’ve done so far it’s faster than Hive 1
>>>> with the ORM layer, but not yet to the 10x range that we’d like to see.
>>>> We haven’t yet done the work to put in co-processors and such that we
>>>> expect would speed it up further.
>>>>
>>>> Alan.
>>>>
>>>> > On Oct 23, 2016, at 15:46, Mich Talebzadeh <mich.talebzadeh@gmail.com> wrote:
>>>> >
>>>> > A while back there were some notes on having the Hive metastore on HBase
>>>> > as opposed to conventional RDBMSs.
>>>> >
>>>> > I am currently involved in some hefty work with HBase and Phoenix
>>>> > for batch ingestion of trade data. As long as you define your HBase table
>>>> > through Phoenix, with secondary Phoenix indexes on HBase, the speed is
>>>> > impressive.
>>>> >
>>>> > I am not sure how much having HBase as the Hive metastore is going to add
>>>> > to Hive performance. We use Oracle 12c as the Hive metastore, and the Hive
>>>> > database/schema is built on solid state disks. We have never had any
>>>> > issues with locks and concurrency.
>>>> >
>>>> > Therefore I am not sure what one is going to gain by having HBase as
>>>> > the Hive metastore. I trust that we can still use our existing schemas on
>>>> > Oracle.
>>>> >
>>>> > HTH
>>>> >
>>>> >
>>>> >
>>>> > Dr Mich Talebzadeh
>>>> >
