hbase-dev mailing list archives

From Michael Segel <michael_se...@hotmail.com>
Subject Re: [ANNOUNCE] Secondary Index in HBase - from Huawei
Date Wed, 14 Aug 2013 19:53:58 GMT
Vladimir,

I wasn't talking about anything outside of HBase. 

The point I was trying to make was that if you are going to use an inverted table as your
index, managing your index at the RS level is going to bite you in the ass and will cause
more headaches down the road. 

This is being done because they want to avoid the overhead of RPC calls. But you're in a distributed
database where RPC is part of the ecosystem and it's something that you have to deal with.
(And you can do some basic design to decouple the write to the index from the write to the
base table.)
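
A rough sketch of the kind of decoupling I mean, using the plain 0.94 client. Every table,
family and qualifier name below is made up, and real code would need retry/error handling on
the index write:

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class DecoupledIndexWrite {
  private final HTable base;    // the data table ("tbl_foo" is made up)
  private final HTable index;   // a plain inverted table, e.g. "tbl_foo_idx_A"
  private final ExecutorService indexWriter = Executors.newSingleThreadExecutor();

  public DecoupledIndexWrite(Configuration conf) throws Exception {
    base  = new HTable(conf, "tbl_foo");
    index = new HTable(conf, "tbl_foo_idx_A");
  }

  public void write(final String rowKey, final String valueA) throws Exception {
    // 1. write the base row synchronously
    Put basePut = new Put(Bytes.toBytes(rowKey));
    basePut.add(Bytes.toBytes("d"), Bytes.toBytes("A"), Bytes.toBytes(valueA));
    base.put(basePut);

    // 2. queue the index update so the caller never blocks on the second RPC
    indexWriter.submit(new Runnable() {
      public void run() {
        try {
          // index row key = indexed value + base row key
          Put idxPut = new Put(Bytes.toBytes(valueA + "|" + rowKey));
          idxPut.add(Bytes.toBytes("d"), Bytes.toBytes("k"), Bytes.toBytes(rowKey));
          index.put(idxPut);
        } catch (Exception e) {
          // a real implementation would retry or at least log the failed index write
        }
      }
    });
  }
}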

In addition to this, the use of an inverted table is just one of the options you have for
a secondary index. You could also look at Lucene, which we used for a PoC a few years back.

Also, beyond secondary indexing, there are issues with coprocessors in general that should
be addressed.
But that's a different story.

Please don't misunderstand: secondary indexing is a very important thing, but tying the
index to the region is the wrong path.

When you look at trying to integrate it into Phoenix, you'll start to see the problems…

Hint: 

SELECT * FROM tbl_foo WHERE tbl_foo.A = Something AND tbl_foo.B = SomethingElse

This is still pretty straightforward, since you can take the sort-ordered intersection per
RS.
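
Per RS that intersection is cheap, because each index scan already hands back its row keys in
sort order, so it's a single merge pass. A minimal sketch in plain Java (the two lists stand in
for whatever the index scans return; nothing here is a real API):

import java.util.ArrayList;
import java.util.List;

class IndexIntersection {
  // aKeys / bKeys: base-table row keys returned by the two index scans on one RS,
  // already in sort order, so the intersection is a single merge pass.
  static List<String> sortedIntersection(List<String> aKeys, List<String> bKeys) {
    List<String> out = new ArrayList<String>();
    int i = 0, j = 0;
    while (i < aKeys.size() && j < bKeys.size()) {
      int c = aKeys.get(i).compareTo(bKeys.get(j));
      if (c < 0) {
        i++;                    // row only matched the A predicate
      } else if (c > 0) {
        j++;                    // row only matched the B predicate
      } else {
        out.add(aKeys.get(i));  // row satisfies both predicates
        i++;
        j++;
      }
    }
    return out;
  }
}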

But then if you have the following:

SELECT *
FROM   tbl_foo, tbl_bar
WHERE  tbl_foo.A = tbl_bar.A
AND    tbl_foo.C = Something
AND    tbl_bar.X = Something_Else

And you have indexes on A, C and X

That's actually 4 indexes: tbl_foo.A, tbl_foo.C, tbl_bar.A and tbl_bar.X.

And here's the rub: to do the join, you need to find the intersection of the complete index
sets, not just the per-node intersections.

You need each of the indexes in sort order. 
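
Put differently, the join ends up being a sort-merge over the qualifying index entries from
both tables, and that merge only works once each side is in one global sort order by A. A rough
client-side sketch (every name is hypothetical, and it assumes A is unique per table just to
keep it short):

import java.util.ArrayList;
import java.util.List;

/** One qualifying index entry: the join value (A) plus its base-table row key. */
class IndexEntry {
  final String a;       // value of column A
  final String rowKey;  // base-table row key
  IndexEntry(String a, String rowKey) { this.a = a; this.rowKey = rowKey; }
}

class IndexMergeJoin {
  // fooByA: tbl_foo entries that survived the C index, globally sorted by A
  // barByA: tbl_bar entries that survived the X index, globally sorted by A
  static List<String[]> joinOnA(List<IndexEntry> fooByA, List<IndexEntry> barByA) {
    List<String[]> pairs = new ArrayList<String[]>();
    int i = 0, j = 0;
    while (i < fooByA.size() && j < barByA.size()) {
      int c = fooByA.get(i).a.compareTo(barByA.get(j).a);
      if (c < 0) {
        i++;                     // this A value has no match in tbl_bar
      } else if (c > 0) {
        j++;                     // this A value has no match in tbl_foo
      } else {
        // A matches: remember both base row keys so the rows can be fetched later
        pairs.add(new String[] { fooByA.get(i).rowKey, barByA.get(j).rowKey });
        i++;
        j++;
      }
    }
    return pairs;
  }
}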

I'm not saying that you can't use the proposed solution, but that you will take a performance
hit on the reads. 

-Just saying…


On Aug 14, 2013, at 11:40 AM, Vladimir Rodionov <vladrodionov@gmail.com> wrote:

> Michael, I do not think it's a competitor to Solr, Solr/HBase or Cloudera
> Search, but it can be a good addition to an HBase SQL front-end such as
> Phoenix.
> 
> 
> On Wed, Aug 14, 2013 at 8:45 AM, Michael Segel <michael_segel@hotmail.com> wrote:
> 
>> Guys,
>> 
>> Sorry to be a Debbie Downer here, but really this is not a good idea.
>> Here's why:
>> 
>> In terms of design, you have some serious scalability and performance
>> issues when compared to alternatives.
>> 
>> 
>> Let me try to give you a real-life example.*
>> 
>> CCCIS (CCC Information Services) is the middle man in the US between the
>> auto repair shop and the insurance company. They have one competitor but
>> they handle most of the accident claims in the US.
>> So when you go to your authorized repair shop, they have this application
>> called Pathways which takes down all of your information, the accident details,
>> and the parts that need to be replaced, and sends it first to CCC, which then
>> sends it on to your insurance company. In short, CCC collects a lot of
>> information about the types of vehicles, the accidents, the cost of parts, and
>> the labor to put your car back on the road. As the middle man they collect a
>> lot of very useful information…
>> 
>> So imagine you have a large data warehouse in HBase of all of the claims.
>> Your primary key is going to be a composite of the insurer and the claim_id.
>> 
>> But you're also going to want to index based on the make/model, type of
>> accident, driver details, location, VIN…
>> 
>> This will allow your actuaries to figure out the average cost of a front-end
>> collision, by make and model, by state/zip.
>> Or, by age bracket, who's a better driver?
>> 
>> Imagine that the claim table will have a column for the claim in its
>> entirety  as an Avro doc (JSON) along with the important fields broken out
>> separately.  (For this example the schema isn't that important.)
>> 
>> So you want to find the average cost of a front-end collision on a VOLVO
>> S80 for the past 3 model years.
>> 
>> Now, you have an index based on manufacturer/model/year.
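>> 
>> (To make the index shape concrete, here's a sketch of the kind of inverted row
>> key I'd expect: manufacturer | model | year | base row key. Every name in the
>> snippet is made up.)
>> 
>> import org.apache.hadoop.hbase.client.Put;
>> import org.apache.hadoop.hbase.util.Bytes;
>> 
>> class ClaimIndex {
>>   // Index row key: make | model | year | base row key (insurer + claim_id), so a
>>   // prefix scan on "VOLVO|S80|..." walks the matching claims in key order.
>>   static Put indexEntry(String make, String model, int year, byte[] baseRowKey) {
>>     byte[] indexKey = Bytes.add(
>>         Bytes.toBytes(make + "|" + model + "|" + year + "|"), baseRowKey);
>>     Put p = new Put(indexKey);
>>     p.add(Bytes.toBytes("d"), Bytes.toBytes("k"), baseRowKey);  // value = base row key
>>     return p;
>>   }
>> }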
>> 
>> Using your index scheme, you now have to query every RS for the row keys
>> in the index.
>> Then you have to take those results and sort them before you can use the
>> index.
>> 
>> Note: This isn't too bad if you're doing a simple query against one index.
>> You can do the work by RS and then join the results from all RS.
>> 
>> However… what happens if you have two indexes and your result set is going
>> to be the intersection of the indexes?
>> 
>> Or you're going to do a join between two tables using the indexes to limit
>> the result set?
>> 
>> Now your design breaks down quickly.
>> 
>> And then there's another problem.
>> Your index will be much smaller than your base table.
>> In this example… the insurance claim is a huge record. I would say 2-3
>> orders of magnitude larger than the row key. Since you split your index
>> at the same rate you split your table… you will have a lot of regions for
>> your index.
>> 
>> Again, this may lead to other issues…
>> 
>> Is it better than doing a full table scan? Sure.
>> 
>> Are there better alternatives?
>> Yes.
>> Apply KISS. (Keep it simple)
>> 
>> Still using an inverted table, let HBase manage it rather than trying to
>> tie it to the underlying base table.
>> While it's not perfect, it's lighter, and will perform better in the general
>> use cases. (You could even use Async HBase to decouple the write to the
>> base table and the update to the index.)
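>> 
>> (For the Volvo S80 question above, a lookup against that kind of standalone
>> index table is just a range scan. A rough sketch with the 0.94 client, assuming
>> the value-plus-base-key row key sketched earlier and made-up table/column names:)
>> 
>> import java.io.IOException;
>> import org.apache.hadoop.conf.Configuration;
>> import org.apache.hadoop.hbase.client.HTable;
>> import org.apache.hadoop.hbase.client.Result;
>> import org.apache.hadoop.hbase.client.ResultScanner;
>> import org.apache.hadoop.hbase.client.Scan;
>> import org.apache.hadoop.hbase.util.Bytes;
>> 
>> class ClaimIndexScan {
>>   static void volvoS80LastThreeModelYears(Configuration conf) throws IOException {
>>     HTable idx = new HTable(conf, "claims_idx_make_model_year");
>>     Scan scan = new Scan(Bytes.toBytes("VOLVO|S80|2011|"),   // first model year of interest
>>                          Bytes.toBytes("VOLVO|S80|2014|"));  // exclusive upper bound
>>     ResultScanner results = idx.getScanner(scan);
>>     for (Result r : results) {
>>       byte[] claimKey = r.getValue(Bytes.toBytes("d"), Bytes.toBytes("k"));
>>       // claimKey points at the base claim row; keys come back already in index sort order
>>     }
>>     results.close();
>>     idx.close();
>>   }
>> }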
>> 
>> The same model could be applied to a Lucene index as well.
>> 
>> Just Saying….
>> 
>> -Mike
>> 
>> 
>> *FULL DISCLOSURE
>> I am a consultant and CCC was a client of mine back in the late '90s.  In
>> one project I worked on ProEFT (now defunct) and an ODS, also now defunct.
>> The example is a hypothetical of what I would do if I were CCC and wanted
>> to use Big Data to help manage Auto claims. Any resemblance to any actual
>> work being done by CCC in the Big Data space is pure coincidence. ;-)
>> 
>> On Aug 13, 2013, at 1:31 PM, Andrew Purtell <apurtell@apache.org> wrote:
>> 
>>> Thanks so much for the contribution!
>>> 
>>> On Mon, Aug 12, 2013 at 11:19 PM, rajeshbabu chintaguntla <
>>> rajeshbabu.chintaguntla@huawei.com> wrote:
>>> 
>>>> Hi,
>>>> 
>>>> We have been working on implementing secondary indexes in HBase, and shared
>>>> an overview of our design at the 2012 Hadoop Technical Conference in Beijing
>>>> (http://bit.ly/hbtc12-hindex). We are pleased to open source it today.
>>>> 
>>>> The project is available on github.
>>>> https://github.com/Huawei-Hadoop/hindex
>>>> 
>>>> It is 100% Java, compatible with Apache HBase 0.94.8, and is open sourced
>>>> under the Apache Software License v2.
>>>> 
>>>> The following features are currently supported:
>>>> - multiple indexes on a table,
>>>> - multi-column indexes,
>>>> - indexes based on part of a column value,
>>>> - equals and range-condition scans using the index, and
>>>> - bulk loading data into an indexed table (indexing is done as part of the
>>>>   bulk load).
>>>> 
>>>> We now plan to raise HBase JIRA(s) to make it available in an Apache release,
>>>> and hopefully continue our work on this in the community.
>>>> 
>>>> Regards
>>>> Rajeshbabu
>>>> 
>>>> 
>>> 
>>> 
>>> --
>>> Best regards,
>>> 
>>>  - Andy
>>> 
>>> Problems worthy of attack prove their worth by hitting back. - Piet Hein
>>> (via Tom White)
>> 
>> 

