hbase-dev mailing list archives

From Vladimir Rodionov <vladrodio...@gmail.com>
Subject Re: [ANNOUNCE] Secondary Index in HBase - from Huawei
Date Wed, 14 Aug 2013 16:40:13 GMT
Michael, I do not think it's a competitor to Solr, Solr/HBase, or Cloudera
Search, but it could be a good addition to an HBase SQL front end such as
Phoenix.


On Wed, Aug 14, 2013 at 8:45 AM, Michael Segel <michael_segel@hotmail.com> wrote:

> Guys,
>
> Sorry to be a Debbie Downer here, but really this is not a good idea.
> Here's why:
>
> In terms of design, you have some serious scalability and performance
> issues when compared to alternatives.
>
>
> Let me try to give you a real-life example.*
>
> CCCIS (CCC Information Services) is the middleman in the US between the
> auto repair shop and the insurance company. They have one competitor, but
> they handle most of the accident claims in the US.
> So when you go to your authorized repair shop, they have this application
> called Pathways, which takes down all of your information, the accident,
> and the parts that need to be replaced, and sends it first to CCC, which
> then sends it on to your insurance company. In short, CCC collects a lot of
> information about the types of vehicles, the accidents, and the cost of
> parts and labor to put your car back on the road. As the middleman, they
> collect a lot of very useful information…
>
> So imagine you have a large data warehouse in HBase of all of the claims.
> Your primary key is going to be a composite of the insurer and the claim_id.
>
> But you're going to want to also index based on the make/model, type of
> accident, driver details, location, VIN…
>
> This will allow your actuaries to figure out the average cost of a front
> end collision, by make and model, by state/zip.
> Or by age bracket, who's a better driver?
>
> Imagine that the claim table will have a column for the claim in its
> entirety  as an Avro doc (JSON) along with the important fields broken out
> separately.  (For this example the schema isn't that important.)
>
> So you want to find the average cost of a front end collision of a VOLVO
> S80 for the past 3 model years.
>
> Now, you have an index based on manufacturer/model/year.
>
> Using your index scheme, you now have to query every RS for the row keys
> in the index.
> Then you have to take these results and sort them before you can use them
> as an index.
>
> Note: This isn't too bad if you're doing a simple query against one index.
> You can do the work by RS and then join the results from all RS.
>
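
The single-index case described above can be sketched in plain Python. This is only a toy model under stated assumptions: the `REGION_INDEXES` dictionary, the server names, the key layout, and the sample row keys are all hypothetical, and nothing here is the hindex or HBase API. It shows the fan-out to every region server followed by a merge of the per-server sorted results.

```python
import heapq

# Hypothetical stand-in for region-local indexes: each "region server" holds a
# sorted list of (index_key, base_row_key) pairs that covers only the
# base-table rows hosted on that server.
REGION_INDEXES = {
    "rs1": [("VOLVO/S80/2011", "acme|c-0007"), ("VOLVO/S80/2013", "acme|c-0002")],
    "rs2": [("FORD/F150/2012", "zeta|c-0100"), ("VOLVO/S80/2012", "zeta|c-0042")],
    "rs3": [("VOLVO/S80/2011", "omni|c-0900")],
}


def scan_local_index(entries, prefix):
    """Return the entries on one server whose index key starts with prefix."""
    return [entry for entry in entries if entry[0].startswith(prefix)]


def query_all_servers(prefix):
    """Ask every server, then merge the per-server sorted runs into one."""
    partials = [scan_local_index(entries, prefix)
                for entries in REGION_INDEXES.values()]
    # heapq.merge assumes each input is already sorted, which holds here
    # because filtering a sorted list keeps it sorted.
    return list(heapq.merge(*partials)), len(partials)


rows, servers_contacted = query_all_servers("VOLVO/S80/")
for index_key, base_row_key in rows:
    print(index_key, base_row_key)
```

Note that every server is contacted even when it holds no matching entries, which is the cost the message is pointing at.
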
> However… what happens if you have two indexes and your result set is going
> to be the intersection of the indexes?
>
> Or you're going to do a join between two tables using the indexes to limit
> the result set?
>
> Now your design breaks down quickly.
>
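
For the two-index case, whatever the index layout, the query ends up intersecting two sets of base-table row keys. A minimal sketch, assuming each index lookup has already returned its row keys (the sets and the `intersect_indexes` helper below are made up for illustration):

```python
# Hypothetical results of two separate index lookups, as base-table row keys.
by_make_model = {"acme|c-0002", "acme|c-0007", "zeta|c-0042"}      # VOLVO S80
by_accident_type = {"acme|c-0007", "zeta|c-0042", "zeta|c-0100"}   # front end


def intersect_indexes(*row_key_sets):
    """Row keys present in every index result, returned in row-key order."""
    return sorted(set.intersection(*map(set, row_key_sets)))


matches = intersect_indexes(by_make_model, by_accident_type)
print(matches)
```

Only the rows in `matches` then need to be fetched from the base table. The expensive part in practice is gathering each input set, not the intersection itself.
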
> And then there's another problem.
> Your index may be much smaller than your base table.
> In this example… the insurance claim is a huge record. I would say 2-3
> orders of magnitude larger than the row key. Since you split your index
> at the same rate you split your table… you will have a lot of regions for
> your index.
>
> Again, this may lead to other issues…
>
> Is it better than doing a full table scan? Sure.
>
> Are there better alternatives?
> Yes.
> Apply KISS. (Keep it simple)
>
> Still using an inverted table, let HBase manage it rather than trying to
> tie it to the underlying base table.
> While it's not perfect, it's lighter, and will perform better in the general
> use cases. (You could even use Async HBase to decouple the write to the
> base table and the update to the index.)
>
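
The inverted-table alternative can be sketched as follows. This is a toy model, not HBase client code: `SortedTable` is a hypothetical stand-in for a table kept in row-key order, the `SEP` separator and key layout are assumptions, and the claim costs are invented sample values. The index is just another table whose row key is the indexed value followed by the base-table row key, so a lookup is a single prefix scan.

```python
import bisect


class SortedTable:
    """Toy stand-in for an HBase table: rows are kept in row-key order."""

    def __init__(self):
        self._keys = []
        self._rows = {}

    def put(self, row_key, value):
        if row_key not in self._rows:
            bisect.insort(self._keys, row_key)
        self._rows[row_key] = value

    def get(self, row_key):
        return self._rows[row_key]

    def scan_prefix(self, prefix):
        """Yield (row_key, value) for every row key that starts with prefix."""
        start = bisect.bisect_left(self._keys, prefix)
        for key in self._keys[start:]:
            if not key.startswith(prefix):
                break
            yield key, self._rows[key]


SEP = "\x00"  # assumed separator that never occurs in an indexed value

claims = SortedTable()             # base table, keyed by insurer|claim_id
claims_by_vehicle = SortedTable()  # inverted table, managed like any other


def put_claim(insurer, claim_id, vehicle, cost):
    base_key = f"{insurer}|{claim_id}"
    claims.put(base_key, {"vehicle": vehicle, "cost": cost})
    # Index row key = indexed value + separator + base row key; no payload.
    claims_by_vehicle.put(f"{vehicle}{SEP}{base_key}", b"")


def find_claims(vehicle_prefix):
    for index_key, _ in claims_by_vehicle.scan_prefix(vehicle_prefix):
        base_key = index_key.split(SEP, 1)[1]
        yield base_key, claims.get(base_key)


put_claim("acme", "c-0007", "VOLVO/S80/2011", 4200)
put_claim("zeta", "c-0042", "VOLVO/S80/2012", 3100)
put_claim("zeta", "c-0100", "FORD/F150/2012", 5000)

costs = [row["cost"] for _, row in find_claims("VOLVO/S80/")]
average_cost = sum(costs) / len(costs)
print(average_cost)
```

In this sketch the two writes in `put_claim` happen one after the other. The message's suggestion of decoupling them would mean issuing the index write asynchronously, at the cost of the index briefly lagging the base table.
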
> Same model could be applied to a Lucene index as well.
>
> Just saying…
>
> -Mike
>
>
> *FULL DISCLOSURE
> I am a consultant and CCC was a client of mine back in the late '90s. In
> one project I worked on ProEFT (now defunct) and an ODS, also now defunct.
> The example is a hypothetical of what I would do if I were CCC and wanted
> to use Big Data to help manage auto claims. Any resemblance to any actual
> work being done by CCC in the Big Data space is pure coincidence. ;-)
>
> On Aug 13, 2013, at 1:31 PM, Andrew Purtell <apurtell@apache.org> wrote:
>
> > Thanks so much for the contribution!
> >
> > On Mon, Aug 12, 2013 at 11:19 PM, rajeshbabu chintaguntla <
> > rajeshbabu.chintaguntla@huawei.com> wrote:
> >
> >> Hi,
> >>
> >> We have been working on implementing a secondary index in HBase, and
> >> shared an overview of our design at the 2012 Hadoop Technical Conference
> >> in Beijing (http://bit.ly/hbtc12-hindex). We are pleased to open source it
> >> today.
> >>
> >> The project is available on github.
> >>  https://github.com/Huawei-Hadoop/hindex
> >>
> >> It is 100% Java, compatible with Apache HBase 0.94.8, and is open sourced
> >> under the Apache Software License v2.
> >>
> >> The following features are currently supported:
> >> -          multiple indexes on a table,
> >> -          multi-column indexes,
> >> -          indexes based on part of a column value,
> >> -          equals and range condition scans using an index, and
> >> -          bulk loading data into an indexed table (indexing is done
> >>            with the bulk load)
> >>
> >> We now plan to raise HBase JIRA(s) to make it available in an Apache
> >> release, and can hopefully continue our work on this in the community.
> >>
> >> Regards
> >> Rajeshbabu
> >>
> >>
> >
> >
> > --
> > Best regards,
> >
> >   - Andy
> >
> > Problems worthy of attack prove their worth by hitting back. - Piet Hein
> > (via Tom White)
>
>
