hbase-dev mailing list archives

From "Rose, Joseph" <Joseph.R...@childrens.harvard.edu>
Subject Re: Status of Huawei's 2' Indexing?
Date Mon, 16 Mar 2015 18:51:20 GMT
Thanks, Wilm. I’ll look for the thread there.

Obviously I didn’t realize there was so much back story: I was asking
about this specific implementation because it seems to be fairly well
thought out and have good commentary in the Jira ticket (HBASE-9203). At
the time I thought it was mostly a dev concern. I think we’ve moved on, as
you pointed out.

I'd be happy to contribute to hbase if I have something to offer. I’m just
starting with this, so let’s see where it takes us.

For those of you joining us late, you can find the continuation here:


On 3/16/15, 2:09 PM, "Wilm Schumacher" <wilm.schumacher@gmail.com> wrote:

>Hi Joseph,
>I think that you kicked off this discussion because implementing an
>indexing mechanism for hbase in general is much more complicated than
>your specific problem. The people on this list want to bear every
>possible application (or at least A LOT of them) in mind. A too-simple
>mechanism wouldn't fit the needs of most of the users (and thus would be
>useless); a more complicated model is harder to maintain and you would
>have to find more coders etc. Thus with your application question you
>seem to have walked right into a very general discussion.
>Furthermore this is a user question, as you do not want to change the
>code of hbase, do you ;). I'll try an answer on the general user
>list in a couple of minutes, so more people can discuss and we can get
>traffic off this list, okay?
>Best wishes
>On 16.03.2015 at 18:46, Rose, Joseph wrote:
>> Alright, let’s see if I can get this discussion back on track.
>> I have a sensibly defined table for patient data; its rowkey is simply
>> lastname:firstname, since it’s convenient for the bulk of my lookups.
>> Unfortunately I also need to efficiently find patients using an ID
>> whose literal value is buried in a value field. I’m sure this situation
>> is not foreign to the people on this list.
>> It’s been suggested that I implement 2’ indexes myself — fine. All the
>> research I’ve done seems to end with that suggestion, with the exception
>> of Phoenix (I don’t want the RDBMS layer) and Huawei’s stuff (which
>> seems to incite some discussion here). I’m happy to put this together but I’d
>> rather go with something that has been vetted and has a larger developer
>> community than one (i.e., ME). Besides, I have a full enough plate at
>> the moment that I’d rather not have to do this, too.
>> Are there constructive suggestions regarding how I can proceed with
>> this? Right now even a well-vetted local index would be a godsend.
>> Thanks.
>> -j
>> p.s., I’ll refer you to this post for a slightly more detailed rundown
>> of how I plan to do things:
>> On 3/16/15, 12:18 PM, "Michael Segel" <michael_segel@hotmail.com> wrote:
>>> Joseph, 
>>> The issue with Andrew goes back a few years.  His comment about having
>>> a civilized discussion was a personal dig at me.
>>>> On Mar 16, 2015, at 10:38 AM, Rose, Joseph
>>>> <Joseph.Rose@childrens.harvard.edu> wrote:
>>>> Michael,
>>>> I don’t understand the invective. I’m sure you have something to
>>>> contribute but when you bring on this tone the only things I hear are
>>>> the snide comments.
>>>> -j
>>>> P.s., I’ll refer you to this:
>>>> On 3/16/15, 11:15 AM, "Michael Segel" <michael_segel@hotmail.com> wrote:
>>>>> You’ll have to excuse Andy.
>>>>> He’s a bit slow.  HBASE-13044 should have been done 2 years ago. And
>>>>> it was trivial. Just got done last month….
>>>>> But I digress… The long story short…
>>>>> HBASE-9203 was brain dead from inception.  Huawei’s idea was to index
>>>>> on the region, which had two problems.
>>>>> 1) Complexity, in that they wanted to keep the index on the same server
>>>>> 2) Joins become impossible.  Well, actually not impossible, but
>>>>> incredibly slow when compared to the alternative.
>>>>> You really should go back to the email chain.
>>>>> Their defense (including Salesforce who was going to push this
>>>>> approach) fell apart when you asked the simple question: how do you
>>>>> handle joins?
>>>>> That’s their OOPS moment. Once you start to understand that, and
>>>>> allow the index to be orthogonal to the base table, things start
>>>>> to come together.
>>>>> In short, you have a query either against a single table or, if
>>>>> you’re doing a join, against several. You then get the indexes and,
>>>>> assuming that you’re only using the AND predicate, it’s a simple
>>>>> intersection of the index sets. (Since the result sets are ordered,
>>>>> it’s relatively trivial to walk through and find the intersections
>>>>> of N lists in a single pass.)
>>>>> Now you have your result set of base table row keys and you can work
>>>>> with that data (either returning the records to the client, or as
>>>>> input to a map/reduce job).
>>>>> That’s the 30K view.  There’s more to it, but once Salesforce got the
>>>>> basic idea, they ran with it. It was really that simple concept, that
>>>>> the index would be orthogonal to the base table, that got them moving
>>>>> in the right direction.
>>>>> To Joseph’s point, indexing isn’t necessarily an RDBMS feature.
>>>>> However, it seems that some of the Committers are suffering from
>>>>> rectal hypoxia. HBASE-12853 was created not just to help solve the
>>>>> issue of ‘hot spotting’ but also to get the Committers to focus on
>>>>> bringing the solutions that they glom on in the client back to the
>>>>> server side of things.
>>>>> Unfortunately the last great attempt at fixing things on the server
>>>>> side was the bastardization of coprocessors which, again, suffers
>>>>> from a lack of thought.  This isn’t to say that allowing users to
>>>>> extend the server side functionality is wrong. (Because it isn’t.)
>>>>> But the implementation done in HBase is a tad lacking in thought.
>>>>> So in terms of indexing…
>>>>> Longer term picture, there have to be some fixes on the server side
>>>>> of things to allow one to associate an index (allowing for different
>>>>> types) with a base table, yet the implementation of using the index
>>>>> would end up becoming a client.  And by client, it would be an
>>>>> external query processor that could/should sit on the cluster.
>>>>> But hey! What do I know?
>>>>> I gave up trying to have an intelligent/civilized conversation with
>>>>> Andrew because he just couldn’t grasp the basics.  ;-)
>>>>>> On Mar 13, 2015, at 4:14 PM, Andrew Purtell <apurtell@apache.org>
>>>>>> wrote:
>>>>>> When I made that remark I was thinking of a recent discussion we
>>>>>> had at a joint Phoenix and HBase developer meetup. The difference
>>>>>> of opinion was certainly civilized. (smile) I'm not aware of any
>>>>>> specific written discussion, it may or may not exist. I'm pretty
>>>>>> sure a revival of HBASE-9203 would attract some controversy, but
>>>>>> let me be clearer this time than I was before that this is just my
>>>>>> opinion, FWIW.
>>>>>> On Thu, Mar 12, 2015 at 3:58 PM, Rose, Joseph <
>>>>>> Joseph.Rose@childrens.harvard.edu> wrote:
>>>>>>> I saw that it was added to their project. I’m really not keen on
>>>>>>> bringing in all the RDBMS apparatus on top of hbase, so I decided
>>>>>>> to follow other avenues first (like trying to patch 0.98, for
>>>>>>> better or worse.)
>>>>>>> That Phoenix article seems like a good breakdown of the various
>>>>>>> indexing architectures.
>>>>>>> HBASE-9203 (the ticket that deals with 2’ indexes) is pretty
>>>>>>> civilized (as are most of them, it seems) so I didn’t know there
>>>>>>> were these differences of opinion. Did I miss the mailing list
>>>>>>> thread where the architectural differences were discussed?
>>>>>>> -j
>>>>> The opinions expressed here are mine, while they may reflect a
>>>>> cognitive thought, that is purely accidental.
>>>>> Use at your own risk.
>>>>> Michael Segel
>>>>> michael_segel (AT) hotmail.com
>>> The opinions expressed here are mine, while they may reflect a
>>> cognitive thought, that is purely accidental.
>>> Use at your own risk.
>>> Michael Segel
>>> michael_segel (AT) hotmail.com
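
For readers who want something concrete, the AND step described in the quoted thread (intersecting ordered lists of base-table row keys, one list per index lookup, in a single pass) can be sketched roughly as below. This is only a toy illustration under stated assumptions: the "tables" are in-memory Python dicts rather than HBase tables, and the names build_index and intersect_sorted, along with the sample patient columns, are hypothetical and not part of HBase, Phoenix, or the HBASE-9203 patch.

```python
from typing import Dict, List


def build_index(rows: Dict[str, Dict[str, str]], column: str) -> Dict[str, List[str]]:
    """Map each value of `column` to the ascending list of row keys holding it.

    Stands in for an index kept separately from ("orthogonal to") the base table.
    """
    index: Dict[str, List[str]] = {}
    for row_key in sorted(rows):  # sorted, so every posting list is ordered
        value = rows[row_key].get(column)
        if value is not None:
            index.setdefault(value, []).append(row_key)
    return index


def intersect_sorted(lists: List[List[str]]) -> List[str]:
    """Intersect N ascending, duplicate-free lists with one forward pass over each."""
    if not lists or any(not lst for lst in lists):
        return []
    positions = [0] * len(lists)  # one cursor per list
    result: List[str] = []
    while True:
        current = [lst[p] for lst, p in zip(lists, positions)]
        largest = max(current)
        if all(key == largest for key in current):
            # Every cursor agrees: the key is in all lists.
            result.append(largest)
            positions = [p + 1 for p in positions]
        else:
            # Advance only the cursors that are behind the largest key.
            positions = [p + 1 if key < largest else p
                         for p, key in zip(positions, current)]
        if any(p >= len(lst) for lst, p in zip(lists, positions)):
            return result


# Hypothetical base table keyed by lastname:firstname, as in the thread.
rows = {
    "doe:jane":  {"patient_id": "1001", "ward": "east", "status": "admitted"},
    "poe:edgar": {"patient_id": "1002", "ward": "west", "status": "admitted"},
    "roe:rick":  {"patient_id": "1003", "ward": "east", "status": "discharged"},
    "smith:ann": {"patient_id": "1004", "ward": "east", "status": "admitted"},
}

by_id = build_index(rows, "patient_id")
by_ward = build_index(rows, "ward")
by_status = build_index(rows, "status")

# ward = east AND status = admitted
matches = intersect_sorted([by_ward["east"], by_status["admitted"]])
print(matches)  # ['doe:jane', 'smith:ann']
```

Because each cursor only moves forward, the work is bounded by the total length of the lists. A lookup by a single ID, which is the case that started the thread, needs no intersection at all: by_id["1003"] yields the base-table row key directly.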
