hbase-dev mailing list archives

From Michael Segel <michael_se...@hotmail.com>
Subject Re: Design review: Secondary index support through coprocess
Date Mon, 20 Jan 2014 16:50:46 GMT
Indexes tend to be orthogonal to the base table, not to mention if you’re using an inverted
table for an index, your index table would be much thinner than your base table. 

Having said that, the solution proposed by Yu, Taylor and others only works if you want to
use the index to help on server side filtering and misses the boat on the larger and broader
picture of improving query optimization and joins. 

HINT: Unless I am mistaken… until you treat the index as orthogonal to the base table, you
will always lag the performance of traditional MPP DWs like Informix XPS (now part of IBM’s
IM pillar).

In addition, until you fix coprocessors in general, you will have scalability and performance
issues. (Note that you can write a coprocessor to create a sandbox and separate the co-process from
the RS JVM; however, it would be better if it were part of the underlying coprocessor code.)

The current implementation makes joins worthless.
(Note that in prior discussions,  Phoenix doesn’t do joins…) 
Here’s why:
In order to do a join, if you use the proposed index, you have to first reduce each index
into a single, sort-ordered set.  Then you can take the intersection of the index result
sets.  The final set would be in sort order and a subset of the total rows. You can then fetch
the rows and still do a server side filter before returning the ultimate result set.

It's that first step of reducing each result set into a single sort-ordered set that takes
a lot of effort. 
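
To make the steps above concrete, here is a minimal sketch in Python. It is not code from the design doc or from Phoenix. It assumes each region hands back its matching row keys already sorted, and the names (reduce_index, intersect_sorted, the sample row keys) are made up for illustration.

```python
import heapq

def reduce_index(partials):
    """Step 1: merge per-region sorted row-key lists into one sorted,
    de-duplicated list. This is the costly part: every partial result
    has to be pulled to one place before anything else can happen."""
    merged = []
    for key in heapq.merge(*partials):
        if not merged or merged[-1] != key:
            merged.append(key)
    return merged

def intersect_sorted(a, b):
    """Step 2: intersect two sorted row-key lists in a single pass."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

# Hypothetical partial results: one sorted list of row keys per region,
# for two different indexes on the same base table.
city_index_regions = [[b"row03", b"row07"], [b"row01", b"row09"], [b"row05"]]
status_index_regions = [[b"row05", b"row09"], [b"row02", b"row03"]]

city_rows = reduce_index(city_index_regions)
status_rows = reduce_index(status_index_regions)

# Sorted subset of the base table's rows; these are the rows you would
# then fetch and optionally filter server side.
candidates = intersect_sorted(city_rows, status_rows)
print(candidates)
```

The intersection itself is a cheap linear pass; the merge in reduce_index is where the work goes, and it grows with the number of regions each index is spread over.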

On a side note… there’s been some mention of ordering floats. Again, just a word of caution…
there isn’t a really strong use case for indexing numeric data types. Period.  And to be
very, very clear, there is a distinction between numeric strings and numeric data types. 
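
A small illustration of that distinction, as a sketch only: byte-wise (lexicographic) ordering, which is how row keys sort, treats a numeric string differently from a fixed-width binary number. The sample values are arbitrary, and the fixed-width encoding shown covers unsigned integers only; signed values and floats need additional encoding work that is not shown here.

```python
import struct

values = [9, 10, 100, 2]

# Numeric strings: sorted byte by byte, so "10" sorts before "2".
as_strings = sorted(str(v).encode() for v in values)

# Numeric data type: 4-byte big-endian unsigned ints sort in numeric order.
as_big_endian = sorted(struct.pack(">I", v) for v in values)
decoded = [struct.unpack(">I", b)[0] for b in as_big_endian]

print(as_strings)
print(decoded)
```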


PS. Because of my role as a consultant, I am very, very limited in what I can say and contribute.
I don’t own my work product, my clients do. Take what I say with a grain of salt.  I’m
just a skinny little boy from Cleveland Ohio, come to chase your beers and drink your women…

On Jan 9, 2014, at 10:48 AM, James Taylor <jtaylor@salesforce.com> wrote:

> IMHO, it would be valuable if the design considered both a global
> indexing solution and a local indexing solution. Both are useful in
> different circumstances. The global indexing design plus the
> application integration points could be derived from Jesse's work with
> his reference implementation in Phoenix - the global indexing code has
> no Phoenix dependencies and clearly defined integration points.
> Thanks,
> James
> On Jan 9, 2014, at 6:36 AM, Jesse Yates <jesse.k.yates@gmail.com> wrote:
>> Yes, that was a big concern I had as well.
>> It's not clear how that will work with a large number of indexes; if people
>> have one index, they will want more than one. To not plan for that seems
>> like an incomplete implementation to me. In a horizontally scalable system
>> like HBase, lots of buddy regions aren't going to work out well.* Once we
>> have regions that cannot be collocated, the extra RPC time starts to be the
>> biggest factor (as the doc points out) and we are back to what Phoenix is
>> already doing**.
>> But I'm probably missing something here in what makes it different?
>> For folks that haven't been following the issue some high-level "how it all
>> kinda works" would be helpful from the championing committers; that's a long
>> doc to get through and grok :). How similar is this to the work currently done
>> by the existing indexing implementations (huawei, Phoenix, ngdata)? The doc
>> doesn't really nail down the interactions, but instead jumps right in after
>> describing why SI should be added.
>> Agree this would be super useful, but don't want to waste too much work
>> reinventing the wheel or doing the wrong thing. Further, this impl quickly
>> starts to lead down the query optimization path, which gets HBase away from
>> its core "be a great byte store".
>> Like I said, I'm all for secondary indexes in HBase and think this is a
>> great push. I don't mean to rain on any parades.
>> - jesse
>> * but a smart way to specify region collocation? That I can get behind as
>> it would unify a couple different indexing impls (e.g. Phoenix would
>> consider using it to help make indexing faster - RPCs do suck).
>> ** for instance, the doc talks about how to implement indexing for
>> floats... That might be a default impl, but for use cases like Phoenix this
>> would break all our current encodings. We handled this in the indexing impl
>> by making the builder pluggable for different use cases to support
>> different encodings. I feel like a lot of the code for this kind of SI
>> impl is already in Phoenix and has been working and fast for several months
>> now; it's surprisingly tricky, especially with the delete cases and time
>> stamp manipulation issues.
>> On Thursday, January 9, 2014, Sudarshan Kadambi (BLOOMBERG/ 731 LEXIN)
>> wrote:
>>> Could you explain how the 1-1 association between user and index table
>>> regions is maintained. I wasn't able to understand fully from the document.
>>> ----- Original Message -----
>>> From: Ted Yu <dev@hbase.apache.org>
>>> To: dev@hbase.apache.org
>>> At: Jan 8, 2014 3:41:40 PM
>>> Hi,
>>> Secondary index support is a frequently requested feature.
>>> Please find the updated design doc here:
>>> https://issues.apache.org/jira/secure/attachment/12621909/SecondaryIndex%20Design_Updated_2.pdf
>>> HBASE-9203 is the umbrella JIRA.
>>> Implementation patch was attached to HBASE-10222
>>> Thanks to Rajesh who works on this feature.
>>> Cheers
>> --
>> -------------------
>> Jesse Yates
>> @jesse_yates
>> jyates.github.com
