hbase-dev mailing list archives

From James Taylor <jtay...@salesforce.com>
Subject Re: Design review: Secondary index support through coprocess
Date Mon, 20 Jan 2014 19:39:02 GMT
Yes, the coprocessors potentially cross RS boundaries. No, the index is not
co-located with the main table. Take a look at the link I sent as that
should be able to answer a lot of questions.

Thanks,
James


On Mon, Jan 20, 2014 at 11:03 AM, Michael Segel
<michael_segel@hotmail.com> wrote:

> James,
>
> Ok…
>
> It's been a while since we talked about this…
>
> While the index is in a separate table, is that table being split and
> collocated with the main table?
>
> If you’re using the coprocessor to maintain the index, that would imply
> you’re crossing RS boundaries if your index is truly orthogonal.
>
> Is this what you’re doing?
>
> On Jan 20, 2014, at 11:32 AM, James Taylor <jtaylor@salesforce.com> wrote:
>
> > Mike,
> > Yes, you're mistaken:
> > - secondary indexes in Phoenix are orthogonal to the base table. They're in
> > a separate table (
> > http://phoenix.incubator.apache.org/secondary_indexing.html).
> > - Phoenix has joins. They're in our master branch with a release scheduled
> > for next month
> > - numeric strings? Not a use case for indexing numeric data? Have you ever
> > seen a number used as an ID?
> > Thanks,
> > James
> >
> >
> > On Mon, Jan 20, 2014 at 8:50 AM, Michael Segel <michael_segel@hotmail.com> wrote:
> >
> >> Indexes tend to be orthogonal to the base table, not to mention if you’re
> >> using an inverted table for an index, your index table would be much
> >> thinner than your base table.
> >>
> >> Having said that, the solution proposed by Yu, Taylor and others only
> >> works if you want to use the index to help on server side filtering and
> >> misses the boat on the larger and broader picture of improving query
> >> optimization and joins.
> >>
> >> HINT: Unless I am mistaken… until you treat the index as orthogonal to the
> >> base table, you will always lag the performance of traditional MPP DWs like
> >> Informix XPS. (Now part of IBM’s IM pillar.)
> >>
> >> In addition, until you fix coprocessors in general, you will have
> >> scalability and performance issues.
> >> (Note that you can write a coprocessor to create a sandbox and separate
> >> the co-process from the RS jvm, however it would be better if it were part
> >> of the underlying coprocessor code.)
> >>
> >> The current implementation makes joins worthless.
> >> (Note that in prior discussions, Phoenix doesn’t do joins…)
> >> Here’s why:
> >> In order to do a join, if you use the proposed index, you have to first
> >> reduce each index into a single, sort-ordered set. Then you can take the
> >> intersection of the index result sets. The final set would be in sort
> >> order and a subset of the total rows. You can then fetch the rows and still
> >> do a server side filter before returning the ultimate result set.
> >>
> >> It's that first step of reducing each result set into a single sort-ordered
> >> set that takes a lot of effort.
> >>
> >>
> >> On a side note… there’s been some mention of ordering floats. Again, just
> >> a word of caution… there isn’t a really strong use case for indexing
> >> numeric data types, period. And to be very, very clear, there is a
> >> distinction between numeric strings and numeric data types.
> >>
> >> -Mike
> >>
> >> PS. Because of my role as a consultant, I am very, very limited in what I
> >> can say and contribute. I don’t own my work product, my clients do. Take
> >> what I say with a grain of salt. I’m just a skinny little boy from
> >> Cleveland Ohio, come to chase your beers and drink your women… ;-)
> >>
> >> On Jan 9, 2014, at 10:48 AM, James Taylor <jtaylor@salesforce.com> wrote:
> >>
> >>> IMHO, it would be valuable if the design considered both a global
> >>> indexing solution and a local indexing solution. Both are useful in
> >>> different circumstances. The global indexing design plus the
> >>> application integration points could be derived from Jesse's work with
> >>> his reference implementation in Phoenix - the global indexing code has
> >>> no Phoenix dependencies and clearly defined integration points.
> >>>
> >>> Thanks,
> >>> James
> >>>
> >>> On Jan 9, 2014, at 6:36 AM, Jesse Yates <jesse.k.yates@gmail.com> wrote:
> >>>
> >>>> Yes, that was a big concern I had as well.
> >>>>
> >>>> It's not clear how that will work with a large number of indexes; if people
> >>>> have one index, they will want more than one. To not plan for that seems
> >>>> like an incomplete implementation to me. In a horizontally scalable system
> >>>> like HBase, lots of buddy regions aren't going to work out well.* Once we
> >>>> have regions that cannot be collocated, the extra RPC time starts to be the
> >>>> biggest factor (as the doc points out) and we are back to what Phoenix is
> >>>> already doing**.
> >>>>
> >>>> But I'm probably missing something here in what makes it different?
> >>>>
> >>>> For folks that haven't been following the issue, some high-level "how it all
> >>>> kinda works" would be helpful from the championing committers; that's a long
> >>>> doc to get through and grok :). How similar is this to the work currently
> >>>> done by the existing indexing implementations (huawei, Phoenix, ngdata)? The
> >>>> doc doesn't really nail down the interactions, but instead jumps right in
> >>>> after describing why SI should be added.
> >>>>
> >>>> Agree this would be super useful, but don't want to waste too much work
> >>>> reinventing the wheel or doing the wrong thing. Further, this impl quickly
> >>>> starts to lead down the query optimization path, which gets HBase away from
> >>>> its core "be a great byte store".
> >>>>
> >>>> Like I said, I'm all for secondary indexes in HBase and think this is a
> >>>> great push. I don't mean to rain on any parades.
> >>>>
> >>>> - jesse
> >>>>
> >>>> * but a smart way to specify region collocation? That I can get behind as
> >>>> it would unify a couple different indexing impls (e.g. Phoenix would
> >>>> consider using it to help make indexing faster - RPCs do suck).
> >>>>
> >>>> ** for instance, the doc talks about how to implement indexing for
> >>>> floats... That might be a default impl, but for use cases like Phoenix this
> >>>> would break all our current encodings. We handled this in the indexing impl
> >>>> by making the builder pluggable for different use cases to support
> >>>> different encodings. I feel like a lot of the code for this kind of SI
> >>>> impl is already in Phoenix and has been working and fast for several months
> >>>> now; it's surprisingly tricky, especially with the delete cases and
> >>>> timestamp manipulation issues.
> >>>>
> >>>>
> >>>> On Thursday, January 9, 2014, Sudarshan Kadambi (BLOOMBERG/ 731 LEXIN)
> >>>> wrote:
> >>>>
> >>>>> Could you explain how the 1-1 association between user and index table
> >>>>> regions is maintained? I wasn't able to understand fully from the document.
> >>>>>
> >>>>> ----- Original Message -----
> >>>>> From: Ted Yu <dev@hbase.apache.org>
> >>>>> To: dev@hbase.apache.org
> >>>>> At: Jan 8, 2014 3:41:40 PM
> >>>>>
> >>>>> Hi,
> >>>>> Secondary index support is a frequently requested feature.
> >>>>>
> >>>>> Please find the updated design doc here:
> >>>>>
> >>>>>
> >>>>> https://issues.apache.org/jira/secure/attachment/12621909/SecondaryIndex%20Design_Updated_2.pdf
> >>>>>
> >>>>> HBASE-9203 is the umbrella JIRA.
> >>>>>
> >>>>> Implementation patch was attached to HBASE-10222
> >>>>>
> >>>>> Thanks to Rajesh who works on this feature.
> >>>>>
> >>>>> Cheers
> >>>>>
> >>>>
> >>>>
> >>>> --
> >>>> -------------------
> >>>> Jesse Yates
> >>>> @jesse_yates
> >>>> jyates.github.com
> >>>
> >>
> >>
>
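
The quoted thread mentions "ordering floats" for an index without showing what that involves. As a rough illustration only, here is one widely used, generic way to turn an IEEE 754 double into bytes whose unsigned byte-wise order matches numeric order. It is not taken from the design doc and it is not Phoenix's encoding (Jesse notes above that Phoenix has its own encodings and a pluggable builder); the function names are hypothetical.

```python
import struct

SIGN_BIT = 0x8000000000000000
ALL_BITS = 0xFFFFFFFFFFFFFFFF


def encode_sortable_double(value: float) -> bytes:
    """Encode a double as 8 bytes that sort byte-wise in numeric order."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", value))
    if bits & SIGN_BIT:
        bits ^= ALL_BITS   # negative: invert every bit so larger magnitudes sort first
    else:
        bits ^= SIGN_BIT   # non-negative: set the top bit so it sorts after negatives
    return struct.pack(">Q", bits)


def decode_sortable_double(encoded: bytes) -> float:
    """Reverse encode_sortable_double."""
    (bits,) = struct.unpack(">Q", encoded)
    if bits & SIGN_BIT:
        bits ^= SIGN_BIT   # was non-negative: clear the top bit again
    else:
        bits ^= ALL_BITS   # was negative: undo the full inversion
    return struct.unpack(">d", struct.pack(">Q", bits))[0]


values = [3.5, -2.0, 0.0, -10.25, 0.001]
by_encoded_bytes = sorted(values, key=encode_sortable_double)
```

Sorting by the encoded bytes gives the same order as sorting the numbers directly, which is the property a byte-ordered store needs for range scans over an indexed numeric column. NaN handling is left out of this sketch.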
>
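
The join procedure Mike describes in the quoted thread (reduce each index's hits to a single sorted set of row keys, intersect the sets, then fetch the rows) can be sketched roughly as below. This is a minimal, in-memory illustration and not HBase or Phoenix code: the function names and sample keys are hypothetical, and it assumes each region returns its matching base-table row keys already sorted.

```python
import heapq
from typing import Iterable, Iterator, List


def reduce_to_sorted(region_results: Iterable[List[bytes]]) -> Iterator[bytes]:
    """Merge per-region sorted key lists into one sorted, de-duplicated stream."""
    previous = None
    for key in heapq.merge(*region_results):
        if key != previous:
            yield key
            previous = key


def intersect_sorted(left: Iterable[bytes], right: Iterable[bytes]) -> List[bytes]:
    """Two-pointer intersection of two sorted key streams; output stays sorted."""
    out: List[bytes] = []
    left_it, right_it = iter(left), iter(right)
    a, b = next(left_it, None), next(right_it, None)
    while a is not None and b is not None:
        if a == b:
            out.append(a)
            a, b = next(left_it, None), next(right_it, None)
        elif a < b:
            a = next(left_it, None)
        else:
            b = next(right_it, None)
    return out


# Hypothetical hits from two index tables; each inner list is one region's
# matches, sorted by base-table row key (an assumption of this sketch).
color_index_hits = [[b"row03", b"row09"], [b"row01", b"row07"]]
size_index_hits = [[b"row01", b"row02", b"row09"], [b"row05"]]

candidates = intersect_sorted(
    reduce_to_sorted(color_index_hits),
    reduce_to_sorted(size_index_hits),
)
```

The merge in `reduce_to_sorted` is the "first step" the thread calls expensive: in a real cluster each inner list would come from a different region server, so the cost is dominated by gathering and ordering those partial results before the comparatively cheap intersection.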
