I think both approaches should be provided to HBase users.
These are new features that would both find proper usage scenarios.
Cheers
On Jan 3, 2014, at 5:48 AM, ramkrishna vasudevan <ramkrishna.s.vasudevan@gmail.com> wrote:
> What is generally of more interest: RLI or global-level indexing? I know
> it depends on the use case, but is there a common need?
>
>
> On Fri, Jan 3, 2014 at 4:31 PM, Anoop John <anoop.hbase@gmail.com> wrote:
>
>> How the time taken scales with the number of region servers (keeping
>> the number of matching rows constant) would be of most interest.
>>
>> -Anoop-
>>
>> On Fri, Jan 3, 2014 at 3:49 PM, rajeshbabu chintaguntla <
>> rajeshbabu.chintaguntla@huawei.com> wrote:
>>
>>>
>>> Here are some performance numbers with RLI.
>>>
>>> No. of region servers: 4
>>> Data per region: 2 GB
>>>
>>> Regions/RS | Total regions | Block size (KB) | Rows matching | Time taken (s)
>>> 50         | 200           | 64              | 199           | 102
>>> 50         | 200           | 8               | 199           | 35
>>> 100        | 400           | 8               | 350           | 95
>>> 200        | 800           | 8               | 353           | 153
>>>
>>> Without a secondary index, the same scan takes hours.
>>>
>>>
>>> Thanks,
>>> Rajeshbabu
>>> ________________________________________
>>> From: Anoop John [anoop.hbase@gmail.com]
>>> Sent: Friday, January 03, 2014 3:22 PM
>>> To: user@hbase.apache.org
>>> Subject: Re: secondary index feature
>>>
>>>> Is there any data on how RLI (or in particular Phoenix) query
>>>> throughput correlates with the number of region servers, assuming
>>>> homogeneously distributed data?
>>>
>>> Phoenix has yet to add RLI; for now it has global indexing only.
>>> Correct, James?
>>>
>>> The RLI implementation from Huawei (HIndex) has some numbers with
>>> respect to regions, but I doubt they cover a large number of RSs. Do
>>> you have some data, Rajesh Babu?
>>>
>>> -Anoop-
>>>
>>> On Fri, Jan 3, 2014 at 3:11 PM, Henning Blohm
>>> <henning.blohm@zfabrik.de> wrote:
>>>
>>>> Jesse, James, Lars,
>>>>
>>>> after looking around a bit and in particular looking into Phoenix
>>>> (which I find very interesting), assuming that you want secondary
>>>> indexing on HBase without adding other infrastructure, there does not
>>>> seem to be a lot of choice really: either go with a region-level (and
>>>> co-processor-based) indexing feature (Phoenix, Huawei - is IHBase
>>>> dead?) or add an index table to store (index value, entity key) pairs.
>>>>
>>>> The main concern I have with region-level indexing (RLI) is that Gets
>>>> potentially require visiting all regions. Compared to global index
>>>> tables, this seems to flatten the read-scalability curve of the
>>>> cluster. In our case, we have a large data set (hence HBase) that
>>>> will be queried (mostly point-gets via an index) in some linear
>>>> correlation with its size.
>>>>
>>>> Is there any data on how RLI (or in particular Phoenix) query
>>>> throughput correlates with the number of region servers, assuming
>>>> homogeneously distributed data?
>>>>
>>>> Thanks,
>>>> Henning
>>>>
>>>>
>>>>
>>>>
>>>> On 24.12.2013 12:18, Henning Blohm wrote:
>>>>
>>>>> All that sounds very promising. I will give it a try and let you know
>>>>> how things worked out.
>>>>>
>>>>> Thanks,
>>>>> Henning
>>>>>
>>>>> On 12/23/2013 08:10 PM, Jesse Yates wrote:
>>>>>
>>>>>> The work that James is referencing grew out of the discussions Lars
>>>>>> and I had (which led to those blog posts). The solution we
>>>>>> implemented is designed to be generic, as James mentioned above, but
>>>>>> was written with all the hooks necessary for Phoenix to do some
>>>>>> really fast updates (or skipping updates in the case where there is
>>>>>> no change).
>>>>>>
>>>>>> You should be able to plug in your own simple index builder (there
>>>>>> is an example in the Phoenix codebase
>>>>>> <https://github.com/forcedotcom/phoenix/tree/master/src/main/java/com/salesforce/hbase/index/covered/example>)
>>>>>> into the basic solution, which supports the same transactional
>>>>>> guarantees as HBase (per row) plus data guarantees across the index
>>>>>> rows. There are more details in the presentations James linked.
>>>>>>
>>>>>> I'd love to see if your implementation can fit into the framework
>>>>>> we wrote - we would be happy to work with you to see if it needs
>>>>>> some more hooks or modifications. I have a feeling this is pretty
>>>>>> much what you guys will need.
>>>>>>
>>>>>> -Jesse
>>>>>>
>>>>>>
>>>>>> On Mon, Dec 23, 2013 at 10:01 AM, James Taylor
>>>>>> <jtaylor@salesforce.com> wrote:
>>>>>>
>>>>>>> Henning,
>>>>>>> Jesse Yates wrote the back-end of our global secondary indexing
>>>>>>> system in Phoenix. He designed it as a separate, pluggable module
>>>>>>> with no Phoenix dependencies. Here's an overview of the feature:
>>>>>>> https://github.com/forcedotcom/phoenix/wiki/Secondary-Indexing.
>>>>>>> The section that discusses the data guarantees and failure
>>>>>>> management might be of interest to you:
>>>>>>> https://github.com/forcedotcom/phoenix/wiki/Secondary-Indexing#data-guarantees-and-failure-management
>>>>>>>
>>>>>>> This presentation also gives a good overview of the pluggability
>>>>>>> of his implementation:
>>>>>>> http://files.meetup.com/1350427/PhoenixIndexing-SF-HUG_09-26-13.pptx
>>>>>>>
>>>>>>> Thanks,
>>>>>>> James
>>>>>>>
>>>>>>>
>>>>>>> On Mon, Dec 23, 2013 at 3:47 AM, Henning Blohm
>>>>>>> <henning.blohm@zfabrik.de> wrote:
>>>>>>>
>>>>>>>> Lars, that is exactly why I am hesitant to use one of the
>>>>>>>> core-level generic approaches (apart from having difficulties
>>>>>>>> identifying the still-active projects): I have doubts I can
>>>>>>>> sufficiently explain to myself when and where they fail.
>>>>>>>>
>>>>>>>> With "toolbox approach" I meant to say that turning entity data
>>>>>>>> into index data is not done generically but rather involves
>>>>>>>> domain-specific application code that
>>>>>>>>
>>>>>>>> - indicates what makes an index key given an entity
>>>>>>>> - indicates whether an index entry is still valid given an entity
>>>>>>>>
>>>>>>>> That code is also used during the index rebuild and trimming (an
>>>>>>>> M/R job).
>>>>>>>>
>>>>>>>> So validating an index entry means loading the entity it points
>>>>>>>> to and - before considering it a valid result - checking whether
>>>>>>>> the values of the entity still match the index.
>>>>>>>>
>>>>>>>> The entity is written last, hence when the client dies halfway
>>>>>>>> through the update you may get stale index entries, but nothing
>>>>>>>> else should break.
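A minimal in-memory sketch of the rebuild/trim pass described above, built around the two domain-specific callbacks. All names here (`trim_index`, `load_entity`, `is_valid`) are made up for illustration; this is not code from any of the libraries discussed in the thread.

```python
# Sketch of the "toolbox approach": domain code decides what an index key
# is and whether an entry is still valid; the trim pass loads each entity
# and drops stale index entries. All names are illustrative.

def trim_index(index_entries, load_entity, is_valid):
    """Keep only index entries whose entity still matches; drop stale ones."""
    kept = []
    for value, entity_key in index_entries:
        entity = load_entity(entity_key)
        # an entry is stale if the entity is gone or no longer matches
        if entity is not None and is_valid(value, entity):
            kept.append((value, entity_key))
    return kept

# toy domain: "user" entities indexed by email
users = {"u1": {"email": "new@example.org"}}
is_valid = lambda value, user: user["email"] == value

entries = [("old@example.org", "u1"), ("new@example.org", "u1")]
assert trim_index(entries, users.get, is_valid) == [("new@example.org", "u1")]
```

In the real setup this pass would run as the M/R job mentioned above, with `load_entity` backed by point-gets against the main table.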
>>>>>>>>
>>>>>>>> For scanning along the index, we are using a chunk iterator; that
>>>>>>>> is, we read n index entries ahead and then do point lookups for
>>>>>>>> the entities. How would you avoid point-gets when scanning via an
>>>>>>>> index (as most likely, entities are ordered independently of the
>>>>>>>> index - hence the index)?
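The chunk iterator just described might look like this. It is a sketch under assumed names: `scan_index` and `get_entity` stand in for the real index scan and entity point-get.

```python
# Read n index entries ahead, then do point lookups for the entities,
# skipping stale index entries whose entity no longer exists.
from itertools import islice

def chunked_index_scan(scan_index, get_entity, chunk_size=100):
    """Yield entities in index order, fetching keys chunk_size at a time."""
    it = iter(scan_index())            # yields entity keys in index order
    while True:
        chunk = list(islice(it, chunk_size))
        if not chunk:
            break
        for key in chunk:              # point-gets for this chunk
            entity = get_entity(key)
            if entity is not None:     # skip stale index entries
                yield entity

# toy usage: "k3" is a stale index entry with no backing entity
store = {"k1": "a", "k2": "b", "k4": "d"}
keys = ["k1", "k2", "k3", "k4"]
assert list(chunked_index_scan(lambda: keys, store.get, chunk_size=2)) == ["a", "b", "d"]
```

The chunking only bounds read-ahead; each entity still costs one point-get, which is the overhead Lars's note below about index selectivity is concerned with.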
>>>>>>>>
>>>>>>>> Something really important to note is that there is no intention
>>>>>>>> to build a completely generic solution, in particular not (this
>>>>>>>> time - unlike the other post of mine you responded to) taking row
>>>>>>>> versioning into account. Instead, row timestamps are used to
>>>>>>>> delete stale entries (old entries after an index rebuild).
>>>>>>>>
>>>>>>>> Thanks a lot for your blog pointers. I haven't had time to study
>>>>>>>> them in depth, but at first glance there is a lot of overlap
>>>>>>>> between what you are proposing and what I ended up doing, at
>>>>>>>> least considering the first post.
>>>>>>>>
>>>>>>>> On the second post: indeed I have not worried too much about
>>>>>>>> transactional isolation of updates. If the index update and the
>>>>>>>> entity update use the same HBase timestamp, the result should at
>>>>>>>> least be consistent, right?
>>>>>>>>
>>>>>>>> Btw. in no way am I claiming originality of my thoughts - in
>>>>>>>> particular, I read
>>>>>>>> http://jyates.github.io/2012/07/09/consistent-enough-secondary-indexes.html
>>>>>>>> a while back.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Henning
>>>>>>>>
>>>>>>>> Ps.: I might write about this discussion later in my blog
>>>>>>>>
>>>>>>>>
>>>>>>>> On 22.12.2013 23:37, lars hofhansl wrote:
>>>>>>>>
>>>>>>>>> The devil is often in the details. On the surface it looks
>>>>>>>>> simple.
>>>>>>>>>
>>>>>>>>> How specifically are the stale indexes ignored? Are there
>>>>>>>>> guaranteed to be no races? Is deletion handled correctly? Does it
>>>>>>>>> work with multiple versions? What happens when the client dies
>>>>>>>>> halfway through an update? It's easy to do eventually consistent
>>>>>>>>> indexes. Truly consistent indexes without transactions are tricky.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Also, scanning an index and then doing point-gets against a main
>>>>>>>>> table is slow (unless the index is very selective; the Phoenix
>>>>>>>>> team measured that there is only an advantage if the index
>>>>>>>>> filters out 98-99% of the data). So then one would revert to
>>>>>>>>> covered indexes, and suddenly it is not so easy to detect stale
>>>>>>>>> index entries.
>>>>>>>>>
>>>>>>>>> I blogged about these issues here:
>>>>>>>>> http://hadoop-hbase.blogspot.com/2012/10/musings-on-secondary-indexes.html
>>>>>>>>> http://hadoop-hbase.blogspot.com/2012/10/secondary-indexes-part-ii.html
>>>>>>>>>
>>>>>>>>> Phoenix now has a (pretty involved) solution that works around
>>>>>>>>> the fact that HBase has no transactions.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> -- Lars
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> ________________________________
>>>>>>>>> From: Henning Blohm <henning.blohm@zfabrik.de>
>>>>>>>>> To: user <user@hbase.apache.org>
>>>>>>>>> Sent: Sunday, December 22, 2013 2:11 AM
>>>>>>>>> Subject: secondary index feature
>>>>>>>>>
>>>>>>>>> Lately we have added a secondary index feature to a persistence
>>>>>>>>> tier over HBase. Essentially we implemented what is described as
>>>>>>>>> "Dual-Write Secondary Index" in
>>>>>>>>> http://hbase.apache.org/book/secondary.indexes.html.
>>>>>>>>>
>>>>>>>>> I.e. while updating an entity, actually before writing the
>>>>>>>>> actual update, indexes are updated. Lookup via the index ignores
>>>>>>>>> stale entries. A recurring rebuild and clean-out of stale entries
>>>>>>>>> takes care that the indexes are trimmed and accurate.
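The dual-write flow described above can be sketched as a minimal in-memory model. Plain Python dicts stand in for the HBase main and index tables; every name here (`put`, `lookup`, `index_key`) is illustrative, not HBase API.

```python
# In-memory model of the "dual-write secondary index" pattern:
# write the index entry first, the entity last, and filter stale
# index entries on read by re-checking against the entity.

entities = {}   # main table:  entity_key -> entity dict
index = {}      # index table: (index_value, entity_key) -> True

def index_key(entity):
    # domain-specific: what makes an index key given an entity
    return entity["email"]

def put(entity_key, entity):
    # 1) write the index entry first ...
    index[(index_key(entity), entity_key)] = True
    # 2) ... then the entity last, so a client crash in between leaves
    #    at most a stale index entry, never a missing one
    entities[entity_key] = entity

def lookup(index_value):
    # scan the index, then validate each hit against the entity,
    # ignoring stale entries
    results = []
    for (value, entity_key) in index:
        if value != index_value:
            continue
        entity = entities.get(entity_key)
        if entity is not None and index_key(entity) == index_value:
            results.append(entity)
    return results

put("u1", {"email": "a@example.org"})
put("u1", {"email": "b@example.org"})   # leaves a stale entry for a@...
assert lookup("a@example.org") == []    # stale entry is ignored
assert lookup("b@example.org") == [{"email": "b@example.org"}]
```

The recurring rebuild/trim job mentioned above would then physically delete the stale `("a@example.org", "u1")` entry that `lookup` merely skips.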
>>>>>>>>>
>>>>>>>>> None of this was terribly complex to implement. In fact, it
>>>>>>>>> seemed like something you could do generically, maybe not at the
>>>>>>>>> HBase level itself, but as a toolbox / utility-style library.
>>>>>>>>>
>>>>>>>>> Is anybody on the list aware of anything useful already existing
>>>>>>>>> in that space?
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Henning Blohm
>>>>>>>>>
>>>>>>>>> *ZFabrik Software KG*
>>>>>>>>>
>>>>>>>>> T: +49 6227 3984255
>>>>>>>>> F: +49 6227 3984254
>>>>>>>>> M: +49 1781891820
>>>>>>>>>
>>>>>>>>> Lammstrasse 2, 69190 Walldorf
>>>>>>>>>
>>>>>>>>> henning.blohm@zfabrik.de <mailto:henning.blohm@zfabrik.de>
>>>>>>>>> Linkedin <http://www.linkedin.com/pub/henning-blohm/0/7b5/628>
>>>>>>>>> ZFabrik <http://www.zfabrik.de>
>>>>>>>>> Blog <http://www.z2-environment.net/blog>
>>>>>>>>> Z2-Environment <http://www.z2-environment.eu>
>>>>>>>>> Z2 Wiki <http://redmine.z2-environment.net>
>>>>>>>>>
>>>>>>>> --
>>>>>>>> Henning Blohm
>>>>>>>>
>>>>>>>>
>>>>>
>>>>> --
>>>>> Henning Blohm
>>>>>
>>>>
>>>> --
>>>> Henning Blohm
>>>>
>>>>
>>>
>>