hbase-user mailing list archives

From Henning Blohm <henning.bl...@zfabrik.de>
Subject Re: secondary index feature
Date Mon, 23 Dec 2013 19:28:13 GMT
James,

that is super interesting material!

Thanks,
Henning

On 23.12.2013 19:01, James Taylor wrote:
> Henning,
> Jesse Yates wrote the back-end of our global secondary indexing system in
> Phoenix. He designed it as a separate, pluggable module with no Phoenix
> dependencies. Here's an overview of the feature:
> https://github.com/forcedotcom/phoenix/wiki/Secondary-Indexing. The section
> that discusses the data guarantees and failure management might be of
> interest to you:
> https://github.com/forcedotcom/phoenix/wiki/Secondary-Indexing#data-guarantees-and-failure-management
>
> This presentation also gives a good overview of the pluggability of his
> implementation:
> http://files.meetup.com/1350427/PhoenixIndexing-SF-HUG_09-26-13.pptx
>
> Thanks,
> James
>
>
> On Mon, Dec 23, 2013 at 3:47 AM, Henning Blohm <henning.blohm@zfabrik.de> wrote:
>
>> Lars, that is exactly why I am hesitant to use one of the core-level generic
>> approaches (apart from having difficulty identifying the still-active
>> projects): I have doubts I can sufficiently explain to myself when and
>> where they fail.
>>
>> With "toolbox approach" I meant to say that turning entity data into index
>> data is not done generically but rather involves domain-specific
>> application code that
>>
>> - indicates what makes an index key given an entity
>> - indicates whether an index entry is still valid given an entity
>>
>> That code is also used during the index rebuild and trimming (an M/R Job)
>>
>> So checking whether an index entry is valid means loading the entity
>> pointed to and - before considering it a valid result - verifying that the
>> values of the entity still match the index.
>>
>> The entity is written last, hence when the client dies halfway through the
>> update you may get stale index entries but nothing else should break.
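A minimal sketch of the write order and read-time validation described above, using in-memory dicts as stand-ins for the entity table and the index table. The class `IndexedStore`, its method names, and the `crash_before_entity_write` flag are hypothetical illustrations, not part of the persistence tier discussed in this thread.

```python
from typing import Callable, Dict, List, Optional, Set, Tuple

Entity = Dict[str, str]


class IndexedStore:
    """In-memory stand-in for one entity table plus one index table (hypothetical)."""

    def __init__(self, index_key: Callable[[Entity], str]) -> None:
        self.index_key = index_key                 # domain code: entity -> index key
        self.entities: Dict[str, Entity] = {}      # row key -> entity
        self.index: Set[Tuple[str, str]] = set()   # (index key, row key)

    def is_valid(self, key: str, entity: Optional[Entity]) -> bool:
        # Domain code: does the entity still match this index entry?
        return entity is not None and self.index_key(entity) == key

    def put(self, row_key: str, entity: Entity,
            crash_before_entity_write: bool = False) -> None:
        # Index entry first, entity last.
        self.index.add((self.index_key(entity), row_key))
        if crash_before_entity_write:
            return  # simulates the client dying halfway through the update
        self.entities[row_key] = dict(entity)

    def lookup(self, key: str) -> List[str]:
        # Load the entity each entry points to and skip entries that no longer match.
        return [row_key for k, row_key in sorted(self.index)
                if k == key and self.is_valid(k, self.entities.get(row_key))]


store = IndexedStore(lambda e: e["city"])
store.put("u1", {"city": "Walldorf"})
store.put("u1", {"city": "Berlin"})   # leaves a stale ("Walldorf", "u1") entry
store.put("u2", {"city": "Paris"}, crash_before_entity_write=True)
```

In this sketch the stale and orphaned index entries remain in `store.index` until a later trimming pass, but `lookup` never returns them.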
>>
>> For scanning along the index, we are using a chunk iterator, that is, we
>> read n index entries ahead and then do point lookups for the entities. How
>> would you avoid point-gets when scanning via an index (as most likely,
>> entities are ordered independently from the index - hence the index)?
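The chunk iterator mentioned above could look roughly like the following sketch. The function `scan_via_index`, its parameters, and the sample data are assumptions made for illustration; against HBase, the per-chunk lookups would presumably be issued as one batched get rather than as dict accesses.

```python
from itertools import islice
from typing import Callable, Dict, Iterable, Iterator, Tuple

Entity = Dict[str, str]


def scan_via_index(index_entries: Iterable[Tuple[str, str]],
                   entities: Dict[str, Entity],
                   chunk_size: int,
                   is_valid: Callable[[str, Entity], bool]
                   ) -> Iterator[Tuple[str, Entity]]:
    """Yield (row key, entity) pairs in index order, skipping stale entries."""
    it = iter(index_entries)
    while True:
        chunk = list(islice(it, chunk_size))   # read n index entries ahead
        if not chunk:
            return
        for index_key, row_key in chunk:       # point lookups for this chunk
            entity = entities.get(row_key)
            if entity is not None and is_valid(index_key, entity):
                yield row_key, entity


# Index entries sorted by index key; ("Walldorf", "u1") is stale.
index = [("Berlin", "u1"), ("Berlin", "u3"), ("Paris", "u2"), ("Walldorf", "u1")]
entities = {"u1": {"city": "Berlin"}, "u2": {"city": "Paris"}, "u3": {"city": "Berlin"}}
rows = [row_key for row_key, _ in
        scan_via_index(index, entities, 2, lambda k, e: e["city"] == k)]
```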
>>
>> Something really important to note is that there is no intention to build
>> a completely generic solution, in particular not (this time - unlike the
>> other post of mine you responded to) taking row versioning into account.
>> Instead, row time stamps are used to delete stale entries (old entries
>> after an index rebuild).
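One possible reading of the timestamp-based clean-up, sketched under assumptions: the rebuild re-emits an index entry for every current entity, stamped with the rebuild's timestamp, and any entry still carrying an older timestamp is then treated as stale and deleted. The function `rebuild_and_trim` and the integer timestamps are hypothetical; the actual job is described in this thread only as an M/R job.

```python
from typing import Callable, Dict, Tuple

Entity = Dict[str, str]


def rebuild_and_trim(index: Dict[Tuple[str, str], int],
                     entities: Dict[str, Entity],
                     index_key: Callable[[Entity], str],
                     rebuild_ts: int) -> None:
    """Refresh entries for live entities, then drop entries older than the rebuild."""
    # Pass 1: re-emit an index entry for every entity at the rebuild timestamp.
    for row_key, entity in entities.items():
        index[(index_key(entity), row_key)] = rebuild_ts
    # Pass 2: whatever was not refreshed (or written later) is stale.
    for entry in [e for e, ts in index.items() if ts < rebuild_ts]:
        del index[entry]


# (index key, row key) -> timestamp of the index entry
index = {("Walldorf", "u1"): 100, ("Berlin", "u1"): 200, ("Paris", "u2"): 150}
entities = {"u1": {"city": "Berlin"}}
rebuild_and_trim(index, entities, lambda e: e["city"], rebuild_ts=300)
```

Entries written by concurrent updates after the rebuild started carry a timestamp of at least `rebuild_ts` in this sketch and therefore survive the trim.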
>>
>> Thanks a lot for your blog pointers. I haven't had time to study them in
>> depth, but at first glance there is a lot of overlap between what you are
>> proposing and what I ended up doing, considering the first post.
>>
>> On the second post: Indeed I have not worried too much about transactional
>> isolation of updates. If index update and entity update use the same HBase
>> time stamp, the result should at least be consistent, right?
>>
>> Btw. in no way am I claiming originality of my thoughts - in particular I
>> read http://jyates.github.io/2012/07/09/consistent-enough-secondary-indexes.html
>> a while back.
>>
>> Thanks,
>> Henning
>>
>> Ps.: I might write about this discussion later in my blog
>>
>>
>> On 22.12.2013 23:37, lars hofhansl wrote:
>>
>>> The devil is often in the details. On the surface it looks simple.
>>>
>>> How specifically are the stale indexes ignored? Is it guaranteed that
>>> there are no races?
>>> Is deletion handled correctly? Does it work with multiple versions?
>>> What happens when the client dies 1/2 way through an update?
>>> It's easy to do eventually consistent indexes. Truly consistent indexes
>>> without transactions are tricky.
>>>
>>>
>>> Also, scanning an index and then doing point-gets against a main table is
>>> slow (unless the index is very selective. The Phoenix team measured that
>>> there is only an advantage if the index filters out 98-99% of the data).
>>> So then one would revert to covered indexes, and suddenly it is not so easy
>>> to detect stale index entries.
>>>
>>> I blogged about these issues here:
>>> http://hadoop-hbase.blogspot.com/2012/10/musings-on-secondary-indexes.html
>>> http://hadoop-hbase.blogspot.com/2012/10/secondary-indexes-part-ii.html
>>>
>>> Phoenix has a (pretty involved) solution now that works around the fact
>>> that HBase has no transactions.
>>>
>>>
>>> -- Lars
>>>
>>>
>>>
>>> ________________________________
>>>    From: Henning Blohm <henning.blohm@zfabrik.de>
>>> To: user <user@hbase.apache.org>
>>> Sent: Sunday, December 22, 2013 2:11 AM
>>> Subject: secondary index feature
>>>
>>> Lately we have added a secondary index feature to a persistence tier
>>> over HBASE. Essentially we implemented what is described as "Dual-Write
>>> Secondary Index" in http://hbase.apache.org/book/secondary.indexes.html.
>>> I.e. while updating an entity, actually before writing the actual
>>> update, indexes are updated. Lookup via the index ignores stale entries.
>>> A recurring rebuild and clean-out of stale entries ensures that the
>>> indexes are trimmed and accurate.
>>>
>>> None of this was terribly complex to implement. In fact, it seemed like
>>> something you could do generically, maybe not on the HBASE level itself,
>>> but as a toolbox / utility style library.
>>>
>>> Is anybody on the list aware of anything useful already existing in that
>>> space?
>>>
>>> Thanks,
>>> Henning Blohm
>>>
>>> *ZFabrik Software KG*
>>>
>>> T:     +49 6227 3984255
>>> F:     +49 6227 3984254
>>> M:     +49 1781891820
>>>
>>> Lammstrasse 2 69190 Walldorf
>>>
>>> henning.blohm@zfabrik.de <mailto:henning.blohm@zfabrik.de>
>>> Linkedin <http://www.linkedin.com/pub/henning-blohm/0/7b5/628>
>>> ZFabrik <http://www.zfabrik.de>
>>> Blog <http://www.z2-environment.net/blog>
>>> Z2-Environment <http://www.z2-environment.eu>
>>> Z2 Wiki <http://redmine.z2-environment.net>
>>>
>>
>> --
>> Henning Blohm


-- 
Henning Blohm


