hbase-user mailing list archives

From Henning Blohm <henning.bl...@zfabrik.de>
Subject Re: secondary index feature
Date Sat, 04 Jan 2014 18:32:05 GMT
Thanks, James! I have some Phoenix-specific questions. I suppose the
Phoenix group is a better place to discuss those, though.

Henning

On 03.01.2014 22:34, James Taylor wrote:
> No worries, Henning. It's a little deceiving, because the coprocessors that
> do the index maintenance are invoked on a per region basis. However, the
> writes/puts that they do for the maintenance end up going over the wire if
> necessary.
>
> Let me know if you have other questions. It'd be good to understand your
> use case more to see if Phoenix is a good fit - we're definitely open to
> collaborating. FYI, we're in the process of moving to Apache, so will keep
> you posted once the transition is complete.
>
> Thanks,
>
> James
>
>
> On Fri, Jan 3, 2014 at 1:11 PM, Henning Blohm <henning.blohm@zfabrik.de> wrote:
>
>> Hi James,
>>
>> this is a little embarrassing... I even browsed through the code and read
>> it as implementing a region level index.
>>
>> But now at least I get the restrictions mentioned for using the covered
>> indexes.
>>
>> Thanks for clarifying. Guess I need to browse the code a little harder ;-)
>>
>> Henning
>>
>>
>> On 03.01.2014 21:53, James Taylor wrote:
>>
>>> Hi Henning,
>>> Phoenix maintains a global index. It is essentially maintaining another
>>> HBase table for you with a different row key (and a subset of your data
>>> table columns that are "covered"). When an index is used by Phoenix, it is
>>> *exactly* like querying a data table (that's what Phoenix does - it ends
>>> up issuing a Phoenix query against a Phoenix table that happens to be an
>>> index table).
>>>
>>> The hit you take for a global index is at write time - we need to look up
>>> the prior state of the rows being updated to do the index maintenance.
>>> Then we need to do a write to the index table. The upside is that there's
>>> no hit at read/query time (we don't yet attempt to join from the index
>>> table back to the data table - if a query is using columns that aren't in
>>> the index, it simply won't be used). More here:
>>> https://github.com/forcedotcom/phoenix/wiki/Secondary-Indexing
>>>
>>> Thanks,
>>> James
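The write-time cost and the "covered columns" restriction described above can be illustrated with a toy in-memory model. This is only a sketch under stated assumptions, not Phoenix code: the class `GlobalIndexedTable` and its methods are hypothetical names, and plain dictionaries stand in for the HBase data table and the index table.

```python
class GlobalIndexedTable:
    """Toy model: one data table plus one global index table (both dicts)."""

    def __init__(self, indexed_column, covered_columns=()):
        self.indexed_column = indexed_column
        self.covered_columns = tuple(covered_columns)
        self.data = {}   # row key -> {column: value}
        self.index = {}  # (indexed value, row key) -> covered column values

    def put(self, row_key, columns):
        # Write-time hit: read the prior row state so the stale index
        # entry can be removed before the new one is written.
        prior = self.data.get(row_key)
        if prior is not None and self.indexed_column in prior:
            self.index.pop((prior[self.indexed_column], row_key), None)
        merged = dict(prior or {})
        merged.update(columns)
        self.data[row_key] = merged
        if self.indexed_column in merged:
            covered = {c: merged[c] for c in self.covered_columns if c in merged}
            self.index[(merged[self.indexed_column], row_key)] = covered

    def query_by_index(self, value, wanted_columns):
        # No join back to the data table: if the query needs a column the
        # index does not cover, the index is simply not usable (None).
        usable = set(self.covered_columns) | {self.indexed_column}
        if not set(wanted_columns) <= usable:
            return None
        results = []
        for indexed_value, row_key in sorted(self.index):
            if indexed_value == value:
                row = {self.indexed_column: indexed_value}
                row.update(self.index[(indexed_value, row_key)])
                results.append(
                    (row_key, {c: row[c] for c in wanted_columns if c in row})
                )
        return results
```

In this model a query served by the index never touches `data`, which mirrors the "no hit at read/query time" point, while every `put` pays for a read of the prior row plus a second write.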
>>>
>>>
>>> On Fri, Jan 3, 2014 at 12:46 PM, Henning Blohm <henning.blohm@zfabrik.de>
>>> wrote:
>>>
>>>> When scanning in order of an index and you use RLI, it seems, there is no
>>>> alternative but to involve all regions - and essentially this should
>>>> happen in parallel as otherwise you might not get what you wanted. Also,
>>>> for a single Get, it seems (as Lars pointed out in
>>>> https://issues.apache.org/jira/browse/HBASE-2038) that you have to
>>>> consult all regions.
>>>>
>>>> When that parallelism is no problem (small number of servers) it will
>>>> actually help single scan performance (regions can provide their share in
>>>> parallel).
>>>>
>>>> A high number of concurrent client requests leads to the same number of
>>>> requests on all regions and a multiple of that number of connections to
>>>> be maintained by the client.
>>>>
>>>> My assumption is that that will eventually lead to a scalability problem
>>>> - when, say, having 100 region servers or so in place. I was wondering
>>>> if anyone has experience with that.
>>>>
>>>> That will be perfectly acceptable for many use cases that benefit from
>>>> the scan (and hence query) performance more than they suffer from the
>>>> load problem. Other use cases have fewer requirements on scans and query
>>>> flexibility but rather want to preserve the quality that a Get has fixed
>>>> resource usage.
>>>>
>>>> Btw.: I was convinced that Phoenix is keeping indexes on the region
>>>> level. Is that not so?
>>>>
>>>> Thanks,
>>>> Henning
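The scalability concern raised above can be made concrete with a back-of-the-envelope model. This is a rough sketch only: the function names and scheme labels are hypothetical, and it assumes uniformly distributed data and requests while ignoring caching, batching, and network effects.

```python
def regions_contacted_per_lookup(num_regions, scheme):
    """Regions contacted by one indexed point lookup in this simple model."""
    if scheme == "region_local":
        # Each region indexes only its own rows, so all regions are asked.
        return num_regions
    if scheme == "global_covered":
        # One read against the index table; the data table is not touched.
        return 1
    if scheme == "global_with_get":
        # One read against the index table, then one Get on the data table.
        return 2
    raise ValueError(f"unknown scheme: {scheme}")


def average_load_per_region(client_requests, num_regions, scheme):
    """Average requests each region serves for a batch of client lookups."""
    total = client_requests * regions_contacted_per_lookup(num_regions, scheme)
    return total / num_regions
```

Under these assumptions, 1000 lookups against 100 regions put 1000 requests on every region with a region-local index, no matter how many regions are added, while a covered global index averages 10 requests per region.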
>>>>
>>>>
>>>> On 03.01.2014 17:57, Anoop John wrote:
>>>>
>>>>> In the case of a normal HBase scan, as we know, regions will be scanned
>>>>> sequentially. Phoenix has parallel scan implementations in it. When RLI
>>>>> is used and we make use of the index completely at the server side, it
>>>>> is independent of how the client scans. Sequential or parallel, using
>>>>> Java or any other client layer or using a SQL layer like Phoenix, using
>>>>> MR or not... the client side doesn't have to worry about this; the
>>>>> index usage will be fully at the server end.
>>>>>
>>>>> Yes, when a parallel scan is done on regions, RLI might perform much
>>>>> better.
>>>>>
>>>>> -Anoop-
>>>>>
>>>>> On Fri, Jan 3, 2014 at 7:35 PM, rajeshbabu chintaguntla <
>>>>> rajeshbabu.chintaguntla@huawei.com> wrote:
>>>>>
>>>>>> No, the regions are scanned sequentially.
>>>>>>
>>>>>> ________________________________________
>>>>>> From: Asaf Mesika [asaf.mesika@gmail.com]
>>>>>> Sent: Friday, January 03, 2014 7:26 PM
>>>>>> To: user@hbase.apache.org
>>>>>> Subject: Re: secondary index feature
>>>>>>
>>>>>> Are the regions scanned in parallel?
>>>>>>
>>>>>> On Friday, January 3, 2014, rajeshbabu chintaguntla wrote:
>>>>>>
>>>>>>> Here are some performance numbers with RLI.
>>>>>>
>>>>>>> No. of region servers: 4
>>>>>>> Data per region: 2 GB
>>>>>>>
>>>>>>> Regions/RS | Total regions | Block size (KB) | No. of rows matching values | Time taken (sec)
>>>>>>>  50        | 200           | 64              | 199                         | 102
>>>>>>>  50        | 200           |  8              | 199                         |  35
>>>>>>> 100        | 400           |  8              | 350                         |  95
>>>>>>> 200        | 800           |  8              | 353                         | 153
>>>>>>>
>>>>>>> Without the secondary index, the scan takes hours.
>>>>>>>
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Rajeshbabu
>>>>>>> ________________________________________
>>>>>>> From: Anoop John [anoop.hbase@gmail.com]
>>>>>>> Sent: Friday, January 03, 2014 3:22 PM
>>>>>>> To: user@hbase.apache.org
>>>>>>> Subject: Re: secondary index feature
>>>>>>>
>>>>>>>> Is there any data on how RLI (or in particular Phoenix) query
>>>>>>>> throughput correlates with the number of region servers assuming
>>>>>>>> homogeneously distributed data?
>>>>>>>
>>>>>>> Phoenix is yet to add RLI. For now it has global indexing only.
>>>>>>> Correct, James?
>>>>>>>
>>>>>>> The RLI impl from Huawei (HIndex) has some numbers wrt regions, but I
>>>>>>> doubt whether they cover a large number of RSs. Do you have some
>>>>>>> data, Rajesh Babu?
>>>>>>>
>>>>>>> -Anoop-
>>>>>>>
>>>>>>> On Fri, Jan 3, 2014 at 3:11 PM, Henning Blohm <
>>>>>>> henning.blohm@zfabrik.de> wrote:
>>>>>>>
>>>>>>>> Jesse, James, Lars,
>>>>>>>>
>>>>>>>> after looking around a bit and in particular looking into Phoenix
>>>>>>>> (which I find very interesting), assuming that you want a secondary
>>>>>>>> indexing on HBASE without adding other infrastructure, there seems
>>>>>>>> to be not a lot of choice really: Either go with a region-level (and
>>>>>>>> co-processor based) indexing feature (Phoenix, Huawei, is IHBase
>>>>>>>> dead?) or add an index table to store (index value, entity key)
>>>>>>>> pairs.
>>>>>>>>
>>>>>>>> The main concern I have with region-level indexing (RLI) is that
>>>>>>>> Gets potentially require to visit all regions. Compared to global
>>>>>>>> index tables this seems to flatten the read-scalability curve of the
>>>>>>>> cluster. In our case, we have a large data set (hence HBASE) that
>>>>>>>> will be queried (mostly point-gets via an index) in some linear
>>>>>>>> correlation with its size.
>>>>>>>>
>>>>>>>> Is there any data on how RLI (or in particular Phoenix) query
>>>>>>>> throughput correlates with the number of region servers assuming
>>>>>>>> homogeneously distributed data?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Henning
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On 24.12.2013 12:18, Henning Blohm wrote:
>>>>>>>>
>>>>>>>>> All that sounds very promising. I will give it a try and let you
>>>>>>>>> know how things worked out.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Henning
>>>>>>>>>
>>>>>>>>> On 12/23/2013 08:10 PM, Jesse Yates wrote:
>>>>>>>>>
>>>>>>>>>> The work that James is referencing grew out of the discussions Lars
>>>>>>>>>> and I had (which led to those blog posts). The solution we
>>>>>>>>>> implemented is designed to be generic, as James mentioned above, but
>>>>>>>>>> was written with all the hooks necessary for Phoenix to do some
>>>>>>>>>> really fast updates (or skipping updates in the case where there is
>>>>>>>>>> no change).
>>>>>>>>>>
>>>>>>>>>> You should be able to plug in your own simple index builder (there
>>>>>>>>>> is an example in the phoenix codebase
>>>>>>>>>> <https://github.com/forcedotcom/phoenix/tree/master/src/main/java/com/salesforce/hbase/index/covered/example>)
>>>>>>>>>> to get a basic solution which supports the same transactional
>>>>>>>>>> guarantees as HBase (per row) + data guarantees across the index
>>>>>>>>>> rows. There are more details in the presentations James linked.
>>>>>>>>>>
>>>>>>>>>> I'd love to see if your implementation can fit into the framework
>>>>>>>>>> we wrote - we would be happy to work to see if it needs some more
>>>>>>>>>> hooks or modifications - I have a feeling this is pretty much what
>>>>>>>>>> you guys will need
>>>>>>>>>>
>>>>>>>>>> -Jesse
>>>>>>>>>>
>>>>>>>>>> On Mon, Dec 23, 2013 at 10:01 AM, James Taylor
>>>>>>>>>> <jtaylor@salesforce.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Henning,
>>>>>>>>>>> Jesse Yates wrote the back-end of our global secondary indexing
>>>>>>>>>>> system in Phoenix. He designed it as a separate, pluggable module
>>>>>>>>>>> with no Phoenix dependencies. Here's an overview of the feature:
>>>>>>>>>>> https://github.com/forcedotcom/phoenix/wiki/Secondary-Indexing.
>>>>>>>>>>> The section that discusses the data guarantees and failure
>>>>>>>>>>> management might be of interest to you:
>>>>>>>>>>> https://github.com/forcedotcom/phoenix/wiki/Secondary-Indexing#data-guarantees-and-failure-management
>>>>>>>>>>>
>>>>>>>>>>> This presentation also gives a good overview of the pluggability
>>>>>>>>>>> of his
>>>>
>>>> --
>>>> Henning Blohm
>>>>
>>>> *ZFabrik Software KG*
>>>>
>>>> T:      +49 6227 3984255
>>>> F:      +49 6227 3984254
>>>> M:      +49 1781891820
>>>>
>>>> Lammstrasse 2 69190 Walldorf
>>>>
>>>> henning.blohm@zfabrik.de <mailto:henning.blohm@zfabrik.de>
>>>> Linkedin <http://www.linkedin.com/pub/henning-blohm/0/7b5/628>
>>>> ZFabrik <http://www.zfabrik.de>
>>>> Blog <http://www.z2-environment.net/blog>
>>>> Z2-Environment <http://www.z2-environment.eu>
>>>> Z2 Wiki <http://redmine.z2-environment.net>
>>>>
>>>>
>>>>


-- 
Henning Blohm

*ZFabrik Software KG*

T: 	+49 6227 3984255
F: 	+49 6227 3984254
M: 	+49 1781891820

Lammstrasse 2 69190 Walldorf

henning.blohm@zfabrik.de <mailto:henning.blohm@zfabrik.de>
Linkedin <http://www.linkedin.com/pub/henning-blohm/0/7b5/628>
ZFabrik <http://www.zfabrik.de>
Blog <http://www.z2-environment.net/blog>
Z2-Environment <http://www.z2-environment.eu>
Z2 Wiki <http://redmine.z2-environment.net>

