hbase-user mailing list archives

From Henning Blohm <henning.bl...@zfabrik.de>
Subject Re: secondary index feature
Date Fri, 03 Jan 2014 21:11:19 GMT
Hi James,

this is a little embarrassing... I even browsed through the code and read 
it as implementing a region level index.

But now at least I get the restrictions mentioned for using the covered 
indexes.

Thanks for clarifying. Guess I need to browse the code a little harder ;-)

Henning

On 03.01.2014 21:53, James Taylor wrote:
> Hi Henning,
> Phoenix maintains a global index. It is essentially maintaining another
> HBase table for you with a different row key (and a subset of your data
> table columns that are "covered"). When an index is used by Phoenix, it is
> *exactly* like querying a data table (that's what Phoenix does - it ends up
> issuing a Phoenix query against a Phoenix table that happens to be an index
> table).
>
> The hit you take for a global index is at write time - we need to look up
> the prior state of the rows being updated to do the index maintenance. Then
> we need to do a write to the index table. The upside is that there's no hit
> at read/query time (we don't yet attempt to join from the index table back
> to the data table - if a query is using columns that aren't in the index,
> it simply won't be used). More here:
> https://github.com/forcedotcom/phoenix/wiki/Secondary-Indexing
>
> Thanks,
> James
>
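
To make the write-time cost described above concrete, here is a rough toy model in plain Python. It is only a sketch under my own assumptions: the class and method names are made up, nothing here is Phoenix or HBase API, and dictionaries stand in for the data table and the index table. It shows the prior-state lookup, the index write, and the rule that the index is skipped when a query needs columns that are not covered.

```python
# Toy in-memory model of a global, covered secondary index.
# Hypothetical names throughout; this is NOT Phoenix or HBase code.

class GlobalIndexedTable:
    def __init__(self, indexed_col, covered_cols):
        self.indexed_col = indexed_col
        self.covered_cols = tuple(covered_cols)
        self.data = {}    # data table: row key -> {column: value}
        self.index = {}   # index table: (indexed value, row key) -> covered columns

    def put(self, row_key, columns):
        # Write-time cost 1: look up the prior state of the row.
        prior = self.data.get(row_key)
        if prior is not None and self.indexed_col in prior:
            # Remove the index row that belonged to the old value.
            self.index.pop((prior[self.indexed_col], row_key), None)
        merged = dict(prior or {})
        merged.update(columns)
        self.data[row_key] = merged
        # Write-time cost 2: write the new row into the index table.
        if self.indexed_col in merged:
            covered = {c: merged[c] for c in self.covered_cols if c in merged}
            self.index[(merged[self.indexed_col], row_key)] = covered

    def query(self, value, wanted_cols):
        """Return (rows, used_index) for rows whose indexed column equals value."""
        servable = set(self.covered_cols) | {self.indexed_col}
        if set(wanted_cols) <= servable:
            # The index table alone answers the query; no join back to the data table.
            # (A real table would do a prefix range scan instead of this linear pass.)
            rows = []
            for key in sorted(self.index):
                indexed_value, row_key = key
                if indexed_value == value:
                    full = dict(self.index[key])
                    full[self.indexed_col] = indexed_value
                    rows.append((row_key, {c: full.get(c) for c in wanted_cols}))
            return rows, True
        # A requested column is not covered: the index is simply not used.
        rows = [(k, {c: self.data[k].get(c) for c in wanted_cols})
                for k in sorted(self.data)
                if self.data[k].get(self.indexed_col) == value]
        return rows, False


table = GlobalIndexedTable("city", ["name"])
table.put("r1", {"city": "Berlin", "name": "Ann", "age": 30})
table.put("r2", {"city": "Paris", "name": "Bob", "age": 41})
table.put("r1", {"city": "Paris"})  # update: old index row removed, new one written
covered_rows, covered_used = table.query("Paris", ["name"])
uncovered_rows, uncovered_used = table.query("Paris", ["age"])
```

In this model every put pays for one read and up to two index mutations, while a covered query touches only the index dictionary.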
>
> On Fri, Jan 3, 2014 at 12:46 PM, Henning Blohm <henning.blohm@zfabrik.de> wrote:
>
>> When scanning in order of an index and you use RLI, it seems, there is no
>> alternative but to involve all regions - and essentially this should happen
>> in parallel as otherwise you might not get what you wanted. Also, for a
>> single Get, it seems (as Lars pointed out in
>> https://issues.apache.org/jira/browse/HBASE-2038) that you have to consult
>> all regions.
>>
>> When that parallelism is no problem (small number of servers) it will
>> actually help single scan performance (regions can provide their share in
>> parallel).
>>
>> A high number of concurrent client requests leads to the same number of
>> requests on all regions, and a multiple of that in connections to be
>> maintained by the client.
>>
>> My assumption is that this will eventually lead to a scalability problem
>> when, say, 100 region servers or so are in place. I was wondering if
>> anyone has experience with that.
>>
>> That will be perfectly acceptable for many use cases that benefit from the
>> scan (and hence query) performance more than they suffer from the load
>> problem. Other use cases have fewer requirements on scans and query
>> flexibility but rather want to preserve the quality that a Get has fixed
>> resource usage.
>>
>> Btw.: I was convinced that Phoenix is keeping indexes on the region level.
>> Is that not so?
>>
>> Thanks,
>> Henning
>>
>>
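
The fan-out concern above comes down to simple arithmetic. Here is a small sketch of it in Python, under my own simplifying assumptions: a point lookup through a region-level index has to ask every region, a lookup through a global covered index is counted as a single request against the index table, and the function name and the sample numbers are purely illustrative.

```python
def region_requests(num_regions, concurrent_gets, index_kind):
    """Count region requests caused by point lookups via a secondary index."""
    if index_kind == "region-level":
        # Any region may hold a matching row, so each Get consults all of them.
        per_get = num_regions
    elif index_kind == "global":
        # Assumed: one request to the index table, no hop back to the data table.
        per_get = 1
    else:
        raise ValueError("unknown index kind: " + index_kind)
    return per_get * concurrent_gets


# Illustrative numbers only, not measurements.
rli_load = region_requests(num_regions=800, concurrent_gets=10, index_kind="region-level")
global_load = region_requests(num_regions=800, concurrent_gets=10, index_kind="global")
```

With these inputs the region-level case issues 8000 region requests against 10 for the global case, which is the flattening of the read-scalability curve I am worried about.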
>> On 03.01.2014 17:57, Anoop John wrote:
>>
>>> In the case of a normal HBase scan, as we know, regions will be scanned
>>> sequentially. Phoenix has parallel scan implementations in it. When RLI is
>>> used and we make use of the index completely at the server side, it is
>>> independent of how the client scans. Sequential or parallel, using Java or
>>> any other client layer or using a SQL layer like Phoenix, using MR or
>>> not... the client side doesn't have to worry about this; the index usage
>>> happens fully at the server end.
>>>
>>> Yes, when a parallel scan is done on regions, RLI might perform much better.
>>>
>>> -Anoop-
>>>
>>> On Fri, Jan 3, 2014 at 7:35 PM, rajeshbabu chintaguntla <
>>> rajeshbabu.chintaguntla@huawei.com> wrote:
>>>
>>>> No, the regions are scanned sequentially.
>>>> ________________________________________
>>>> From: Asaf Mesika [asaf.mesika@gmail.com]
>>>> Sent: Friday, January 03, 2014 7:26 PM
>>>> To: user@hbase.apache.org
>>>> Subject: Re: secondary index feature
>>>>
>>>> Are the regions scanned in parallel?
>>>>
>>>> On Friday, January 3, 2014, rajeshbabu chintaguntla wrote:
>>>>
>>>>> Here are some performance numbers with RLI.
>>>>> No. of region servers: 4
>>>>> Data per region: 2 GB
>>>>>
>>>>> Regions/RS | Total regions | Block size (KB) | Rows matching values | Time taken (sec)
>>>>>         50 |           200 |              64 |                  199 |              102
>>>>>         50 |           200 |               8 |                  199 |               35
>>>>>        100 |           400 |               8 |                  350 |               95
>>>>>        200 |           800 |               8 |                  353 |              153
>>>>>
>>>>> Without a secondary index, the scan takes hours.
>>>>>
>>>>>
>>>>> Thanks,
>>>>> Rajeshbabu
>>>>> ________________________________________
>>>>> From: Anoop John [anoop.hbase@gmail.com]
>>>>> Sent: Friday, January 03, 2014 3:22 PM
>>>>> To: user@hbase.apache.org
>>>>> Subject: Re: secondary index feature
>>>>>
>>>>>> Is there any data on how RLI (or in particular Phoenix) query
>>>>>> throughput correlates with the number of region servers assuming
>>>>>> homogeneously distributed data?
>>>>>
>>>>> Phoenix has yet to add RLI. For now it has global indexing only.
>>>>> Correct, James?
>>>>>
>>>>> The RLI implementation from Huawei (HIndex) has some numbers with respect
>>>>> to regions, but I doubt whether they cover a large number of region
>>>>> servers. Do you have some data, Rajesh Babu?
>>>>>
>>>>> -Anoop-
>>>>>
>>>>> On Fri, Jan 3, 2014 at 3:11 PM, Henning Blohm <henning.blohm@zfabrik.de>
>>>>> wrote:
>>>>>
>>>>>> Jesse, James, Lars,
>>>>>>
>>>>>> after looking around a bit and in particular looking into Phoenix
>>>>>> (which I find very interesting), assuming that you want secondary
>>>>>> indexing on HBASE without adding other infrastructure, there seems to
>>>>>> be not a lot of choice really: Either go with a region-level (and
>>>>>> co-processor based) indexing feature (Phoenix, Huawei, is IHBase dead?)
>>>>>> or add an index table to store (index value, entity key) pairs.
>>>>>>
>>>>>> The main concern I have with region-level indexing (RLI) is that Gets
>>>>>> potentially have to visit all regions. Compared to global index
>>>>>> tables, this seems to flatten the read-scalability curve of the cluster.
>>>>>> In our case, we have a large data set (hence HBASE) that will be queried
>>>>>> (mostly point-gets via an index) in some linear correlation with its
>>>>>> size.
>>>>>>
>>>>>> Is there any data on how RLI (or in particular Phoenix) query
>>>>>> throughput correlates with the number of region servers assuming
>>>>>> homogeneously distributed data?
>>>>>>
>>>>>> Thanks,
>>>>>> Henning
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 24.12.2013 12:18, Henning Blohm wrote:
>>>>>>
>>>>>>> All that sounds very promising. I will give it a try and let you
>>>>>>> know how things worked out.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Henning
>>>>>>>
>>>>>>> On 12/23/2013 08:10 PM, Jesse Yates wrote:
>>>>>>>
>>>>>>>> The work that James is referencing grew out of the discussions Lars
>>>>>>>> and I had (which led to those blog posts). The solution we implemented
>>>>>>>> is designed to be generic, as James mentioned above, but was written
>>>>>>>> with all the hooks necessary for Phoenix to do some really fast
>>>>>>>> updates (or skip updates in the case where there is no change).
>>>>>>>>
>>>>>>>> You should be able to plug in your own simple index builder (there is
>>>>>>>> an example in the phoenix codebase
>>>>>>>> <https://github.com/forcedotcom/phoenix/tree/master/src/main/java/com/salesforce/hbase/index/covered/example>)
>>>>>>>> to get a basic solution which supports the same transactional
>>>>>>>> guarantees as HBase (per row) + data guarantees across the index rows.
>>>>>>>> There are more details in the presentations James linked.
>>>>>>>>
>>>>>>>> I'd love to see if your implementation can fit into the framework we
>>>>>>>> wrote - we would be happy to work to see if it needs some more hooks
>>>>>>>> or modifications - I have a feeling this is pretty much what you guys
>>>>>>>> will need
>>>>>>>>
>>>>>>>> -Jesse
>>>>>>>>
>>>>>>>>
>>>>>>>> On Mon, Dec 23, 2013 at 10:01 AM, James Taylor
>>>>>>>> <jtaylor@salesforce.com> wrote:
>>>>>>>>
>>>>>>>>> Henning,
>>>>>>>>>
>>>>>>>>> Jesse Yates wrote the back-end of our global secondary indexing
>>>>>>>>> system in Phoenix. He designed it as a separate, pluggable module
>>>>>>>>> with no Phoenix dependencies. Here's an overview of the feature:
>>>>>>>>> https://github.com/forcedotcom/phoenix/wiki/Secondary-Indexing. The
>>>>>>>>> section that discusses the data guarantees and failure management
>>>>>>>>> might be of interest to you:
>>>>>>>>>
>>>>>>>>>   https://github.com/forcedotcom/phoenix/wiki/Secondary-Indexing#data-guarantees-and-failure-management
>>>>>>>>>
>>>>>>>>> This presentation also gives a good overview of the pluggability of
>>>>>>>>> his
>>>>>
>> --
>> Henning Blohm
>>
>> *ZFabrik Software KG*
>>
>> T:      +49 6227 3984255
>> F:      +49 6227 3984254
>> M:      +49 1781891820
>>
>> Lammstrasse 2 69190 Walldorf
>>
>> henning.blohm@zfabrik.de <mailto:henning.blohm@zfabrik.de>
>> Linkedin <http://www.linkedin.com/pub/henning-blohm/0/7b5/628>
>> ZFabrik <http://www.zfabrik.de>
>> Blog <http://www.z2-environment.net/blog>
>> Z2-Environment <http://www.z2-environment.eu>
>> Z2 Wiki <http://redmine.z2-environment.net>
>>
>>


-- 
Henning Blohm

*ZFabrik Software KG*

T: 	+49 6227 3984255
F: 	+49 6227 3984254
M: 	+49 1781891820

Lammstrasse 2 69190 Walldorf

henning.blohm@zfabrik.de <mailto:henning.blohm@zfabrik.de>
Linkedin <http://www.linkedin.com/pub/henning-blohm/0/7b5/628>
ZFabrik <http://www.zfabrik.de>
Blog <http://www.z2-environment.net/blog>
Z2-Environment <http://www.z2-environment.eu>
Z2 Wiki <http://redmine.z2-environment.net>

