hbase-user mailing list archives

From James Taylor <jtay...@salesforce.com>
Subject Re: secondary index feature
Date Fri, 03 Jan 2014 21:34:34 GMT
No worries, Henning. It's a little deceiving, because the coprocessors that
do the index maintenance are invoked on a per region basis. However, the
writes/puts that they do for the maintenance end up going over the wire if
necessary.
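
To make the distinction concrete, here is a minimal Python sketch of global index maintenance as described in this thread. It is a toy in-memory model, not Phoenix or HBase code: the names `Table`, `GloballyIndexedTable`, `upsert`, `query`, the `\x00` key separator and the split points are all assumptions made for illustration. It shows the three points under discussion: the write path reads the prior row state and then writes to a separate index table, the index row's region is determined by the index key (so it may live somewhere other than the data row's region), and the index only serves a query when every requested column is covered.

```python
from bisect import bisect_right

SEP = "\x00"  # toy separator between indexed value and data row key


class Table:
    """Toy sorted table split into regions by row-key range."""

    def __init__(self, split_points):
        self.split_points = sorted(split_points)
        self.regions = [{} for _ in range(len(self.split_points) + 1)]

    def region_of(self, row_key):
        # Region i holds keys in [split[i-1], split[i]).
        return bisect_right(self.split_points, row_key)

    def put(self, row_key, columns):
        self.regions[self.region_of(row_key)][row_key] = dict(columns)

    def get(self, row_key):
        return self.regions[self.region_of(row_key)].get(row_key)

    def delete(self, row_key):
        self.regions[self.region_of(row_key)].pop(row_key, None)


class GloballyIndexedTable:
    """Data table plus a separate index table keyed by (value, row key)."""

    def __init__(self, data_splits, index_splits, indexed_col, covered_cols):
        self.data = Table(data_splits)
        self.index = Table(index_splits)
        self.indexed_col = indexed_col
        self.covered_cols = list(covered_cols)

    def index_key(self, value, row_key):
        return f"{value}{SEP}{row_key}"

    def upsert(self, row_key, columns):
        # Write-time cost 1: look up the prior state of the row.
        prior = self.data.get(row_key)
        if prior is not None and self.indexed_col in prior:
            old_key = self.index_key(prior[self.indexed_col], row_key)
            self.index.delete(old_key)
        merged = {**(prior or {}), **columns}
        self.data.put(row_key, merged)
        # Write-time cost 2: write the index row. Its region depends on the
        # index key, not on the region of the data row.
        if self.indexed_col in merged:
            covered = {c: merged[c] for c in self.covered_cols if c in merged}
            new_key = self.index_key(merged[self.indexed_col], row_key)
            self.index.put(new_key, covered)

    def query(self, value, wanted_cols):
        # No join back to the data table: uncovered columns mean the
        # index is not used at all (signalled here by returning None).
        if not set(wanted_cols) <= set(self.covered_cols):
            return None
        start, stop = value + SEP, value + "\x01"
        first = self.index.region_of(start)
        last = self.index.region_of(stop)
        rows = []
        # Only index regions overlapping the key range are touched.
        for region in self.index.regions[first:last + 1]:
            for key in sorted(region):
                if start <= key < stop:
                    rows.append((key.split(SEP, 1)[1], region[key]))
        return rows


t = GloballyIndexedTable(["r5"], ["m"], indexed_col="name", covered_cols=["city"])
t.upsert("r1", {"name": "alice", "city": "Oslo", "age": 30})
t.upsert("r2", {"name": "zed", "city": "Rome", "age": 41})
t.upsert("r1", {"name": "bob"})  # moves the index row from "alice" to "bob"
print(t.query("bob", ["city"]))
```

In this toy setup both data rows sit in the same data region, yet their index rows land in different index regions. That is the sense in which the maintenance is triggered per region while the resulting index writes may have to travel to another region.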

Let me know if you have other questions. It'd be good to understand your
use case more to see if Phoenix is a good fit - we're definitely open to
collaborating. FYI, we're in the process of moving to Apache, so will keep
you posted once the transition is complete.

Thanks,

James


On Fri, Jan 3, 2014 at 1:11 PM, Henning Blohm <henning.blohm@zfabrik.de> wrote:

> Hi James,
>
> this is a little embarrassing... I even browsed through the code and read
> it as implementing a region level index.
>
> But now at least I get the restrictions mentioned for using the covered
> indexes.
>
> Thanks for clarifying. Guess I need to browse the code a little harder ;-)
>
> Henning
>
>
> On 03.01.2014 21:53, James Taylor wrote:
>
>> Hi Henning,
>> Phoenix maintains a global index. It is essentially maintaining another
>> HBase table for you with a different row key (and a subset of your data
>> table columns that are "covered"). When an index is used by Phoenix, it is
>> *exactly* like querying a data table (that's what Phoenix does - it ends up
>> issuing a Phoenix query against a Phoenix table that happens to be an index
>> table).
>>
>> The hit you take for a global index is at write time - we need to look up
>> the prior state of the rows being updated to do the index maintenance. Then
>> we need to do a write to the index table. The upside is that there's no hit
>> at read/query time (we don't yet attempt to join from the index table back
>> to the data table - if a query is using columns that aren't in the index,
>> the index simply won't be used). More here:
>> https://github.com/forcedotcom/phoenix/wiki/Secondary-Indexing
>>
>> Thanks,
>> James
>>
>>
>> On Fri, Jan 3, 2014 at 12:46 PM, Henning Blohm <henning.blohm@zfabrik.de>
>> wrote:
>>
>>> When scanning in order of an index and you use RLI, it seems, there is no
>>> alternative but to involve all regions - and essentially this should happen
>>> in parallel as otherwise you might not get what you wanted. Also, for a
>>> single Get, it seems (as Lars pointed out in
>>> https://issues.apache.org/jira/browse/HBASE-2038) that you have to consult
>>> all regions.
>>>
>>> When that parallelism is no problem (small number of servers) it will
>>> actually help single scan performance (regions can provide their share in
>>> parallel).
>>>
>>> A high number of concurrent client requests leads to the same number of
>>> requests on all regions, and a multiple of that in connections to be
>>> maintained by the client.
>>>
>>> My assumption is that this will eventually lead to a scalability problem
>>> when, say, having 100 region servers or so in place. I was wondering if
>>> anyone has experience with that.
>>>
>>> That will be perfectly acceptable for many use cases that benefit from the
>>> scan (and hence query) performance more than they suffer from the load
>>> problem. Other use cases have fewer requirements on scan and query
>>> flexibility but rather want to preserve the quality that a Get has fixed
>>> resource usage.
>>>
>>> Btw.: I was convinced that Phoenix is keeping indexes on the region level.
>>> Is that not so?
>>>
>>> Thanks,
>>> Henning
>>>
>>>
>>> On 03.01.2014 17:57, Anoop John wrote:
>>>
>>>> In case of HBase normal scan as we know, regions will be scanned
>>>> sequentially. Phoenix has parallel scan implementations in it. When RLI is
>>>> used and we make use of the index completely at the server side, it is
>>>> independent of how the client scans. Sequential or parallel, using Java or
>>>> any other client layer or using an SQL layer like Phoenix, using MR or
>>>> not... the client side doesn't have to worry about this; the index usage
>>>> is fully at the server end.
>>>>
>>>> Yes, when a parallel scan is done on regions, RLI might perform much
>>>> better.
>>>>
>>>> -Anoop-
>>>>
>>>> On Fri, Jan 3, 2014 at 7:35 PM, rajeshbabu chintaguntla <
>>>> rajeshbabu.chintaguntla@huawei.com> wrote:
>>>>
>>>>> No, the regions are scanned sequentially.
>>>>> ________________________________________
>>>>> From: Asaf Mesika [asaf.mesika@gmail.com]
>>>>> Sent: Friday, January 03, 2014 7:26 PM
>>>>> To: user@hbase.apache.org
>>>>> Subject: Re: secondary index feature
>>>>>
>>>>> Are the regions scanned in parallel?
>>>>>
>>>>> On Friday, January 3, 2014, rajeshbabu chintaguntla wrote:
>>>>>
>>>>>> Here are some performance numbers with RLI.
>>>>>>
>>>>>> Number of region servers: 4
>>>>>> Data per region: 2 GB
>>>>>>
>>>>>> Regions/RS | Total regions | Block size (KB) | Rows matching values | Time taken (sec)
>>>>>>         50 |           200 |              64 |                  199 |              102
>>>>>>         50 |           200 |               8 |                  199 |               35
>>>>>>        100 |           400 |               8 |                  350 |               95
>>>>>>        200 |           800 |               8 |                  353 |              153
>>>>>>
>>>>>> Without a secondary index, the scan takes hours.
>>>>>>
>>>>>>
>>>>>> Thanks,
>>>>>> Rajeshbabu
>>>>>> ________________________________________
>>>>>> From: Anoop John [anoop.hbase@gmail.com]
>>>>>> Sent: Friday, January 03, 2014 3:22 PM
>>>>>> To: user@hbase.apache.org
>>>>>> Subject: Re: secondary index feature
>>>>>>
>>>>>>> Is there any data on how RLI (or in particular Phoenix) query throughput
>>>>>>> correlates with the number of region servers assuming homogeneously
>>>>>>> distributed data?
>>>>>>
>>>>>> Phoenix is yet to add RLI. For now it has global indexing only. Correct,
>>>>>> James?
>>>>>>
>>>>>> The RLI implementation from Huawei (HIndex) has some numbers with respect
>>>>>> to regions, but I doubt whether they cover a large number of RSs. Do you
>>>>>> have some data, Rajesh Babu?
>>>>>>
>>>>>> -Anoop-
>>>>>>
>>>>>> On Fri, Jan 3, 2014 at 3:11 PM, Henning Blohm <henning.blohm@zfabrik.de>
>>>>>> wrote:
>>>>>>
>>>>>>> Jesse, James, Lars,
>>>>>>>
>>>>>>> after looking around a bit and in particular looking into Phoenix (which I
>>>>>>> find very interesting), assuming that you want a secondary indexing on
>>>>>>> HBASE without adding other infrastructure, there seems to be not a lot of
>>>>>>> choice really: Either go with a region-level (and co-processor based)
>>>>>>> indexing feature (Phoenix, Huawei, is IHBase dead?) or add an index table
>>>>>>> to store (index value, entity key) pairs.
>>>>>>>
>>>>>>> The main concern I have with region-level indexing (RLI) is that Gets
>>>>>>> potentially require to visit all regions. Compared to global index tables
>>>>>>> this seems to flatten the read-scalability curve of the cluster. In our
>>>>>>> case, we have a large data set (hence HBASE) that will be queried (mostly
>>>>>>> point-gets via an index) in some linear correlation with its size.
>>>>>>>
>>>>>>> Is there any data on how RLI (or in particular Phoenix) query throughput
>>>>>>> correlates with the number of region servers assuming homogeneously
>>>>>>> distributed data?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Henning
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On 24.12.2013 12:18, Henning Blohm wrote:
>>>>>>>
>>>>>>>> All that sounds very promising. I will give it a try and let you know
>>>>>>>> how things worked out.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Henning
>>>>>>>>
>>>>>>>> On 12/23/2013 08:10 PM, Jesse Yates wrote:
>>>>>>>>
>>>>>>>>> The work that James is referencing grew out of the discussions Lars and I
>>>>>>>>> had (which led to those blog posts). The solution we implemented is
>>>>>>>>> designed to be generic, as James mentioned above, but was written with all
>>>>>>>>> the hooks necessary for Phoenix to do some really fast updates (or
>>>>>>>>> skipping updates in the case where there is no change).
>>>>>>>>>
>>>>>>>>> You should be able to plug in your own simple index builder (there is
>>>>>>>>> an example in the phoenix codebase
>>>>>>>>> <https://github.com/forcedotcom/phoenix/tree/master/src/main/java/com/salesforce/hbase/index/covered/example>)
>>>>>>>>> to get a basic solution which supports the same transactional guarantees
>>>>>>>>> as HBase (per row) + data guarantees across the index rows. There are more
>>>>>>>>> details in the presentations James linked.
>>>>>>>>>
>>>>>>>>> I'd love to see if your implementation can fit into the framework we
>>>>>>>>> wrote - we would be happy to work to see if it needs some more hooks or
>>>>>>>>> modifications - I have a feeling this is pretty much what you guys will
>>>>>>>>> need
>>>>>>>>>
>>>>>>>>> -Jesse
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Mon, Dec 23, 2013 at 10:01 AM, James Taylor <jtaylor@salesforce.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Henning,
>>>>>>>>>> Jesse Yates wrote the back-end of our global secondary indexing system in
>>>>>>>>>> Phoenix. He designed it as a separate, pluggable module with no Phoenix
>>>>>>>>>> dependencies. Here's an overview of the feature:
>>>>>>>>>> https://github.com/forcedotcom/phoenix/wiki/Secondary-Indexing. The
>>>>>>>>>> section that discusses the data guarantees and failure management might be
>>>>>>>>>> of interest to you:
>>>>>>>>>> https://github.com/forcedotcom/phoenix/wiki/Secondary-Indexing#data-guarantees-and-failure-management
>>>>>>>>>>
>>>>>>>>>> This presentation also gives a good overview of the pluggability of his
>>>>>>>>>>
>>> --
>>> Henning Blohm
>>>
>>> *ZFabrik Software KG*
>>>
>>> T:      +49 6227 3984255
>>> F:      +49 6227 3984254
>>> M:      +49 1781891820
>>>
>>> Lammstrasse 2 69190 Walldorf
>>>
>>> henning.blohm@zfabrik.de <mailto:henning.blohm@zfabrik.de>
>>> Linkedin <http://www.linkedin.com/pub/henning-blohm/0/7b5/628>
>>> ZFabrik <http://www.zfabrik.de>
>>> Blog <http://www.z2-environment.net/blog>
>>> Z2-Environment <http://www.z2-environment.eu>
>>> Z2 Wiki <http://redmine.z2-environment.net>
>>>
>>>
>>>
>
