phoenix-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "larsh@apache.org" <la...@apache.org>
Subject Re: Salting based on partial rowkeys
Date Mon, 17 Sep 2018 04:51:55 GMT
 I added some comments on the PHOENIX-4757

    On Thursday, September 13, 2018, 6:42:12 PM PDT, Josh Elser <elserj@apache.org>
wrote:  
 
 Ahh, I get you now.

For a composite primary key made up of columns 1 through N, you want 
similar controls to compute the value of the salt based on a sequence of 
the columns 1 through M where M <= N (instead of always on all columns).

For large numbers of salt buckets and a scan over a facet, you prune 
your search space considerably. Makes sense to me!

On 9/13/18 6:37 PM, Gerald Sangudi wrote:
> In case the text formatting is lost below, I also added it as a comment in
> the JIRA ticket:
> 
> https://issues.apache.org/jira/browse/PHOENIX-4757
> 
> 
> On Thu, Sep 13, 2018 at 3:24 PM, Gerald Sangudi <gsangudi@23andme.com>
> wrote:
> 
>> Sorry I missed Josh's reply; I've subscribed to the dev list now.
>>
>> Below is a copy-and-paste from our internal document. Thanks in advance
>> for your review and additional feedback on this.
>>
>> Gerald
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> *BackgroundWe make extensive use of multi-column rowkeys and salting
>> <https://phoenix.apache.org/salted.html> in our different apache phoenix
>> deployments. We frequently perform group-by aggregations on these data
>> along a specific dimension that would benefit from predictably partitioning
>> the data along that dimension. Proposal:We propose to add table metadata to
>> allow schema designers to constrain salting to a subset of the rowkey,
>> rather than the full rowkey as it is today. This will introduce a mechanism
>> to partition data on a per-table basis along a single dimension without
>> application changes or much change to the phoenix runtime logic. We expect
>> this will result in substantially faster group-by’s along the salted
>> dimension and negligible penalties elsewhere. This feature has also been
>> proposed in PHOENIX-4757
>> <https://issues.apache.org/jira/browse/PHOENIX-4757> where it was pointed
>> out that partitioning and sorting data along different dimensions is a
>> common pattern in other datastores as well.Theoretically, it could cause
>> hotspotting when querying along the salted dimension without the leading
>> rowkey - that would be an anti-pattern.Usage ExampleCurrent:Schema:CREATE
>> TABLE relationship (id_1 BIGINT NOT NULL,id_2 BIGINT NOT NULL,other_key
>> BIGINT NOT NULL,val SMALLINT,CONSTRAINT pk PRIMARY KEY (id_1, id_2,
>> other_key))SALT_BUCKETS=60;Query:Select id_2, sum(val)From
>> relationshipWhere id_1 in (2,3)Group by id_2Explain:0: jdbc:phoenix:>
>> EXPLAIN Select id_2, sum(val) From relationship Where id_1 in (2,3) Group
>> by id_2
>> ;+-----------------------------------------------------------------------------------------+--------+|
>>                                          PLAN    | EST_BY
>> |+-----------------------------------------------------------------------------------------+--------+|
>> CLIENT 60-CHUNK PARALLEL 60-WAY SKIP SCAN ON 120 KEYS OVER RELATIONSHIP
>> [0,2] - [59,3]  | null ||    SERVER AGGREGATE INTO DISTINCT ROWS BY [ID_2]
>>                                        | null || CLIENT MERGE
SORT
>>                                                       
                |
>> null
>> |+-----------------------------------------------------------------------------------------+--------+3
>> rows selected (0.048 seconds)In this case, although the group by is
>> performed on both the client and regionserver, almost all of the actual
>> grouping happens on the client because the id_2’s are randomly distributed
>> across the regionservers. As a result, a lot of unnecessary data is
>> serialized to the client and grouped serially there. This can become quite
>> material with large resultsets.Proposed:Schema:CREATE TABLE relationship
>> (id_1 BIGINT NOT NULL,id_2 BIGINT NOT NULL,other_key BIGINT NOT NULL,val
>> SMALLINT,CONSTRAINT pk PRIMARY KEY (id_1, id_2,
>> other_key),SALT_BUCKETS=60,SALT_COLUMN = id_2);Query (unchanged):Select
>> id_2, sum(val)From relationshipWhere id_1 in (2,3)Group by id_2Explain
>> (unchanged)Under the proposal, the data are merely partitioned so that all
>> rows containing the same id_2 are on the same regionserver, the above query
>> will perform almost all of the grouping in parallel on the regionservers.
>> No special hint or changes to the query plan would be required to benefit.
>> Tables would need to be re-salted to take advantage of the new
>> functionality.Technical changes proposed to phoenix: - Create a new piece
>> of table-level metadata: SALT_COLUMN. SALT_COLUMN will instruct the salting
>> logic to generate a salt-byte based only on the specified column. If
>> unspecified, it will behave as it does today and default to salting the
>> entire rowkey. This metadata may be specified only when the table is
>> created and may not be modified. The specified column must be part of the
>> rowkey.  - Modify all callers of getSaltingByte
>> <https://github.com/apache/phoenix/blob/master/phoenix-core/src/main/java/org/apache/phoenix/schema/SaltingUtil.java#L77>(byte[]
>> value, int offset, int length, int bucketNum) to consistently leverage the
>> new metadata.- Tests- DocsDesign points:One salt column vs multiple salt
>> columns: Based on the existing signature for getSaltingByte, it seems
>> simpler to only support a single SALT_COLUMN rather than multiple arbitrary
>> SALT_COLUMNS. Known use-cases are completely supported by a single
>> column.Syntax:  PHOENIX-4757
>> <https://issues.apache.org/jira/browse/PHOENIX-4757> suggests an alternate,
>> less verbose syntax for defining the salt bucket. The SALT_COLUMN syntax is
>> suggested for clarity and consistency with other Phoenix table
>> options.Future Enhancements (not in scope)Different aspects of the query
>> execution runtime could take advantage of new metadata and implied
>> knowledge that the data are partitioned in a predictable manner. For
>> example: - It could be that client side grouping is completely unnecessary
>> in cases where the SALT_COLUMN is part of the group-by expression.- A query
>> that contains a literal equality predicate for the SALT_COLUMN can be
>> isolated to a single regionserver, rather than broadcast to all
>> regionservers.- A client-side merge-sort-join based on the SALT_COLUMN
>> could optimize organization of merges. - Similarly, a server-side hash join
>> could distribute only ‘necessary’ portions of the hash table to each
>> regionserver.If additional advantages of these types come for free, then
>> that’s great but can be follow on enhancements from the initial commit.*
>>
>> On Thu, Sep 13, 2018 at 9:33 AM, Thomas D'Silva <tdsilva@salesforce.com>
>> wrote:
>>
>>> Gerald,
>>>
>>> I think you missed Josh's reply here : https://lists.apache.org/thr
>>> ead.html/c5145461805429622a410c23c1199d578e146a5c94511b2d583
>>> 3438b@%3Cdev.phoenix.apache.org%3E
>>>
>>> Could you explain how using a subset of the pk columns to generate the
>>> salt byte helps with partitioning, aggregations etc?
>>>
>>> Thanks,
>>> Thomas
>>>
>>> On Thu, Sep 13, 2018 at 8:32 AM, Gerald Sangudi <gsangudi@23andme.com>
>>> wrote:
>>>
>>>> Hi folks,
>>>>
>>>> Any thoughts or feedback on this?
>>>>
>>>> Thanks,
>>>> Gerald
>>>>
>>>> On Mon, Sep 10, 2018 at 1:56 PM, Gerald Sangudi <gsangudi@23andme.com>
>>>> wrote:
>>>>
>>>>> Hello folks,
>>>>>
>>>>> We have a requirement for salting based on partial, rather than full,
>>>>> rowkeys. My colleague Mike Polcari has identified the requirement and
>>>>> proposed an approach.
>>>>>
>>>>> I found an already-open JIRA ticket for the same issue:
>>>>> https://issues.apache.org/jira/browse/PHOENIX-4757. I can provide more
>>>>> details from the proposal.
>>>>>
>>>>> The JIRA proposes a syntax of SALT_BUCKETS(col, ...) = N, whereas Mike
>>>>> proposes SALT_COLUMN=col or SALT_COLUMNS=col, ... .
>>>>>
>>>>> The benefit at issue is that users gain more control over partitioning,
>>>>> and this can be used to push some additional aggregations and hash joins
>>>>> down to region servers.
>>>>>
>>>>> I would appreciate any go-ahead / thoughts / guidance / objections /
>>>>> feedback. I'd like to be sure that the concept at least is not
>>>>> objectionable. We would like to work on this and submit a patch down
the
>>>>> road. I'll also add a note to the JIRA ticket.
>>>>>
>>>>> Thanks,
>>>>> Gerald
>>>>>
>>>>>
>>>>
>>>
>>
> 
  
Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message