phoenix-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Gerald Sangudi <>
Subject Re: Salting based on partial rowkeys
Date Thu, 13 Sep 2018 22:24:54 GMT
Sorry I missed Josh's reply; I've subscribed to the dev list now.

Below is a copy-and-paste from our internal document. Thanks in advance for
your review and additional feedback on this.


*BackgroundWe make extensive use of multi-column rowkeys and salting
<> in our different apache phoenix
deployments. We frequently perform group-by aggregations on these data
along a specific dimension that would benefit from predictably partitioning
the data along that dimension. Proposal:We propose to add table metadata to
allow schema designers to constrain salting to a subset of the rowkey,
rather than the full rowkey as it is today. This will introduce a mechanism
to partition data on a per-table basis along a single dimension without
application changes or much change to the phoenix runtime logic. We expect
this will result in substantially faster group-by’s along the salted
dimension and negligible penalties elsewhere. This feature has also been
proposed in PHOENIX-4757
<> where it was pointed
out that partitioning and sorting data along different dimensions is a
common pattern in other datastores as well.Theoretically, it could cause
hotspotting when querying along the salted dimension without the leading
rowkey - that would be an anti-pattern.Usage ExampleCurrent:Schema:CREATE
TABLE relationship (id_1 BIGINT NOT NULL,id_2 BIGINT NOT NULL,other_key
other_key))SALT_BUCKETS=60;Query:Select id_2, sum(val)From
relationshipWhere id_1 in (2,3)Group by id_2Explain:0: jdbc:phoenix:>
EXPLAIN Select id_2, sum(val) From relationship Where id_1 in (2,3) Group
by id_2
                                         PLAN    | EST_BY
[0,2] - [59,3]  | null ||     SERVER AGGREGATE INTO DISTINCT ROWS BY [ID_2]
                                      | null || CLIENT MERGE SORT
rows selected (0.048 seconds)In this case, although the group by is
performed on both the client and regionserver, almost all of the actual
grouping happens on the client because the id_2’s are randomly distributed
across the regionservers. As a result, a lot of unnecessary data is
serialized to the client and grouped serially there. This can become quite
material with large resultsets.Proposed:Schema:CREATE TABLE relationship
other_key),SALT_BUCKETS=60,SALT_COLUMN = id_2);Query (unchanged):Select
id_2, sum(val)From relationshipWhere id_1 in (2,3)Group by id_2Explain
(unchanged)Under the proposal, the data are merely partitioned so that all
rows containing the same id_2 are on the same regionserver, the above query
will perform almost all of the grouping in parallel on the regionservers.
No special hint or changes to the query plan would be required to benefit.
Tables would need to be re-salted to take advantage of the new
functionality.Technical changes proposed to phoenix: - Create a new piece
of table-level metadata: SALT_COLUMN. SALT_COLUMN will instruct the salting
logic to generate a salt-byte based only on the specified column. If
unspecified, it will behave as it does today and default to salting the
entire rowkey. This metadata may be specified only when the table is
created and may not be modified. The specified column must be part of the
rowkey.  - Modify all callers of getSaltingByte
value, int offset, int length, int bucketNum) to consistently leverage the
new metadata.- Tests- DocsDesign points:One salt column vs multiple salt
columns: Based on the existing signature for getSaltingByte, it seems
simpler to only support a single SALT_COLUMN rather than multiple arbitrary
SALT_COLUMNS. Known use-cases are completely supported by a single
column.Syntax:  PHOENIX-4757
<> suggests an alternate,
less verbose syntax for defining the salt bucket. The SALT_COLUMN syntax is
suggested for clarity and consistency with other Phoenix table
options.Future Enhancements (not in scope)Different aspects of the query
execution runtime could take advantage of new metadata and implied
knowledge that the data are partitioned in a predictable manner. For
example: - It could be that client side grouping is completely unnecessary
in cases where the SALT_COLUMN is part of the group-by expression.- A query
that contains a literal equality predicate for the SALT_COLUMN can be
isolated to a single regionserver, rather than broadcast to all
regionservers.- A client-side merge-sort-join based on the SALT_COLUMN
could optimize organization of merges. - Similarly, a server-side hash join
could distribute only ‘necessary’ portions of the hash table to each
regionserver.If additional advantages of these types come for free, then
that’s great but can be follow on enhancements from the initial commit.*

On Thu, Sep 13, 2018 at 9:33 AM, Thomas D'Silva <>

> Gerald,
> I think you missed Josh's reply here :
> thread.html/c5145461805429622a410c23c1199d578e146a5c94511b2d5833438b@%
> Could you explain how using a subset of the pk columns to generate the
> salt byte helps with partitioning, aggregations etc?
> Thanks,
> Thomas
> On Thu, Sep 13, 2018 at 8:32 AM, Gerald Sangudi <>
> wrote:
>> Hi folks,
>> Any thoughts or feedback on this?
>> Thanks,
>> Gerald
>> On Mon, Sep 10, 2018 at 1:56 PM, Gerald Sangudi <>
>> wrote:
>>> Hello folks,
>>> We have a requirement for salting based on partial, rather than full,
>>> rowkeys. My colleague Mike Polcari has identified the requirement and
>>> proposed an approach.
>>> I found an already-open JIRA ticket for the same issue:
>>> I can provide more
>>> details from the proposal.
>>> The JIRA proposes a syntax of SALT_BUCKETS(col, ...) = N, whereas Mike
>>> proposes SALT_COLUMN=col or SALT_COLUMNS=col, ... .
>>> The benefit at issue is that users gain more control over partitioning,
>>> and this can be used to push some additional aggregations and hash joins
>>> down to region servers.
>>> I would appreciate any go-ahead / thoughts / guidance / objections /
>>> feedback. I'd like to be sure that the concept at least is not
>>> objectionable. We would like to work on this and submit a patch down the
>>> road. I'll also add a note to the JIRA ticket.
>>> Thanks,
>>> Gerald

  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message