phoenix-dev mailing list archives

From "James Taylor (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (PHOENIX-2143) Use guidepost bytes instead of region name in stats primary key
Date Fri, 20 Nov 2015 18:58:11 GMT

    [ https://issues.apache.org/jira/browse/PHOENIX-2143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15011919#comment-15011919
] 

James Taylor edited comment on PHOENIX-2143 at 11/20/15 6:57 PM:
-----------------------------------------------------------------

I think the following row key for our stats table will be most flexible:
|PHYSICAL_NAME|VARCHAR|
|COLUMN_FAMILY|VARCHAR|
|GUIDE_POST_KEY|VARBINARY|

In this case, there'd be one row per guidepost per cf, while today there's one row per region
per cf.

We can keep the {{GUIDE_POST_WIDTH}} column as a KV column. This will allow us to query the
stats table with the start/stop row key of a scan to estimate how many bytes will be scanned.
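
To make the range lookup concrete, here is a rough in-memory sketch in Java. It is not Phoenix code: the class {{GuidePostIndex}} and its methods are hypothetical names, a sorted map stands in for the stats rows of a single PHYSICAL_NAME/COLUMN_FAMILY pair, and treating the start row as inclusive and the stop row as exclusive is an assumption.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.NavigableMap;
import java.util.TreeMap;

// Hypothetical in-memory stand-in for the stats rows of one table and column family.
final class GuidePostIndex {
    // HBase orders row keys as unsigned bytes, so compare the same way (needs Java 9+).
    private static final Comparator<byte[]> UNSIGNED = Arrays::compareUnsigned;

    // GUIDE_POST_KEY -> GUIDE_POST_WIDTH (in bytes), sorted by key.
    private final NavigableMap<byte[], Long> widthByKey = new TreeMap<>(UNSIGNED);

    void addGuidePost(byte[] guidePostKey, long widthInBytes) {
        widthByKey.put(guidePostKey.clone(), widthInBytes);
    }

    // Sums the widths of all guideposts whose key falls in [startRow, stopRow).
    // Assumption: start inclusive, stop exclusive, and startRow <= stopRow.
    long estimateBytes(byte[] startRow, byte[] stopRow) {
        long total = 0L;
        for (long width : widthByKey.subMap(startRow, true, stopRow, false).values()) {
            total += width;
        }
        return total;
    }
}
```

Because the guidepost key is the last part of the row key, the rows for a scan range are contiguous, so the estimate is a single range read followed by a sum.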

When we're updating stats, we can first query for the old rows that fall between the
start/end key of the old region and add deletes for them to our all-or-none mutations, along
with the puts for our newly recalculated stats.
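
As a sketch of that update flow (again in-memory Java, not the actual Phoenix or HBase API): {{StatsRegionUpdate}}, {{Mutation}}, {{buildBatch}} and {{apply}} are all hypothetical names, a null width marks a delete, and the region is assumed to have a bounded end key. The batch is applied to a copy that the caller swaps in, which is only meant to mimic the all-or-none behaviour.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

// Hypothetical sketch: replace a region's old guideposts with recalculated ones in one batch.
final class StatsRegionUpdate {
    // Unsigned byte ordering, matching how HBase sorts row keys (needs Java 9+).
    private static final Comparator<byte[]> UNSIGNED = Arrays::compareUnsigned;

    // One pending change; a null width means "delete this guidepost row".
    static final class Mutation {
        final byte[] key;
        final Long width;
        Mutation(byte[] key, Long width) { this.key = key; this.width = width; }
    }

    static NavigableMap<byte[], Long> newTable() {
        return new TreeMap<>(UNSIGNED);
    }

    // Query the old rows in [regionStart, regionEnd), queue deletes for them,
    // then queue puts for the recalculated guideposts.
    static List<Mutation> buildBatch(NavigableMap<byte[], Long> table,
                                     byte[] regionStart, byte[] regionEnd,
                                     Map<byte[], Long> recalculated) {
        List<Mutation> batch = new ArrayList<>();
        for (byte[] oldKey : table.subMap(regionStart, true, regionEnd, false).keySet()) {
            batch.add(new Mutation(oldKey, null));
        }
        for (Map.Entry<byte[], Long> e : recalculated.entrySet()) {
            batch.add(new Mutation(e.getKey(), e.getValue()));
        }
        return batch;
    }

    // Apply the whole batch to a copy; the caller replaces the table only if this returns.
    static NavigableMap<byte[], Long> apply(NavigableMap<byte[], Long> table, List<Mutation> batch) {
        NavigableMap<byte[], Long> copy = newTable();
        copy.putAll(table);
        for (Mutation m : batch) {
            if (m.width == null) {
                copy.remove(m.key);
            } else {
                copy.put(m.key, m.width);
            }
        }
        return copy;
    }
}
```

Queuing the deletes ahead of the puts means a recalculated guidepost that happens to reuse an old key still ends up present after the batch is applied.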


was (Author: jamestaylor):
I think the following row key for our stats table will be most flexible:
|PHYSICAL_NAME|VARCHAR|
|COLUMN_FAMILY|VARCHAR|
|GUIDE_POST_KEY|VARBINARY|

We can keep {{GUIDE_POST_WIDTH}} column as a KV column. This will allow us to query the stats
table given a start/stop row key of a scan to know how many bytes will be scanned over.

When we're updating stats, we can do a query first to find the old rows that are between the
start/end key of the old region and add them to our all-or-none mutations with the puts for
our new recalculated stats.

> Use guidepost bytes instead of region name in stats primary key
> ---------------------------------------------------------------
>
>                 Key: PHOENIX-2143
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-2143
>             Project: Phoenix
>          Issue Type: Sub-task
>            Reporter: James Taylor
>
> Our current SYSTEM.STATS table uses the region name as the last column in the primary
> key constraint. Instead, we should use the MIN_KEY column (which corresponds to the region
> start key). The advantage would be that the stats would then be ordered by region start key,
> allowing us to approximate the number of guideposts which would be traversed given the
> start/stop row of a scan:
> {code}
> SELECT SUM(guide_posts_count) FROM SYSTEM.STATS WHERE min_key > :1 AND min_key < :2
> {code}
> where :1 is the start row and :2 is the stop row of the scan. With an UNNEST operator
> for ARRAYs, we could get a better approximation.
> As part of the upgrade to the new Phoenix version containing this fix, stats could simply
> be dropped and they'd be recalculated with the new schema.
> An alternative, even more granular approach would be to *not* use arrays to store the
> guide posts, but instead store them as individual rows with a schema like this:
> |PHYSICAL_NAME|VARCHAR|
> |COLUMN_FAMILY|VARCHAR|
> |GUIDE_POST_KEY|VARBINARY|
> |GUIDE_POST_WIDTH|LONG|
> In this alternative, the maintenance during compaction is higher, though, as you'd need
> to run a separate query to do the deletion of the old guideposts, followed by a commit of
> the new guideposts. The other disadvantage (besides requiring multiple queries) is that this
> couldn't be done transactionally.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
