phoenix-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Hadoop QA (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (PHOENIX-4925) Use a Variant Segment tree to organize Guide Post Info
Date Sat, 04 May 2019 00:12:00 GMT

    [ https://issues.apache.org/jira/browse/PHOENIX-4925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16832947#comment-16832947
] 

Hadoop QA commented on PHOENIX-4925:
------------------------------------

{color:red}-1 overall{color}.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12967825/PHOENIX-4925.phoenix-stats.0502.patch
  against master branch at commit bb1327ef89fb0844094470ada74cbe5071b43a0d.
  ATTACHMENT ID: 12967825

    {color:green}+1 @author{color}.  The patch does not contain any @author tags.

    {color:green}+1 tests included{color}.  The patch appears to include 17 new or modified
tests.

    {color:red}-1 patch{color}.  The patch command could not apply the patch.

Console output: https://builds.apache.org/job/PreCommit-PHOENIX-Build/2554//console

This message is automatically generated.

> Use a Variant Segment tree to organize Guide Post Info
> ------------------------------------------------------
>
>                 Key: PHOENIX-4925
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-4925
>             Project: Phoenix
>          Issue Type: Improvement
>            Reporter: Bin Shi
>            Assignee: Bin Shi
>            Priority: Major
>         Attachments: PHOENIX-4925.phoenix-stats.0502.patch
>
>          Time Spent: 6h
>  Remaining Estimate: 0h
>
> As reported, Query compilation (for the sample queries showed below), especially deriving
estimation and generating parallel scans from guide posts, becomes much slower after we introduced
Phoenix Stats. 
>  a. SELECT f1__c FROM MyCustomBigObject__b ORDER BY Pk1__c
>  b. SELECT f1__c FROM MyCustomBigObject__b WHERE nonpk1__c = ‘x’ ORDER BY Pk1__c
>  c. SELECT f1__c FROM MyCustomBigObject__b WHERE pk2__c = ‘x’ ORDER BY pk1__c,pk2__c
>  d. SELECT f1__c FROM MyCustomBigObject__b WHERE pk1__c = ‘x’ AND nonpk1__c ORDER
BY pk1__c,pk2__c
>  e. SELECT f1__c FROM MyCustomBigObject__b WHERE pk__c >= 'd' AND pk__c <= 'm'
OR pk__c >= 'o' AND pk__c <= 'x' ORDER BY pk__c // pk__c is the only column to make
the primary key.
>   
>  By using prefix encoding for guide post info, we have to decode and traverse guide posts
sequentially, which causes time complexity in BaseResultIterators.getParallelScan(...) to
be O( n ) , where n is the total count of guide posts.
> According to PHOENIX-2417, to reduce footprint in client cache and over transmition,
the prefix encoding is used as in-memory and over-the-wire encoding for guide post info.
> We can use Segment Tree to address both memory and performance concerns. The guide posts
are partitioned to k chunks (k=1024?), each chunk is encoded by prefix encoding and the
encoded data is a leaf node of the tree. The inner node contains summary info (the count of
rows, the data size) of the sub tree rooted at the inner node.
> With this tree like data structure, compared to the current data structure, the increased
size (mainly coming from the n/k-1 inner nodes) is ignorable. The time complexity for queries
a, b, c can be reduced to O(m) where m is the total count of regions; the time complexity
for "EXPLAN" queries a, b, c can be reduced to O(m) too, and if we support "EXPLAIN (ESTIMATE
ONLY)", it can even be reduced to O(1). For queries d and e, the time complexity to find the
start of target scan ranges can be reduced to O(log(n/k)).
> The tree can also integrate AVL and B+ characteristics to support partial load/unload
when interacting with stats client cache.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Mime
View raw message