phoenix-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "James Taylor (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (PHOENIX-3670) KeyRange.intersect(List<KeyRange> , List<KeyRange>) is inefficient,especially for join dynamic filter
Date Tue, 14 Feb 2017 19:21:41 GMT

     [ https://issues.apache.org/jira/browse/PHOENIX-3670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

James Taylor updated PHOENIX-3670:
----------------------------------
    Fix Version/s: 4.10.0

> KeyRange.intersect(List<KeyRange> , List<KeyRange>) is inefficient,especially
for join dynamic filter
> -----------------------------------------------------------------------------------------------------
>
>                 Key: PHOENIX-3670
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-3670
>             Project: Phoenix
>          Issue Type: Improvement
>    Affects Versions: 4.9.0
>            Reporter: chenglei
>            Assignee: chenglei
>             Fix For: 4.10.0
>
>         Attachments: PHOENIX-3670_v1.patch
>
>
> In my business system, there is a following join SQL(which is simplified), fact_table
is a fact table,  joining dimension table dim_table1 and dim_table2 : 
> {code:borderStyle=solid} 
> select /*+ SKIP_SCAN */ sum(t.click)  from fact_table t join dim_table1 d1 on  t.cust_id=d1.id
 join dim_table2 d2 on t.cust_id =d2.id  where t.date between '2016-01-01' and '2017-01-01'
and d1.code = 2008 and d2.region = 'us';
> {code} 
> I use /*+ SKIP_SCAN */ hint to enable join dynamic filter. For some small dataset, the
sql executes quickly, but when the dataset is bigger, the sql becomes very slowly,when the
 row count of fact_table is 30 million,dim_table1 is 300 thousand and dim_table2 is 100 thousand,
the above query  costs 17s.
> When I debug the SQL executing, I find RHS1 return 5523 rows:
> {code:borderStyle=solid} 
>    select d1.id from dim_table1 d1 where d1.code = 2008
> {code} 
> and RHS2 return 23881 rows: 
> {code:borderStyle=solid}
>    select d2.id from dim_table2 d2 where d2.region='us'
> {code}  
> then HashJoinPlan uses  KeyRange.intersect(List<KeyRange> , List<KeyRange>
) method to compute RHS1 intersecting RHS2 for join dynamic filter, narrowing down fact_table.cust_id
should be. 
> Surprisingly,the KeyRange.intersect method costs 11s ! although the whole sql execution
only costs 17s.After I read the code of  KeyRange.intersect method,I find following two problem:
> (1) The double loop is inefficient in line 521 and line 522,when keyRanges  size is M,
keyRanges2 size is N, the time complexity is O(M*N), for my example,is 5523*23881: 
> {code:borderStyle=solid} 
> 519 public static List<KeyRange> intersect(List<KeyRange> keyRanges,  List<KeyRange>
keyRanges2) {
> 520        List<KeyRange> tmp = new ArrayList<KeyRange>();
> 521        for (KeyRange r1 : keyRanges) {
> 522            for (KeyRange r2 : keyRanges2) {
> 523                KeyRange r = r1.intersect(r2);
> 524                if (EMPTY_RANGE != r) {
> 525                    tmp.add(r);
> 526                }
> 527            }
> 528        }
> {code}  
> (2) line 540 shoule be r = r.union(tmp.get( i )), not intersect, just as KeyRange.coalesce
method does:
> {code:borderStyle=solid} 
> 532        Collections.sort(tmp, KeyRange.COMPARATOR);
> 533        List<KeyRange> tmp2 = new ArrayList<KeyRange>();
> 534        KeyRange r = tmp.get(0);
> 535        for (int i=1; i<tmp.size(); i++) {
> 536            if (EMPTY_RANGE == r.intersect(tmp.get(i))) {
> 537                tmp2.add(r);
> 538                r = tmp.get(i);
> 539            } else {
> 540                r = r.intersect(tmp.get(i));
> 541            }
> 542        }
> {code}
> and it seems that no unit tests for this KeyRange.intersect(List<KeyRange> , List<KeyRange>)
method. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

Mime
View raw message