phoenix-dev mailing list archives

From "Samarth Jain (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (PHOENIX-3836) Estimated row count is twice the actual row count when stats are updated via major compaction
Date Wed, 07 Jun 2017 01:33:18 GMT

     [ https://issues.apache.org/jira/browse/PHOENIX-3836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Samarth Jain updated PHOENIX-3836:
----------------------------------
    Attachment: PHOENIX-3836.patch

The attached patch contains a test that reproduces the issue, along with the fix. It turns out
that, at least in HBase 0.98, a major compaction limits the number of key values that a single
internalScanner.next() call can return. As a result, a row with more key values than that limit
is returned across several next() calls, and our DefaultStatisticsCollector may end up counting
that row more than once. The issue is reproducible only when a row has more than 10 key values
(10 is the default for hbase.hstore.compaction.kv.max).
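
To illustrate the double counting, here is a minimal, self-contained simulation in Python. It is
not Phoenix or HBase code: batches(), count_rows_per_batch(), count_rows_by_key() and make_cells()
are hypothetical names, and the batching only mimics the per-call key value limit described above.
Counting by row key is one plausible way to avoid the over-count; this message does not say
whether the attached patch takes that approach.

```python
from itertools import groupby

# Assumption: mirrors the default of hbase.hstore.compaction.kv.max cited above.
KV_MAX = 10


def make_cells(num_rows, kvs_per_row):
    """Build sorted (row_key, column) pairs standing in for key values."""
    return [(f"row{r:04d}", f"col{c:02d}")
            for r in range(num_rows)
            for c in range(kvs_per_row)]


def batches(cells, limit=KV_MAX):
    """Mimic a scanner whose next() returns at most `limit` cells of one row."""
    for _, row_cells in groupby(cells, key=lambda cell: cell[0]):
        row_cells = list(row_cells)
        for start in range(0, len(row_cells), limit):
            yield row_cells[start:start + limit]


def count_rows_per_batch(cells, limit=KV_MAX):
    """Over-counts: treats every returned batch as a new row."""
    return sum(1 for _ in batches(cells, limit))


def count_rows_by_key(cells, limit=KV_MAX):
    """Counts a row only when the row key differs from the previous batch."""
    count = 0
    last_key = None
    for batch in batches(cells, limit):
        key = batch[0][0]
        if key != last_key:
            count += 1
            last_key = key
    return count


if __name__ == "__main__":
    cells = make_cells(num_rows=5, kvs_per_row=12)
    print(count_rows_per_batch(cells), count_rows_by_key(cells))
```

With 12 key values per row, each row is split into two batches, so the per-batch count is twice
the real row count. With 10 or fewer key values per row, both counts agree.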

[~jamestaylor], please review.

> Estimated row count is twice the actual row count when stats are updated via major compaction
> ---------------------------------------------------------------------------------------------
>
>                 Key: PHOENIX-3836
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-3836
>             Project: Phoenix
>          Issue Type: Bug
>            Reporter: Mujtaba Chohan
>            Assignee: Samarth Jain
>            Priority: Minor
>         Attachments: PHOENIX-3836.patch
>
>
> Estimated row count for a 2M table is 3986498 after stats updated via major compaction
> vs 1993250 with {{update statistics}}.
> {noformat}
> Explain plan for count(*) on 2M row table after major compaction:
> +--------------------------------------------------------------------------------+
> |                                      PLAN                                      |
> +--------------------------------------------------------------------------------+
> | CLIENT 364-CHUNK 3986498 ROWS 3774892993 BYTES PARALLEL 1-WAY FULL SCAN OVER T |
> |     SERVER FILTER BY FIRST KEY ONLY                                            |
> |     SERVER AGGREGATE INTO SINGLE ROW                                           |
> +--------------------------------------------------------------------------------+
> Explain plan for count(*) on 2M row table after update statistics:
> +--------------------------------------------------------------------------------+
> |                                      PLAN                                      |
> +--------------------------------------------------------------------------------+
> | CLIENT 364-CHUNK 1993250 ROWS 3774892993 BYTES PARALLEL 1-WAY FULL SCAN OVER T |
> |     SERVER FILTER BY FIRST KEY ONLY                                            |
> |     SERVER AGGREGATE INTO SINGLE ROW                                           |
> +--------------------------------------------------------------------------------+
> {noformat}
> The following schema was used with 2M rows and a 10MB guidepost width:
> {noformat}
> CREATE TABLE IF NOT EXISTS T (PKA CHAR(15) NOT NULL, PKF CHAR(3) NOT NULL,
>  PKP CHAR(15) NOT NULL, CRD DATE NOT NULL, EHI CHAR(15) NOT NULL, STD_COL VARCHAR,
>  INDEXED_COL INTEGER,
>  CONSTRAINT PK PRIMARY KEY ( PKA, PKF, PKP, CRD DESC, EHI))
>  VERSIONS=1,MULTI_TENANT=true,IMMUTABLE_ROWS=true
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
