Mailing-List: contact dev-help@phoenix.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@phoenix.apache.org
Date: Mon, 26 Jun 2017 05:20:00 +0000 (UTC)
From: "Ethan Wang (JIRA)" <jira@apache.org>
To: dev@phoenix.apache.org
Message-ID: <JIRA.12699925.1394495194000.101741.1498454400384@Atlassian.JIRA>
In-Reply-To: <JIRA.12699925.1394495194000@Atlassian.JIRA>
References: <JIRA.12699925.1394495194000@Atlassian.JIRA> <JIRA.12699925.1394495194677@jira-lw-us.apache.org>
Subject: [jira] [Comment Edited] (PHOENIX-153) Implement TABLESAMPLE clause
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
archived-at: Mon, 26 Jun 2017 05:20:07 -0000


    [ https://issues.apache.org/jira/browse/PHOENIX-153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16062546#comment-16062546 ] 

Ethan Wang edited comment on PHOENIX-153 at 6/26/17 5:19 AM:
-------------------------------------------------------------

Valid point.

In addition, by design, this coarseness problem is magnified under three conditions (and shrinks under the opposite conditions):
1, The table is too small
2, The guidepost width is set too wide, or no stats have been collected at all
3, The user chooses not to use the stats table for parallelization.

Based on testing on a table with 400K rows and GUIDE_POSTS_WIDTH = 10KB or 200KB, the sampled size was usually within about +-5% of the expected size. Accuracy improves as the guideposts used become more granular (detailed chart attached).
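
One way to see why finer guideposts help is a simple model. This is an assumption for illustration, not something measured in the test above: if each of N equally sized guidepost chunks is selected independently with probability p, the sampled size is binomially distributed and its relative standard deviation is sqrt((1 - p) / (p * N)). The class and method names below are hypothetical and are not part of Phoenix.

```java
// Illustrative model only: assumes equally sized guidepost chunks that are
// each selected independently with the same probability (binomial model).
class SamplingSpreadSketch {

    // Relative standard deviation of the sampled size under the binomial
    // model: sqrt((1 - rate) / (rate * numGuideposts)).
    static double relativeStdDev(long numGuideposts, double rate) {
        if (numGuideposts <= 0 || rate <= 0.0 || rate > 1.0) {
            throw new IllegalArgumentException(
                "need numGuideposts > 0 and 0 < rate <= 1");
        }
        return Math.sqrt((1.0 - rate) / (rate * numGuideposts));
    }

    public static void main(String[] args) {
        // More guideposts (finer granularity) gives a smaller relative spread.
        for (long n : new long[] {100L, 1_000L, 10_000L}) {
            System.out.printf("guideposts=%d relativeStdDev=%.4f%n",
                n, relativeStdDev(n, 0.5));
        }
    }
}
```

Under this model the spread shrinks with the square root of the number of guideposts, which matches the direction of the trend described above; the actual figures depend on the real chunk sizes and on the hash used.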

!https://issues.apache.org/jira/secure/attachment/12874429/Sampling_Accuracy_Performance.jpg|height=250,width=450!

The chart shows TABLESAMPLE accuracy in terms of sampled size vs. expected size.
Notes:
1, The test environment is a single-node, single-region HBase cluster (1.3). The test table uses a random integer as PK and has about 400K rows.
2, The guidepost width was preset to 10KB and 200KB, respectively
3, The consistent hashing algorithm used in TableSamplerPredicate (a.k.a. the dice, which hashes scan.star_rowkey to decide whether a guidepost is selected) is implemented as FNV
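
The "dice" idea in note 3 can be sketched as follows. This is a minimal illustration, not the actual TableSamplerPredicate code: the comment only says "FNV", so the 32-bit FNV-1a variant, the mapping of the hash onto a 0-100 range, and all class and method names here are assumptions.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of a hash-based sampling decision per guidepost.
class FnvSamplerSketch {
    private static final int FNV_OFFSET_BASIS_32 = 0x811c9dc5;
    private static final int FNV_PRIME_32 = 0x01000193;

    // 32-bit FNV-1a: XOR each byte into the hash, then multiply by the prime.
    static int fnv1a32(byte[] data) {
        int hash = FNV_OFFSET_BASIS_32;
        for (byte b : data) {
            hash ^= (b & 0xff);
            hash *= FNV_PRIME_32;
        }
        return hash;
    }

    // Maps the unsigned hash of the start row key onto [0, 100) and selects
    // the guidepost when it falls below the sampling rate (in percent).
    // The same key always gives the same answer, so the sample is repeatable.
    static boolean isSelected(byte[] startRowKey, double samplingRatePercent) {
        long unsigned = fnv1a32(startRowKey) & 0xffffffffL;
        double bucket = (unsigned / 4294967296.0) * 100.0;
        return bucket < samplingRatePercent;
    }

    public static void main(String[] args) {
        int total = 1000;
        int selected = 0;
        for (int i = 0; i < total; i++) {
            // Made-up keys standing in for guidepost start row keys.
            byte[] key = ("row-" + i).getBytes(StandardCharsets.UTF_8);
            if (isSelected(key, 10.0)) {
                selected++;
            }
        }
        System.out.println(selected + " of " + total + " keys selected at 10%");
    }
}
```

Because the decision depends only on the key bytes and the rate, repeated queries at the same rate pick the same guideposts, and a key selected at a lower rate is also selected at any higher rate.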


> Implement TABLESAMPLE clause
> ----------------------------
>
>                 Key: PHOENIX-153
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-153
>             Project: Phoenix
>          Issue Type: Task
>            Reporter: James Taylor
>            Assignee: Ethan Wang
>              Labels: enhancement
>         Attachments: Sampling_Accuracy_Performance.jpg
>
>
> Support the standard SQL TABLESAMPLE clause by implementing a filter that uses a skip next hint based on the region boundaries of the table to only return n rows per region.

