Return-Path: X-Original-To: archive-asf-public-internal@cust-asf2.ponee.io Delivered-To: archive-asf-public-internal@cust-asf2.ponee.io Received: from cust-asf.ponee.io (cust-asf.ponee.io [163.172.22.183]) by cust-asf2.ponee.io (Postfix) with ESMTP id 1C82D200CC1 for ; Mon, 26 Jun 2017 07:20:07 +0200 (CEST) Received: by cust-asf.ponee.io (Postfix) id 1B401160BF4; Mon, 26 Jun 2017 05:20:07 +0000 (UTC) Delivered-To: archive-asf-public@cust-asf.ponee.io Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by cust-asf.ponee.io (Postfix) with SMTP id 61F5A160BE0 for ; Mon, 26 Jun 2017 07:20:06 +0200 (CEST) Received: (qmail 33984 invoked by uid 500); 26 Jun 2017 05:20:05 -0000 Mailing-List: contact dev-help@phoenix.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: dev@phoenix.apache.org Delivered-To: mailing list dev@phoenix.apache.org Received: (qmail 33973 invoked by uid 99); 26 Jun 2017 05:20:05 -0000 Received: from pnap-us-west-generic-nat.apache.org (HELO spamd1-us-west.apache.org) (209.188.14.142) by apache.org (qpsmtpd/0.29) with ESMTP; Mon, 26 Jun 2017 05:20:05 +0000 Received: from localhost (localhost [127.0.0.1]) by spamd1-us-west.apache.org (ASF Mail Server at spamd1-us-west.apache.org) with ESMTP id DBBCAC086E for ; Mon, 26 Jun 2017 05:20:04 +0000 (UTC) X-Virus-Scanned: Debian amavisd-new at spamd1-us-west.apache.org X-Spam-Flag: NO X-Spam-Score: -99.211 X-Spam-Level: X-Spam-Status: No, score=-99.211 tagged_above=-999 required=6.31 tests=[KAM_ASCII_DIVIDERS=0.8, SPF_PASS=-0.001, T_RP_MATCHES_RCVD=-0.01, USER_IN_WHITELIST=-100] autolearn=disabled Received: from mx1-lw-eu.apache.org ([10.40.0.8]) by localhost (spamd1-us-west.apache.org [10.40.0.7]) (amavisd-new, port 10024) with ESMTP id WR10Pa-3wW_f for ; Mon, 26 Jun 2017 05:20:04 +0000 (UTC) Received: from mailrelay1-us-west.apache.org (mailrelay1-us-west.apache.org [209.188.14.139]) by mx1-lw-eu.apache.org (ASF Mail Server at mx1-lw-eu.apache.org) with ESMTP id 3D81D5FB71 for ; Mon, 26 Jun 2017 05:20:03 +0000 (UTC) Received: from jira-lw-us.apache.org (unknown [207.244.88.139]) by mailrelay1-us-west.apache.org (ASF Mail Server at mailrelay1-us-west.apache.org) with ESMTP id 3B580E0ADD for ; Mon, 26 Jun 2017 05:20:02 +0000 (UTC) Received: from jira-lw-us.apache.org (localhost [127.0.0.1]) by jira-lw-us.apache.org (ASF Mail Server at jira-lw-us.apache.org) with ESMTP id 5DF6D240C0 for ; Mon, 26 Jun 2017 05:20:00 +0000 (UTC) Date: Mon, 26 Jun 2017 05:20:00 +0000 (UTC) From: "Ethan Wang (JIRA)" To: dev@phoenix.apache.org Message-ID: In-Reply-To: References: Subject: [jira] [Comment Edited] (PHOENIX-153) Implement TABLESAMPLE clause MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit X-JIRA-FingerPrint: 30527f35849b9dde25b450d4833f0394 archived-at: Mon, 26 Jun 2017 05:20:07 -0000 [ https://issues.apache.org/jira/browse/PHOENIX-153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16062546#comment-16062546 ] Ethan Wang edited comment on PHOENIX-153 at 6/26/17 5:19 AM: ------------------------------------------------------------- Valid Point. In addition, by design, this coarse problem gets magnified when three things happen (and vice versa): 1, Table is too small 2, Guidepost width set too wide, or even no stats collected at all 3, User specifies to not use stats table for parallelization. Based on the observation from the testing on a table with 400K rows and GUIDE_POSTS_WIDTH =10KB or 200KB, the sampled size was usually around +-5% of expected size. This performance gets better and better when the GuidePosts used are more granular (Detailed chart attached.) !https://issues.apache.org/jira/secure/attachment/12874429/Sampling_Accuracy_Performance.jpg|height=250,width=450! A chart that denotes the TABLESAMPLING's accuracy regarding sampled size vs expected size. Note, 1, The test environment is a single node, single region hbase cluster (1.3). Test table with random integer as PK, with about 400K rows. 2, The guide post width has been pre set as 10K and 200K, respectively 3, The consistent hashing algorithm used in TableSamplerPredicate (a.k.a, the dice, used to hashing scan.star_rowkey to decide if a guidepost is going to be selected), is implemented as FNV was (Author: aertoria): Valid Point. In addition, by design, this coarse problem gets magnified when three things happen (and vice versa): 1, Table is too small 2, Guidepost width set too wide, or even no stats collected at all 3, User specifies to not use stats table for parallelization. Based on the observation from the testing on a table with 400K rows and GUIDE_POSTS_WIDTH =10KB or 200KB, the sampled size was usually around +-5% of expected size. This performance gets better and better when the GuidePosts used are more granular (Detailed chart attached.) !https://issues.apache.org/jira/secure/attachment/12874429/Sampling_Accuracy_Performance.jpg! A chart that denotes the TABLESAMPLING's accuracy regarding sampled size vs expected size. Note, 1, The test environment is a single node, single region hbase cluster (1.3). Test table with random integer as PK, with about 400K rows. 2, The guide post width has been pre set as 10K and 200K, respectively 3, The consistent hashing algorithm used in TableSamplerPredicate (a.k.a, the dice, used to hashing scan.star_rowkey to decide if a guidepost is going to be selected), is implemented as FNV > Implement TABLESAMPLE clause > ---------------------------- > > Key: PHOENIX-153 > URL: https://issues.apache.org/jira/browse/PHOENIX-153 > Project: Phoenix > Issue Type: Task > Reporter: James Taylor > Assignee: Ethan Wang > Labels: enhancement > Attachments: Sampling_Accuracy_Performance.jpg > > > Support the standard SQL TABLESAMPLE clause by implementing a filter that uses a skip next hint based on the region boundaries of the table to only return n rows per region. -- This message was sent by Atlassian JIRA (v6.4.14#64029)