From: "Ethan Wang (JIRA)"
To: dev@phoenix.apache.org
Reply-To: dev@phoenix.apache.org
Date: Thu, 24 Aug 2017 20:35:00 +0000 (UTC)
Subject: [jira] [Comment Edited] (PHOENIX-418) Support approximate COUNT DISTINCT

    [ https://issues.apache.org/jira/browse/PHOENIX-418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16140468#comment-16140468 ]

Ethan Wang edited comment on PHOENIX-418 at 8/24/17 8:34 PM:
-------------------------------------------------------------

Thanks for the suggestions [~jamestaylor].
{quote}Make sure to call TestUtil.analyzeTable(connection, fullTableName) prior to running your TABLESAMPLE queries. You'll get more rows back, since you'll have guideposts in addition to region boundaries.{quote}
Correct me if I'm mistaken. Within the test for approximate count distinct, the table sampling technique was not used, since HyperLogLog doesn't rely on sampling (hence there is no run-time speed gain during counting). So updating the guideposts may have less impact on approximate count distinct.


was (Author: aertoria):
Thanks for the suggestions [~jamestaylor].
bq. Make sure to call TestUtil.analyzeTable(connection, fullTableName) prior to running your TABLESAMPLE queries. You'll get more rows back, since you'll have guideposts in addition to region boundaries.
If I understand this part right, within the test for approximate count distinct, the table sampling technique was not used, since HyperLogLog doesn't rely on sampling (hence there is no run-time speed gain during counting).

> Support approximate COUNT DISTINCT
> ----------------------------------
>
>                 Key: PHOENIX-418
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-418
>             Project: Phoenix
>          Issue Type: Task
>            Reporter: James Taylor
>            Assignee: Ethan Wang
>              Labels: gsoc2016
>         Attachments: PHOENIX-418-v1.patch, PHOENIX-418-v2.patch, PHOENIX-418-v3.patch, PHOENIX-418-v4.patch, PHOENIX-418-v5.patch
>
>
> Support an "approximation" of count distinct to prevent having to hold on to all distinct values (since this will not scale well when the number of distinct values is huge). The Apache Drill folks have had some interesting discussions on this [here](http://mail-archives.apache.org/mod_mbox/incubator-drill-dev/201306.mbox/%3CJIRA.12650169.1369931282407.88049.1370645900553%40arcas%3E). They recommend using [Welford's method](http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance_Online_algorithm). I'm open to having a config option that selects exact versus approximate. I don't have experience implementing an approximate implementation, so I'm not sure how much state is required to keep on the server and return to the client (other than realizing it'd be much less than returning all distinct values and their counts).

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
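
For readers following the thread, here is a minimal sketch of the HyperLogLog idea mentioned in the comment. It is illustrative only and is not the code from the attached patches: the class name `HllSketch`, the precision parameter `p`, and the hash function are all assumptions made for this example. It shows the two points raised above: every value is hashed and folded into the registers (nothing is sampled, so the scan still touches every row), and the state that has to be kept on the server and returned to the client is a fixed array of 2^p bytes that can be merged by taking the register-wise maximum.

```java
import java.nio.charset.StandardCharsets;

/** Minimal HyperLogLog sketch. Hypothetical example, not Phoenix's implementation. */
final class HllSketch {
    private final int p;            // number of hash bits used to pick a register
    private final byte[] registers; // m = 2^p registers, one byte each

    HllSketch(int p) {
        if (p < 4 || p > 16) throw new IllegalArgumentException("p must be in [4, 16]");
        this.p = p;
        this.registers = new byte[1 << p];
    }

    /** 64-bit FNV-1a followed by a MurmurHash3-style finalizer to spread the bits. */
    private static long hash(byte[] data) {
        long h = 0xcbf29ce484222325L;
        for (byte b : data) {
            h ^= (b & 0xff);
            h *= 0x100000001b3L;
        }
        h ^= h >>> 33;
        h *= 0xff51afd7ed558ccdL;
        h ^= h >>> 33;
        h *= 0xc4ceb9fe1a85ec53L;
        h ^= h >>> 33;
        return h;
    }

    /** Every value is hashed and recorded; duplicates map to the same register and rank. */
    void add(String value) {
        long h = hash(value.getBytes(StandardCharsets.UTF_8));
        int idx = (int) (h >>> (64 - p)); // top p bits select the register
        long rest = h << p;               // remaining bits, left-aligned
        int rank = (rest == 0) ? (64 - p + 1) : Long.numberOfLeadingZeros(rest) + 1;
        if (rank > registers[idx]) registers[idx] = (byte) rank;
    }

    /** Partial sketches (for example one per region) combine by register-wise max. */
    void merge(HllSketch other) {
        if (other.p != p) throw new IllegalArgumentException("precision mismatch");
        for (int i = 0; i < registers.length; i++) {
            if (other.registers[i] > registers[i]) registers[i] = other.registers[i];
        }
    }

    /** Raw HyperLogLog estimate with the small-range (linear counting) correction. */
    long estimate() {
        int m = registers.length;
        double alpha = (m == 16) ? 0.673 : (m == 32) ? 0.697 : (m == 64) ? 0.709
                : 0.7213 / (1.0 + 1.079 / m);
        double sum = 0.0;
        int zeros = 0;
        for (byte r : registers) {
            sum += 1.0 / (1L << r);
            if (r == 0) zeros++;
        }
        double e = alpha * m * (double) m / sum;
        if (e <= 2.5 * m && zeros > 0) e = m * Math.log((double) m / zeros);
        return Math.round(e);
    }

    /** Size of the state that would be shipped between server and client. */
    int stateBytes() {
        return registers.length;
    }

    public static void main(String[] args) {
        HllSketch sketch = new HllSketch(12);
        for (int i = 0; i < 100_000; i++) sketch.add("user-" + i);
        System.out.println("estimate=" + sketch.estimate()
                + " stateBytes=" + sketch.stateBytes());
    }
}
```

With `p = 12` the sketch holds 4096 bytes no matter how many values are added, and the expected relative error is roughly 1.04 / sqrt(4096), about 1.6%. Since `add` runs once per input value, guideposts and TABLESAMPLE do not change what the sketch sees, which is consistent with the point made in the comment.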