Return-Path: X-Original-To: archive-asf-public-internal@cust-asf2.ponee.io Delivered-To: archive-asf-public-internal@cust-asf2.ponee.io Received: from cust-asf.ponee.io (cust-asf.ponee.io [163.172.22.183]) by cust-asf2.ponee.io (Postfix) with ESMTP id 4F5972004F1 for ; Wed, 30 Aug 2017 20:06:05 +0200 (CEST) Received: by cust-asf.ponee.io (Postfix) id 4DA071689D5; Wed, 30 Aug 2017 18:06:05 +0000 (UTC) Delivered-To: archive-asf-public@cust-asf.ponee.io Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by cust-asf.ponee.io (Postfix) with SMTP id 944B616893F for ; Wed, 30 Aug 2017 20:06:04 +0200 (CEST) Received: (qmail 59484 invoked by uid 500); 30 Aug 2017 18:06:02 -0000 Mailing-List: contact dev-help@phoenix.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: dev@phoenix.apache.org Delivered-To: mailing list dev@phoenix.apache.org Received: (qmail 59473 invoked by uid 99); 30 Aug 2017 18:06:02 -0000 Received: from pnap-us-west-generic-nat.apache.org (HELO spamd3-us-west.apache.org) (209.188.14.142) by apache.org (qpsmtpd/0.29) with ESMTP; Wed, 30 Aug 2017 18:06:02 +0000 Received: from localhost (localhost [127.0.0.1]) by spamd3-us-west.apache.org (ASF Mail Server at spamd3-us-west.apache.org) with ESMTP id 3BEA7180DFA for ; Wed, 30 Aug 2017 18:06:02 +0000 (UTC) X-Virus-Scanned: Debian amavisd-new at spamd3-us-west.apache.org X-Spam-Flag: NO X-Spam-Score: -100.002 X-Spam-Level: X-Spam-Status: No, score=-100.002 tagged_above=-999 required=6.31 tests=[RP_MATCHES_RCVD=-0.001, SPF_PASS=-0.001, USER_IN_WHITELIST=-100] autolearn=disabled Received: from mx1-lw-us.apache.org ([10.40.0.8]) by localhost (spamd3-us-west.apache.org [10.40.0.10]) (amavisd-new, port 10024) with ESMTP id mrWa9HtJB6AR for ; Wed, 30 Aug 2017 18:06:01 +0000 (UTC) Received: from mailrelay1-us-west.apache.org (mailrelay1-us-west.apache.org [209.188.14.139]) by mx1-lw-us.apache.org (ASF Mail Server at mx1-lw-us.apache.org) with ESMTP id 2EAFD5FD42 for ; Wed, 30 Aug 2017 18:06:01 +0000 (UTC) Received: from jira-lw-us.apache.org (unknown [207.244.88.139]) by mailrelay1-us-west.apache.org (ASF Mail Server at mailrelay1-us-west.apache.org) with ESMTP id 7E6FEE00DF for ; Wed, 30 Aug 2017 18:06:00 +0000 (UTC) Received: from jira-lw-us.apache.org (localhost [127.0.0.1]) by jira-lw-us.apache.org (ASF Mail Server at jira-lw-us.apache.org) with ESMTP id 2F31723F0D for ; Wed, 30 Aug 2017 18:06:00 +0000 (UTC) Date: Wed, 30 Aug 2017 18:06:00 +0000 (UTC) From: "Ethan Wang (JIRA)" To: dev@phoenix.apache.org Message-ID: In-Reply-To: References: Subject: [jira] [Commented] (PHOENIX-418) Support approximate COUNT DISTINCT MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit X-JIRA-FingerPrint: 30527f35849b9dde25b450d4833f0394 archived-at: Wed, 30 Aug 2017 18:06:05 -0000 [ https://issues.apache.org/jira/browse/PHOENIX-418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16147704#comment-16147704 ] Ethan Wang commented on PHOENIX-418: ------------------------------------ I see. I forgot to mention that, for me the local test to work, I have to copy both the compiled phoenix-server.jar as well as a stream.jar to hbase lib folder. I will go prepare another patch > Support approximate COUNT DISTINCT > ---------------------------------- > > Key: PHOENIX-418 > URL: https://issues.apache.org/jira/browse/PHOENIX-418 > Project: Phoenix > Issue Type: New Feature > Reporter: James Taylor > Assignee: Ethan Wang > Priority: Blocker > Labels: gsoc2016 > Fix For: 4.12.0 > > Attachments: PHOENIX-418-v1.patch, PHOENIX-418-v2.patch, PHOENIX-418-v3.patch, PHOENIX-418-v4.patch, PHOENIX-418-v5.patch, PHOENIX-418-v6.patch > > > Support an "approximation" of count distinct to prevent having to hold on to all distinct values (since this will not scale well when the number of distinct values is huge). The Apache Drill folks have had some interesting discussions on this [here](http://mail-archives.apache.org/mod_mbox/incubator-drill-dev/201306.mbox/%3CJIRA.12650169.1369931282407.88049.1370645900553%40arcas%3E). They recommend using [Welford's method](http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance_Online_algorithm). I'm open to having a config option that uses exact versus approximate. I don't have experience implementing an approximate implementation, so I'm not sure how much state is required to keep on the server and return to the client (other than realizing it'd be much less that returning all distinct values and their counts). > Update: > Syntax of using approximate count distinct as: > select APPROX_COUNT_DISTINCT(name) from person > select APPROX_COUNT_DISTINCT(address||name) from person > It is equivalent of Select COUNT(DISTINCT ID) from person. Implemented using hyperloglog, see discuss below. > Source code patch link below, co-authorred with [~swapna] > https://git-wip-us.apache.org/repos/asf?p=phoenix.git;a=commitdiff;h=d6381afc3af976ccdbb874d4458ea17b1e8a1d32 -- This message was sent by Atlassian JIRA (v6.4.14#64029)