Return-Path: X-Original-To: apmail-flink-issues-archive@minotaur.apache.org Delivered-To: apmail-flink-issues-archive@minotaur.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id DBA031769D for ; Wed, 9 Sep 2015 02:51:45 +0000 (UTC) Received: (qmail 43071 invoked by uid 500); 9 Sep 2015 02:51:45 -0000 Delivered-To: apmail-flink-issues-archive@flink.apache.org Received: (qmail 43023 invoked by uid 500); 9 Sep 2015 02:51:45 -0000 Mailing-List: contact issues-help@flink.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: dev@flink.apache.org Delivered-To: mailing list issues@flink.apache.org Received: (qmail 43013 invoked by uid 99); 9 Sep 2015 02:51:45 -0000 Received: from arcas.apache.org (HELO arcas.apache.org) (140.211.11.28) by apache.org (qpsmtpd/0.29) with ESMTP; Wed, 09 Sep 2015 02:51:45 +0000 Date: Wed, 9 Sep 2015 02:51:45 +0000 (UTC) From: "ASF GitHub Bot (JIRA)" To: issues@flink.apache.org Message-ID: In-Reply-To: References: Subject: [jira] [Commented] (FLINK-2533) Gap based random sample optimization MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit X-JIRA-FingerPrint: 30527f35849b9dde25b450d4833f0394 [ https://issues.apache.org/jira/browse/FLINK-2533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14736042#comment-14736042 ] ASF GitHub Bot commented on FLINK-2533: --------------------------------------- Github user ChengXiangLi commented on a diff in the pull request: https://github.com/apache/flink/pull/1110#discussion_r39002862 --- Diff: flink-java/src/main/java/org/apache/flink/api/java/sampling/PoissonSampler.java --- @@ -93,29 +98,59 @@ public boolean hasNext() { } } } - - private void moveToNextElement() { - while (input.hasNext()) { + + public int poisson_ge1(double p){ + // sample 'k' from Poisson(p), conditioned to k >= 1 + double q = Math.pow(Math.E, -p); + // simulate a poisson trial such that k >= 1 + double t = q + (1 - q)*random.nextDouble(); + int k = 1; + // continue standard poisson generation trials + t = t * random.nextDouble(); + while (t > q) { + k++; + t = t * random.nextDouble(); + } + return k; + } + + private void moveToNextElement(int num) { + // skip elements with replication factor zero + int elementCount = 0; + while (input.hasNext() && elementCount < num){ currentElement = input.next(); - currentCount = poissonDistribution.sample(); - if (currentCount > 0) { - break; + elementCount++; + } + } + + private void samplingProcess(){ + if (fraction <= THRESHOLD) { + double u = Math.max(random.nextDouble(), EPSILON); + int gap = (int) (Math.log(u) / -fraction); + moveToNextElement(gap); + if (input.hasNext()) { + currentElement = input.next(); + currentCount = poisson_ge1(fraction); + } + } + else { --- End diff -- Format: else should follow the right curly brace in the same line. > Gap based random sample optimization > ------------------------------------ > > Key: FLINK-2533 > URL: https://issues.apache.org/jira/browse/FLINK-2533 > Project: Flink > Issue Type: Improvement > Components: Core > Reporter: Chengxiang Li > Priority: Minor > > For random sampler with fraction, like BernoulliSampler and PoissonSampler, Gap based random sampler could exploit O(k) sample implementation instead of previous O\(n\) sample implementation, it should perform better while sample fraction is very small. [This blog|http://erikerlandson.github.io/blog/2014/09/11/faster-random-samples-with-gap-sampling/] describes more detail about gap based random sampler. -- This message was sent by Atlassian JIRA (v6.3.4#6332)