Date: Thu, 20 Aug 2015 17:07:45 +0000 (UTC)
From: "ASF GitHub Bot (JIRA)"
To: issues@flink.apache.org
Reply-To: dev@flink.apache.org
Subject: [jira] [Commented] (FLINK-2030) Implement an online histogram with Merging and equalization features

[ https://issues.apache.org/jira/browse/FLINK-2030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14705294#comment-14705294 ]

ASF GitHub Bot commented on FLINK-2030:
---------------------------------------

Github user tillrohrmann commented on a diff in the pull request:

https://github.com/apache/flink/pull/861#discussion_r37554555

--- Diff: docs/libs/ml/statistics.md ---
@@ -0,0 +1,69 @@
+---
+mathjax: include
+htmlTitle: FlinkML - Statistics
+title: FlinkML - Statistics
+---
+
+
+* This will be replaced by the TOC
+{:toc}
+
+## Description
+
+ The statistics utility provides features such as building histograms over data.
+
+## Methods
+
+ The Statistics utility provides two major functions: `createHistogram` and
+ `createDiscreteHistogram`.
+
+### Creating a histogram
+
+ There are two types of histograms:
+ 1. **Continuous Histograms**: These histograms are formed on a data set `X: DataSet[Double]`
+ when the values in `X` are from a continuous range. These histograms support
+ `quantile` and `sum` operations. Here `quantile(q)` refers to a value $x_q$ such that $|x: x
+ \leq x_q| = q * |X|$. Further, `sum(s)` refers to the number of elements $x \leq s$, which can
+ be construed as a cumulative probability value at $s$[Of course, *scaled* probability].
--- End diff --

I understand what you want to say, but I think it's not well formulated. IMO it's better to clearly define what `sum(s)`, or better `count(s)`, means. E.g. "The value sum(s) represents the number of elements in X whose value is less than s", as you've said. But the rest is not necessary.

> Implement an online histogram with Merging and equalization features
> --------------------------------------------------------------------
>
> Key: FLINK-2030
> URL: https://issues.apache.org/jira/browse/FLINK-2030
> Project: Flink
> Issue Type: Sub-task
> Components: Machine Learning Library
> Reporter: Sachin Goel
> Assignee: Sachin Goel
> Priority: Minor
> Labels: ML
>
> For the implementation of the decision tree in https://issues.apache.org/jira/browse/FLINK-1727, we need to implement a histogram with online updates, merging and equalization features. A reference implementation is provided in [1].
>
> [1] http://www.jmlr.org/papers/volume11/ben-haim10a/ben-haim10a.pdf

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
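
For reference, the two operations discussed in the review comment can be pinned down with a small exact computation. The sketch below is only an illustration of the definitions quoted in the diff, assuming an in-memory sorted list; it is not the FlinkML API and not the approximate online histogram from [1]. The names `count_leq` and `quantile` are hypothetical, and the sketch follows the diff's $x \leq s$ convention rather than a strict "less than".

```python
import bisect
import math


def count_leq(sorted_values, s):
    """Number of elements x in sorted_values with x <= s (the diff's `sum(s)`)."""
    return bisect.bisect_right(sorted_values, s)


def quantile(sorted_values, q):
    """Smallest element x_q such that at least q * n elements are <= x_q."""
    if not sorted_values:
        raise ValueError("quantile of an empty data set is undefined")
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must lie in [0, 1]")
    n = len(sorted_values)
    # Rank of the element to return; clamp to 1 so that q = 0 yields the minimum.
    k = max(1, math.ceil(q * n))
    return sorted_values[k - 1]


data = sorted([4.0, 1.0, 3.0, 2.0, 5.0, 8.0, 7.0, 6.0])
x_q = quantile(data, 0.25)
# With 8 elements, q = 0.25 selects the 2nd smallest value.
print(x_q, count_leq(data, x_q), count_leq(data, 4.5))  # prints: 2.0 2 4
```

On this data the two operations are consistent with each other: `count_leq(data, quantile(data, q))` equals `q * len(data)` whenever that product is an integer and the values are distinct. With duplicates or a non-integer product, the count can only be at least `q * len(data)`.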