From: chiwanpark
To: issues@flink.incubator.apache.org
Reply-To: issues@flink.incubator.apache.org
Subject: [GitHub] flink pull request: [Flink-2030][ml]Data Set Statistics and Histog...
Content-Type: text/plain
Message-Id: <20150816093844.EED08DFD7B@git1-us-west.apache.org>
Date: Sun, 16 Aug 2015 09:38:44 +0000 (UTC)

Github user chiwanpark commented on a diff in the pull request:

    https://github.com/apache/flink/pull/861#discussion_r37144137

    --- Diff: docs/libs/ml/statistics.md ---
    @@ -0,0 +1,100 @@
    +---
    +mathjax: include
    +htmlTitle: FlinkML - Statistics
    +title: FlinkML - Statistics
    +---
    +
    +
    +* This will be replaced by the TOC
    +{:toc}
    +
    +## Description
    +
    + The statistics utility provides features such as building histograms over data, determining
    + mean, variance, gini impurity, entropy etc. of data.
    +
    +## Methods
    +
    + The Statistics utility provides two major functions: `createHistogram` and `dataStats`.
    +
    +### Creating a histogram
    +
    + There are two types of histograms:
    + 1. Continuous Histograms: These histograms are formed on a data set `X:
    + DataSet[Double]`
    + when the values in `X` are from a continuous range. These histograms support
    + `quantile` and `sum` operations. Here `quantile(q)` refers to a value $x_q$ such that $|x: x
    + \leq x_q| = q * |X|$. Further, `sum(s)` refers to the number of elements $x \leq s$, which can
    + be construed as a cumulative probability value at $s$[Of course, scaled probability].
    --- End diff --

    `` tag can be replaced by `*`. `scaled` can be represented as `*scaled*`.