Date: Mon, 13 Jul 2015 02:02:04 +0000 (UTC)
From: "Russell Melick (JIRA)"
To: dev@datafu.incubator.apache.org
Subject: [jira] [Commented] (DATAFU-98) New UDF for Histogram / Frequency counting

    [ https://issues.apache.org/jira/browse/DATAFU-98?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14624120#comment-14624120 ]

Russell Melick commented on DATAFU-98:
--------------------------------------

Posted RB: https://reviews.apache.org/r/36439/

> New UDF for Histogram / Frequency counting
> ------------------------------------------
>
>                 Key: DATAFU-98
>                 URL: https://issues.apache.org/jira/browse/DATAFU-98
>             Project: DataFu
>          Issue Type: New Feature
>            Reporter: Russell Melick
>         Attachments: DATAFU-98.patch
>
>
> I was thinking of creating a new UDF to compute histograms / frequency counts of input bags. It seems like it would make sense to support ints, longs, floats, and doubles.
> I tried looking around to see if this was already implemented, but ValueHistogram and AggregateWordHistogram were about the only things I found. They seem to exist as an example job, and only work for Strings.
> https://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapred/lib/aggregate/ValueHistogram.html
> https://hadoop.apache.org/docs/r1.2.1/api/org/apache/hadoop/examples/AggregateWordHistogram.html
> Should the user specify the bin size or the number of bins? Specifying bin size probably makes the implementation simpler since you can bin things without having seen all of the data.
> I think it would make sense to implement a version of this that didn't need any reducers. It could use counters to keep track of the counts per bin without sending any data to a reducer.
> You would be able to call this without a preceding GROUP BY as well.
> Here's my proposal for the two UDFs. This assumes the input data is two columns, memberId and numConnections.
> {code}
> DEFINE BinnedFrequency datafu.pig.stats.BinnedFrequency('min=0;binSize=50');
> connections = LOAD 'connections' AS (memberId, numConnections);
> connectionHistogram = FOREACH (GROUP connections ALL) GENERATE BinnedFrequency(connections.numConnections);
> {code}
> The output here would be a bag with the frequency counts
> {code}
> {('0-49', 5), ('50-99', 0), ('100-149', 10)}
> {code}
> {code}
> DEFINE BinnedFrequencyCounter datafu.pig.stats.BinnedFrequencyCounter('min=0;binSize=50;name=numConnectionsHistogram');
> connections = LOAD 'connections' AS (memberId, numConnections);
> connections = FOREACH connections GENERATE BinnedFrequencyCounter(numConnections);
> {code}
> The output here would just be a counter for each bin, all sharing the same group of numConnectionsHistogram. It would look something like
> numConnectionsHistogram.'0-49' = 5
> numConnectionsHistogram.'50-99' = 0
> numConnectionsHistogram.'100-149' = 10



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
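
The fixed-bin-size approach described in the proposal above can be sketched in a few lines. This is only an illustration in plain Python, not the patch attached to the issue and not Pig UDF code. The function names (bin_index, bin_label, binned_frequency) are hypothetical, and three behaviours are assumptions the proposal does not spell out: values are integers, values below min are ignored, and empty bins up to the highest occupied bin are reported with a count of 0 (as the ('50-99', 0) entry in the example output suggests).

```python
from collections import Counter


def bin_index(value, min_value, bin_size):
    """Index of the fixed-width bin holding value (0 is the bin starting at min_value)."""
    return (value - min_value) // bin_size


def bin_label(index, min_value, bin_size):
    """Inclusive integer range label for a bin, e.g. '0-49' (integer inputs assumed)."""
    low = min_value + index * bin_size
    return f"{low}-{low + bin_size - 1}"


def binned_frequency(values, min_value=0, bin_size=50):
    """Count values per bin; each value is binned on its own, with no prior pass over the data."""
    # Assumption: values below min_value are skipped rather than raising an error.
    counts = Counter(
        bin_index(v, min_value, bin_size) for v in values if v >= min_value
    )
    if not counts:
        return []
    # Assumption: report every bin from the first up to the highest occupied one,
    # so that empty bins in between show a count of 0.
    return [
        (bin_label(i, min_value, bin_size), counts[i])
        for i in range(max(counts) + 1)
    ]


if __name__ == "__main__":
    sample = [10] * 5 + [120] * 10
    print(binned_frequency(sample, min_value=0, bin_size=50))
```

Because the bin index depends only on the value, min, and binSize, the same per-record computation could feed either a bag of counts or a per-bin counter. Specifying the number of bins instead would require knowing the data range first.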