datafu-dev mailing list archives

From Mitul Tiwari <mitultiw...@gmail.com>
Subject Re: [jira] [Created] (DATAFU-98) New UDF for Histogram / Frequency counting
Date Sun, 12 Jul 2015 18:26:26 GMT
What about the Quantile UDF in DataFu?
http://datafu.incubator.apache.org/docs/datafu/1.1.0/datafu/pig/stats/Quantile.html

Is that useful here? If not, can it be modified to cover Russell's use
case?

Thanks,
Mitul


On Sun, Jul 12, 2015 at 11:16 AM, Russell Melick (JIRA) <jira@apache.org>
wrote:

> Russell Melick created DATAFU-98:
> ------------------------------------
>
>              Summary: New UDF for Histogram / Frequency counting
>                  Key: DATAFU-98
>                  URL: https://issues.apache.org/jira/browse/DATAFU-98
>              Project: DataFu
>           Issue Type: New Feature
>             Reporter: Russell Melick
>
>
> I was thinking of creating a new UDF to compute histograms / frequency
> counts of input bags.  It seems like it would make sense to support ints,
> longs, floats, and doubles.
>
> I tried looking around to see if this was already implemented, but
> ValueHistogram and AggregateWordHistogram were about the only things I
> found.  They seem to exist as an example job, and only work for Strings.
>
> https://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapred/lib/aggregate/ValueHistogram.html
>
> https://hadoop.apache.org/docs/r1.2.1/api/org/apache/hadoop/examples/AggregateWordHistogram.html
>
> Should the user specify the bin size or the number of bins?  Specifying
> bin size probably makes the implementation simpler since you can bin things
> without having seen all of the data.
>
> I think it would make sense to implement a version of this that didn't
> need any reducers.  It could use counters to keep track of the counts per
> bin without sending any data to a reducer.  You would be able to call this
> without a preceding GROUP BY as well.
>
> Here's my proposal for the two udfs.  This assumes the input data is two
> columns, memberId and numConnections.
> {code}
> DEFINE BinnedFrequency datafu.pig.stats.BinnedFrequency('min=0;binSize=50');
>
> connections = LOAD 'connections' AS (memberId, numConnections);
> connectionHistogram = FOREACH (GROUP connections ALL) GENERATE
> BinnedFrequency(connections.numConnections);
> {code}
>
> The output here would be a bag with the frequency counts:
> {code}
> {('0-49', 5), ('50-99', 0), ('100-149', 10)}
> {code}
>
> {code}
> DEFINE BinnedFrequencyCounter
> datafu.pig.stats.BinnedFrequencyCounter('min=0;binSize=50;name=numConnectionsHistogram');
>
> connections = LOAD 'connections' AS (memberId, numConnections);
> connections = FOREACH connections GENERATE
> BinnedFrequencyCounter(numConnections);
> {code}
>
> The output here would just be a counter for each bin, all sharing the same
> group of numConnectionsHistogram.  It would look something like
>
> numConnectionsHistogram.'0-49' = 5
> numConnectionsHistogram.'50-99' = 0
> numConnectionsHistogram.'100-149' = 10
>
>
>
> --
> This message was sent by Atlassian JIRA
> (v6.3.4#6332)
>
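
For reference, the fixed-bin-size counting described in the quoted proposal can be sketched in a few lines of Python. This is only an illustration of the idea, not the proposed UDF: the function names (bin_index, bin_label, binned_frequency) are hypothetical, the defaults mirror the 'min=0;binSize=50' options from the example, integer inputs are assumed so that labels such as '0-49' make sense, and rejecting values below the minimum is an assumption the proposal does not address. Because the bin size is fixed up front, each value can be assigned to its bin as it arrives, which is the single-pass property the proposal points out.

```python
from collections import Counter


def bin_index(value, min_value=0, bin_size=50):
    """Return the zero-based index of the fixed-width bin containing value."""
    if bin_size <= 0:
        raise ValueError("bin_size must be positive")
    if value < min_value:
        # Assumption: values below the minimum are rejected; the proposal
        # does not say how they should be handled.
        raise ValueError("value is below min_value")
    return (value - min_value) // bin_size


def bin_label(index, min_value=0, bin_size=50):
    """Return a label such as '0-49' for the bin at index (integer bins assumed)."""
    low = min_value + index * bin_size
    return f"{low}-{low + bin_size - 1}"


def binned_frequency(values, min_value=0, bin_size=50):
    """Count values per bin in one pass, reporting empty bins up to the highest occupied one."""
    counts = Counter(bin_index(v, min_value, bin_size) for v in values)
    if not counts:
        return []
    return [
        (bin_label(i, min_value, bin_size), counts[i])
        for i in range(max(counts) + 1)
    ]


if __name__ == "__main__":
    # Five members with 10 connections and ten members with 120 connections.
    print(binned_frequency([10] * 5 + [120] * 10))
```

With the sample data in the usage lines, the result has the same shape as the bag in the proposal: one (label, count) pair per bin, including the empty '50-99' bin. A counter-based variant would increment one counter per label instead of returning a list.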
