hadoop-common-user mailing list archives

From Alex Kozlov <ale...@cloudera.com>
Subject Re: How do I generate a histogram?
Date Mon, 09 May 2011 20:48:32 GMT
You can also just use Hive/Pig to get the answers if you code the UDFs:

select f(value), count(1) from your_table group by f(value)

Something similar in Pig (again with f as your own UDF):

a = LOAD 'your_data.txt' AS (value:int);
b = FOREACH a GENERATE f($0);
c = GROUP b BY $0;
d = FOREACH c GENERATE group, COUNT(b);
dump d;
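
If it helps to see the same group-and-count logic outside Hive/Pig, here is a minimal plain-Java sketch with no Hadoop dependency. It only illustrates the idea behind the word-count approach quoted below (emit f(value) as the key with 1 as the value, then sum the 1s per key); it is not a MapReduce job. The class name Histogram and the body of f (string length) are assumptions made up for the example, since the thread never says what f actually computes.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class Histogram {
    // Hypothetical stand-in for f(value); replace with your real function.
    static int f(String value) {
        return value.length();
    }

    // For each value, add 1 to the bucket keyed by f(value).
    // This mirrors "mapper emits (f(value), 1), reducer sums the 1s".
    static Map<Integer, Long> histogram(List<String> values) {
        Map<Integer, Long> counts = new TreeMap<>(); // sorted by bucket
        for (String value : values) {
            counts.merge(f(value), 1L, Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // With f = length, these values map to 3, 4, 3, 5.
        List<String> values = List.of("abc", "abcd", "xyz", "abcde");
        histogram(values).forEach(
            (bucket, count) -> System.out.println(bucket + ", " + count));
    }
}
```

With the four sample values this prints one "bucket, count" line per distinct f(value), in the same shape as the desired output in the original question.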

On Mon, May 9, 2011 at 1:22 PM, Soren Flexner <sflexner@gmail.com> wrote:

> It's word count. The mapper takes f(value) and outputs that as the key,
> with 1 as the value. The reducer outputs the key (i.e., f(value)) as the key
> and the sum of all the 1's as the value.
> You should be able to just tweak WordCount.java to get what you want.
>
> -s
>
> On May 9, 2011, at 1:12 PM, "W.P. McNeill" <billmcn@gmail.com> wrote:
>
> > I have a set of (key, value) pairs. For each value there is a function
> > f(value) that returns an integer. I want to generate a histogram over
> > f(value) for my data set.  For example, representing the values as
> > [f(value)] if I have the data set
> >
> > key1, [3]
> > key2, [4]
> > key3, [3]
> > key4, [5]
> >
> > I'd want to produce
> >
> > 3, 2
> > 4, 1
> > 5, 1
> >
> > because f(value) = 3 appears twice in my data set while f(value) = 4 and
> > f(value) = 5 each appears once.
> >
> > I gather the right way to do this is to use the Aggregator framework, but I
> > can't understand the documentation.  I've read the API docs for
> > ValueAggregatorDescriptor
> > <http://hadoop.apache.org/mapreduce/docs/r0.21.0/api/org/apache/hadoop/mapreduce/lib/aggregate/ValueAggregatorDescriptor.html>
> > and related classes and looked at the Aggregate*.java files in the examples
> > directory, but it's still not making sense to me.  (This may in part be due
> > to the fact that the examples are still for the old API while I'm working in
> > the new API, though I'm not sure.)
> >
> > Can someone point me to clearer documentation online or in print, or provide
> > a simple example for my task?
> >
> > Thanks.
>
