Hi all,
I'm new to using Hadoop so I'm hoping to get a little guidance on what
the best way to solve a particular class of problems would be.
The general use case is this: from a very small set of data, I will
generate a massive set of pairs of values, i.e., <A,B>. I would like to
compute the maximum likelihood estimate (MLE) of the conditional
probability P(A|B). It is obvious to me how to compute the counts
C(<A,B>), and equally obvious how to compute the counts C(<A,*>) or
C(<*,B>); what I need, however, is the ratio C(<A,B>)/C(<*,B>).
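For concreteness, the MLE here is just a relative frequency. Setting Hadoop aside for a moment, a minimal plain-Python sketch of the quantity I want (the data and names are just toy examples of mine):

```python
from collections import Counter

# Toy data: a list of <A,B> pairs.
pairs = [("x", "b"), ("y", "b"), ("x", "b"), ("z", "c")]

pair_counts = Counter(pairs)                   # C(<A,B>)
context_counts = Counter(b for _, b in pairs)  # C(<*,B>)

# MLE of P(A|B) = C(<A,B>) / C(<*,B>)
p = {(a, b): n / context_counts[b] for (a, b), n in pair_counts.items()}
# e.g. p[("x", "b")] is 2/3, since "x" occurs with "b" twice out of three.
```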
My approach:
My initial approach to decomposing this problem is to use a mapper to
go from my input data to <A,B> pairs, and then a reducer to go from
<A,B> pairs to counts C(<A,B>). However, at that point I'd like a
second, reducer-like step (call it Normalize) to run, which aggregates
all the C(<*,B>) counts and returns a value P(A|B) for each A that
occurs with B. This is where things get fuzzy for me. How do I do
this? A reducer can only return a single value (for example, if I make
B the key for Normalize, it could return C(B) very easily). What I
need is a value type that reduce can return that is essentially a list
of (key, value) pairs. Does such a thing exist? Am I approaching this
the wrong way?
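In case it helps clarify what I'm imagining for the Normalize step: if the reducer for that second stage keys on B, it would need to emit one output pair per A seen with that B (as I understand it, Hadoop reducers can in fact emit multiple pairs via the output collector, which may be the answer to my own question). Here is a plain-Python simulation of the dataflow I have in mind; all names and the toy counts are mine:

```python
from collections import defaultdict

# Output of the first MapReduce stage: ((a, b), count) pairs, i.e. C(<A,B>).
stage1 = [(("x", "b"), 2), (("y", "b"), 1), (("z", "c"), 1)]

# "Map" for Normalize: re-key each count by B, so every C(<A,B>)
# sharing a B lands in the same reduce group.
groups = defaultdict(list)
for (a, b), n in stage1:
    groups[b].append((a, n))

# "Reduce" for Normalize: sum the group to get C(<*,B>), then emit
# one ((a, b), P(A|B)) pair per A in the group.
output = []
for b, items in groups.items():
    total = sum(n for _, n in items)       # C(<*,B>)
    for a, n in items:
        output.append(((a, b), n / total)) # MLE of P(A|B)
# e.g. (("x", "b"), 2/3) ends up in output.
```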
Thanks for any assistance!
Chris

Chris Dyer
Dept. of Linguistics
University of Maryland
1401 Marie Mount Hall
College Park MD 20742
