For KMeans, that is.
For other methods, there is either a set of sufficient statistics analogous
to the sum and count, an approximation of one, or else the combiner can't be
used at all.
For instance, for something like a median, you can pass around a sample of
(say) at most 100 points together with a count of how many points that sample
represents. Merging two such sets consists of sampling from each in
proportion to the number of elements it represents. In the end, you have up
to 100 points randomly sampled from everything the reducer would have seen,
which gives you a decent estimate of the median.
This isn't as good as the sums because the samples are bigger and the median
is only approximated, but it does deal with the problem of massive data going
to the reducer in the event of imbalance.
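As a rough illustration of this bounded-sample trick, here is a minimal Python sketch (Mahout itself is Java; all function names here are invented for the example):

```python
import random

def merge_samples(sample_a, count_a, sample_b, count_b, cap=100):
    """Merge two bounded samples, drawing from each side in
    proportion to the number of points it represents."""
    total = count_a + count_b
    k = min(cap, len(sample_a) + len(sample_b))
    # How many of the k slots side A gets, weighted by represented counts.
    k_a = min(len(sample_a), round(k * count_a / total))
    k_b = min(len(sample_b), k - k_a)
    merged = random.sample(sample_a, k_a) + random.sample(sample_b, k_b)
    return merged, total

def approx_median(sample):
    """Median of the surviving sample, standing in for the true median."""
    s = sorted(sample)
    n = len(s)
    return s[n // 2] if n % 2 else (s[n // 2 - 1] + s[n // 2]) / 2
```

Each combine step keeps at most `cap` points, so the reducer sees a bounded amount of data no matter how skewed the key distribution is.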
On Thu, Jun 11, 2009 at 12:54 PM, Ted Dunning <ted.dunning@gmail.com> wrote:
>
> For L_2 centroids, you just have to have the mapper emit a trivial sum and
> a count (of 1). The combiner should take a list of vector sums and counts
> and produce a combined sum and count.
>
> Then the reducer will get sums and counts, and it should add them together
> and divide by the count.
>
> (just like an n-dimensional word count!)
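(To make the sum-and-count scheme above concrete, a minimal sketch in Python; Mahout itself is Java, and these function names are invented for illustration:)

```python
def combine(pairs):
    """Combiner (and first half of the reducer): add up vector
    sums and counts from a list of (vector, count) pairs."""
    dim = len(pairs[0][0])
    total = [0.0] * dim
    count = 0
    for vec, n in pairs:
        total = [t + v for t, v in zip(total, vec)]
        count += n
    return total, count

def reduce_centroid(pairs):
    """Reducer: combine all partial sums, then divide by the count."""
    total, count = combine(pairs)
    return [t / count for t in total]
```

Because vector addition is associative, the combiner can be applied any number of times before the reducer without changing the final centroid.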
>
>
> On Thu, Jun 11, 2009 at 9:49 AM, Adil Aijaz <adil@yahooinc.com> wrote:
>
>> Jeff,
>>
>> Thanks for the quick turnaround on this issue. Just tested it and the
>> canopy creation and kmeans both work now on syntheticcontroldata. I get 7
>> canopies and 7 clusters. The collection logic in close() is not pretty, but I
>> can't think of a workaround myself.
>>
>> adil
>>
>>
>> Jeff Eastman wrote:
>>
>>> r783617 removed the CanopyCombiner and refactored its semantics back into
>>> the reducer. Updated unit tests pass and Synthetic Control with Canopy
>>> produces 6 clusters. KMeans also produces 6 clusters. I really
>>> don't like doing stuff in close() but see no practical alternative. Ideas
>>> are still welcomed.
>>>
>>> Jeff
>>>
>>>
>>> Jeff Eastman wrote:
>>>
>>>> Adil Aijaz wrote:
>>>>
>>>>> 2. There is a bug in
>>>>> examples/src/main/java/org/apache/mahout/clustering/syntheticcontrol/kmeans/Job.java
>>>>> that called runJob from main function with my provided arguments transposed.
>>>>> So, my convergenceDelta was interpreted as t1, t1 as t2, and t2 as
>>>>> convergenceDelta. I will commit a patch as soon as I get approval for
>>>>> opensource commits from my employer; however, I thought I'd put it out
>>>>> there in case someone else is going through the same issue.
>>>>>
>>>>
>>>> r783585 fixed the parameter ordering bug. Still working on the
>>>> Combiner problem.
>>>>
>>>> Thanks Adil,
>>>> Jeff
>>>>
>>>>
>>>>
>>>
>>
>
>

Ted Dunning, CTO
DeepDyve
111 West Evelyn Ave. Ste. 202
Sunnyvale, CA 94086
http://www.deepdyve.com
8584140013 (m)
4087730220 (fax)
