Thank you Jeff for your advice,
I think the problems I encounter are characteristic of the structure of our dataset.
The cardinality of the vectors is 20K, whereas the average number of nonzero coordinates is
~50. In a sample, I found that on average 12% of the distances between vectors are at the
maximum (i.e. there is no overlap in the nonzero coordinates). Moreover, the same values
of T1 and T2 are used in the mappers and in the reducer. This poses another challenge, as the
distances among the centroids transferred to the reducer probably have a different distribution
than the distances between the original vectors.
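To illustrate the "no overlap means maximum distance" point, here is a minimal sketch. It assumes cosine distance and a dict-based sparse representation; neither is stated in this thread, and the vectors `u`, `v`, `w` are made up for the example.

```python
import math

def cosine_distance(a, b):
    """Cosine distance between two sparse vectors stored as {index: value} dicts.

    Assumption: cosine distance is used only for illustration; the thread
    does not say which distance measure the job was run with.
    """
    # Only coordinates that are nonzero in both vectors contribute to the dot product.
    dot = sum(value * b[index] for index, value in a.items() if index in b)
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    return 1.0 - dot / (norm_a * norm_b)

# Hypothetical sparse vectors in a 20K-dimensional space.
u = {3: 1.0, 17: 2.0}
v = {5: 4.0, 9000: 1.0}   # shares no nonzero coordinate with u
w = {3: 1.0, 5: 1.0}      # shares coordinate 3 with u

print(cosine_distance(u, v))  # 1.0, the maximum for nonnegative vectors
print(cosine_distance(u, w))  # strictly between 0 and 1
```

With ~50 nonzero coordinates out of 20K, many pairs look like `u` and `v`, so a threshold below the maximum cannot bind them to the same canopy.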
The process blows up either at the very beginning (too many centroids are created in the mappers)
or after the mappers transfer the centroids to the reducer (as far as I can see, only one reducer
is hardcoded, so everything has to be processed by one node).
Cheers
Szymon
On 28 February 2011 at 22:25, Jeff Eastman <jeastman@Narus.com> wrote:
> Canopy can be difficult to control and it appears you may have found a use case for not
enforcing T1>T2 (we don't). It is curious, though, that the settings you have chosen assign
points to canopies (dist<T2) but do not include all of their weights (T2>dist>T1)
in the centroids. What happens if you set T1=T2+epsilon; T2=1.9? That would at least follow
the rules and give you the same number of clusters, but it would also add the centers of the
outliers (dist>1.15). Is this where your processing time blows up?
>
> Original Message
> From: Szymon Chojnacki [mailto:sajmmon@o2.pl]
> Sent: Monday, February 28, 2011 11:55 AM
> To: user@mahout.apache.org
> Subject: T1 and T2 in Canopy
>
> Hello,
>
> I am working with my colleague Tim on the Mahout588 project (https://issues.apache.org/jira/browse/MAHOUT588).
The goal of the project is to compare Mahout's clustering algorithms using the ApacheMailArchives
dataset (6 million emails). I have spent the last few days trying to find values of T1 and
T2 that would give a nontrivial set of clusters (>1 and < # of all vectors) and
would output the result within, e.g., 3h.
>
> I would be grateful for your advice, as the only way I could do it was by breaking the
rule from the wiki that T1>T2. The problem is that if T1 is large, then we get many nonzero
coordinates in each canopy, and both memory and CPU demand grow. However, setting a low T1
results in a low T2, which leads to a large number of canopies and the same problem with memory
and CPU.
>
> My understanding of the source code is that T1 and T2 are independent. So I set T1=1.15
and T2=1.9. This setting let me obtain ~200 canopies after 40 mins.
>
> Thank you in advance for your suggestions on setting T1 and T2, and on the importance of
the T1>T2 constraint.
>
> Kind regards
> Szymon
>
> ps.
> I described my struggle in detail in https://issues.apache.org/jira/secure/attachment/12472217/mahout588_canopy.pdf.
>
>

Szymon Chojnacki
http://www.ipipan.eu/~sch/
