mahout-user mailing list archives

From Ted Dunning <>
Subject Re: Clustering a large crawl
Date Mon, 04 Jun 2012 16:14:50 GMT
Even having millions of dimensions isn't all that bad if that induces a
reasonable distance between documents.  The easy way to test that is to use
several document vectors as queries and see whether the closest other
documents appear to you to be very similar.  If this is true for a number
of documents, you should be good to go with whatever metric you are using.
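The query-and-eyeball check above can be sketched in a few lines. This is a minimal illustration, not Mahout code: the toy term-frequency vectors and page names are made up, and cosine distance stands in for whatever metric you are actually using.

```python
import math

# Hypothetical sparse term-frequency vectors (term -> count); in practice
# these would be the document vectors produced by your vectorization step.
docs = {
    "pageA": {"hadoop": 3, "cluster": 2, "mahout": 1},
    "pageB": {"hadoop": 2, "cluster": 3, "mahout": 2},
    "pageC": {"recipe": 4, "butter": 2, "flour": 3},
}

def cosine_distance(a, b):
    # 1 - cosine similarity over sparse dict vectors
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return 1.0 - dot / (na * nb)

def nearest(query, corpus):
    # rank every other document by distance to the query document
    others = [(cosine_distance(corpus[query], v), name)
              for name, v in corpus.items() if name != query]
    return sorted(others)

# Eyeball check: the closest hit for "pageA" should look genuinely similar.
print(nearest("pageA", docs))
```

If the top hits for several such query documents look similar to a human reader, the metric is doing its job at full dimensionality.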

For fast clustering, you may need a low-dimensional surrogate metric so
that you can get higher throughput, but the point of the low-dimensional
surrogate is that it *replicates* the behavior of the metric that you
really want.  It isn't going to make your metric better.
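One common way to build such a surrogate is a random projection in the Johnson-Lindenstrauss style: project every document through one shared random matrix and compute distances in the small space. The sketch below is an assumption about how one might do this, with made-up dimension counts and synthetic vectors; it is not the specific method Ted is recommending.

```python
import math
import random

random.seed(42)  # fixed seed so the sketch is reproducible

HIGH_DIM = 1000  # stand-in for a dictionary of hundreds of thousands of terms
LOW_DIM = 20     # surrogate dimension, chosen arbitrarily for illustration

# One shared projection matrix: LOW_DIM rows of HIGH_DIM Gaussian entries.
projection = [[random.gauss(0.0, 1.0) for _ in range(HIGH_DIM)]
              for _ in range(LOW_DIM)]

def project(vec):
    # Map a HIGH_DIM vector to LOW_DIM; the 1/sqrt(LOW_DIM) scale keeps
    # expected Euclidean distances comparable to the original space.
    scale = 1.0 / math.sqrt(LOW_DIM)
    return [scale * sum(row[i] * vec[i] for i in range(HIGH_DIM))
            for row in projection]

def euclid(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Two synthetic sparse documents in the high-dimensional space.
doc1 = [random.random() if i % 50 == 0 else 0.0 for i in range(HIGH_DIM)]
doc2 = [random.random() if i % 50 == 1 else 0.0 for i in range(HIGH_DIM)]

d_high = euclid(doc1, doc2)
d_low = euclid(project(doc1), project(doc2))
print(d_high, d_low)  # the surrogate distance should roughly track the true one
```

The point of the sketch matches the point of the paragraph: the low-dimensional distance is only useful insofar as it tracks the full metric, so compare the two on sample pairs before trusting the surrogate for clustering.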

On Mon, Jun 4, 2012 at 5:15 PM, Pat Ferrel <> wrote:

> After looking again at the dictionary for 150,000 web pages I have 259,000
> dimensions! Part of the problem is I can't get Tika to detect language very
> well (working on this), so I get groups of non-English pages that throw in
> quite a few new terms. Overall I think some form of dimensional reduction
> is called for, no?
