mahout-user mailing list archives

From Ted Dunning <ted.dunn...@gmail.com>
Subject Re: Clustering a large crawl
Date Tue, 05 Jun 2012 07:59:27 GMT
Yes.  This indicates that you should be able to get good clustering from
that metric.
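
In case a sketch helps anyone following along: the query test can be run
entirely in memory with Mahout's Vector and CosineDistanceMeasure. The
docVectors map below is a hypothetical stand-in for TF-IDF vectors already
loaded from the seq2sparse output; the loading itself is left out.

import java.util.Map;
import java.util.TreeMap;

import org.apache.mahout.common.distance.CosineDistanceMeasure;
import org.apache.mahout.math.Vector;

public class NeighborCheck {
  // Print the k nearest documents to queryId under cosine distance,
  // i.e. the same measure you would hand to the clustering job.
  public static void printNearest(String queryId,
                                  Map<String, Vector> docVectors, int k) {
    CosineDistanceMeasure measure = new CosineDistanceMeasure();
    Vector query = docVectors.get(queryId);
    // TreeMap keeps entries sorted by distance; exact ties collide,
    // which is fine for an eyeball test.
    TreeMap<Double, String> nearest = new TreeMap<Double, String>();
    for (Map.Entry<String, Vector> e : docVectors.entrySet()) {
      if (e.getKey().equals(queryId)) {
        continue;
      }
      nearest.put(measure.distance(query, e.getValue()), e.getKey());
      if (nearest.size() > k) {
        nearest.remove(nearest.lastKey());  // drop the farthest
      }
    }
    for (Map.Entry<Double, String> e : nearest.entrySet()) {
      System.out.printf("%.4f  %s%n", e.getKey(), e.getValue());
    }
  }
}

If the top few neighbors for a handful of query documents look like genuine
near-duplicates or same-topic pages to you, the metric is doing its job.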

On Mon, Jun 4, 2012 at 10:58 PM, Pat Ferrel <pat@occamsmachete.com> wrote:

> I do have rowsimilarity calculated for each doc using the same measure as
> clustering, and that produces pretty good results as far as the eyeball takes
> you. I assume this is what you mean by using the doc as a query.
>
> On 6/4/12 9:14 AM, Ted Dunning wrote:
>
>> Even having millions of dimensions isn't all that bad if that induces a
>> reasonable distance between documents.  The easy way to test that is to
>> use
>> several document vectors as queries and see whether the closest other
>> documents appear to you to be very similar.  If this is true for a number
>> of documents, you should be good to go with whatever metric you are using.
>>
>> For fast clustering, you may need a low-dimensional surrogate metric so
>> that you can get higher throughput, but the point of the low-dimensional
>> surrogate is that it *replicates* the behavior of the metric that you
>> really want.  It isn't going to make your metric better.
>>
>> On Mon, Jun 4, 2012 at 5:15 PM, Pat Ferrel <pat@occamsmachete.com> wrote:
>>
>>> After looking again at the dictionary for 150,000 web pages I have 259,000
>>> dimensions! Part of the problem is I can't get Tika to detect language very
>>> well (working on this), so I get groups of non-English pages that throw in
>>> quite a few new terms. Overall I think some form of dimensional reduction
>>> is called for, no?
>>>
>>
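
On the dimensional reduction question quoted above: one cheap low-dimensional
surrogate is a random projection, which approximately preserves the distances
the full cosine/Euclidean metric reports (Johnson-Lindenstrauss), so it tends
to replicate the metric rather than change it. Mahout also ships a stochastic
SVD job (ssvd) for more principled reduction. The sketch below is not Mahout's
own reducer, just a plain Gaussian projection; a few hundred output dimensions
is a typical choice, and for an input this wide (259,000 terms) a sparse
projection matrix would save a lot of memory over the dense rows used here.

import java.util.Random;

import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.Vector;

public class RandomProjection {
  private final Vector[] basis;  // one dense Gaussian row per output dimension

  public RandomProjection(int inputDim, int outputDim, long seed) {
    Random rng = new Random(seed);
    basis = new Vector[outputDim];
    for (int i = 0; i < outputDim; i++) {
      double[] row = new double[inputDim];
      for (int j = 0; j < inputDim; j++) {
        row[j] = rng.nextGaussian();
      }
      basis[i] = new DenseVector(row);
    }
  }

  // Map a sparse TF-IDF document vector down to outputDim dimensions.
  public Vector project(Vector doc) {
    double[] out = new double[basis.length];
    for (int i = 0; i < basis.length; i++) {
      out[i] = basis[i].dot(doc);
    }
    return new DenseVector(out);
  }
}

Before clustering on the projected vectors, rerun the query test above on
them and check that the neighbor lists roughly match the full metric's.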
