lucene-java-user mailing list archives

From "Eric Jain" <Eric.J...@isb-sib.ch>
Subject Re: Document Clustering
Date Wed, 12 Nov 2003 09:03:20 GMT
> I was basically thinking of using lucene to generate document
> vectors, and writing my custom similarity algorithms for measuring
> distance.
>
> I could then run this data through k-means or SOM algorithms for
> calculating clusters

First of all, I think it would already be great if there were some
functionality for simply storing document vectors during the indexing
process, so you could later use something like

  IndexSearcher.docTerms(int i)

to retrieve a BitSet or an array of floats that are weighted so that
frequent terms have lower values.
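
To make the weighting concrete, here is a rough sketch in plain Java of
what such vectors could look like. It does not use any Lucene API: the
class DocVectors and its method weigh are made-up names, documents are
assumed to be already tokenized into lists of terms, and tf * idf is
just one common way to give frequent terms lower values.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Lucene.
public class DocVectors {

    /** Builds one sparse term -> weight map per document using tf * idf. */
    public static List<Map<String, Double>> weigh(List<List<String>> docs) {
        // Document frequency: the number of documents each term occurs in.
        Map<String, Integer> df = new HashMap<>();
        for (List<String> doc : docs) {
            for (String term : new HashSet<>(doc)) {
                df.merge(term, 1, Integer::sum);
            }
        }

        List<Map<String, Double>> vectors = new ArrayList<>();
        for (List<String> doc : docs) {
            // Raw term frequency within this document.
            Map<String, Double> vec = new HashMap<>();
            for (String term : doc) {
                vec.merge(term, 1.0, Double::sum);
            }
            // Scale by log(N / df): a term that occurs in every document
            // ends up with weight 0, rare terms get the largest factor.
            for (Map.Entry<String, Double> e : vec.entrySet()) {
                double idf = Math.log((double) docs.size() / df.get(e.getKey()));
                e.setValue(e.getValue() * idf);
            }
            vectors.add(vec);
        }
        return vectors;
    }
}
```

The resulting maps are sparse, so a custom distance measure only has to
look at the terms that actually occur in the two documents compared.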

One difficulty I see here is that terms don't seem to have any unique
identifiers; I guess you'd have to manage those yourself...
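
Managing the identifiers yourself could be as simple as the sketch
below. Again this is plain Java rather than anything Lucene provides:
TermDictionary, idOf, termOf and toBitSet are hypothetical names, and
ids are simply handed out in the order terms are first seen.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Lucene.
public class TermDictionary {
    private final Map<String, Integer> ids = new HashMap<>();
    private final List<String> terms = new ArrayList<>();

    /** Returns the id of a term, assigning the next free id on first sight. */
    public int idOf(String term) {
        Integer id = ids.get(term);
        if (id == null) {
            id = terms.size();
            ids.put(term, id);
            terms.add(term);
        }
        return id;
    }

    /** Reverse lookup: the term that was assigned the given id. */
    public String termOf(int id) {
        return terms.get(id);
    }

    /** Number of distinct terms seen so far. */
    public int size() {
        return terms.size();
    }

    /** Encodes a document's terms as a BitSet indexed by term id. */
    public BitSet toBitSet(Collection<String> docTerms) {
        BitSet bits = new BitSet();
        for (String term : docTerms) {
            bits.set(idOf(term));
        }
        return bits;
    }
}
```

The ids are only stable as long as the same dictionary instance is used
for every document, so it would have to be stored alongside the index.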

--
Eric Jain



