lucene-java-user mailing list archives

From "Eric Jain"
Subject Re: Document Clustering
Date Wed, 12 Nov 2003 09:03:20 GMT
> I was basically thinking of using lucene to generate document
> vectors, and writing my custom similarity algorithms for measuring
> distance.
> I could then run this data through k-means or SOM algorithms for
> calculating clusters

First of all, I think it would already be great if there were some
way to simply store document vectors during the indexing process, so
that you could later call something like

  IndexSearcher.docTerms(int i)

to retrieve either a BitSet (which terms occur in document i) or an
array of floats, weighted so that terms that are frequent across the
collection get lower values.
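
To make the weighting idea concrete, here is a rough sketch in plain
Java that does not touch Lucene at all. The class and method names
(DocVectorSketch, weightedVectors) are made up for illustration, the
documents are assumed to be already tokenized, and the weight used,
tf * (1 + ln(N / df)), is just one common choice, not something Lucene
provides under this name.

```java
import java.util.*;

// Hypothetical helper, not part of Lucene: builds one weighted vector per document.
class DocVectorSketch {

    /** Returns one float[] per document; column i belongs to the i-th term in sorted order. */
    static float[][] weightedVectors(List<List<String>> docs) {
        // Give every distinct term a column, sorted so the layout is deterministic.
        List<String> terms = new ArrayList<>(new TreeSet<>(flatten(docs)));
        Map<String, Integer> column = new HashMap<>();
        for (int i = 0; i < terms.size(); i++) {
            column.put(terms.get(i), i);
        }

        // Document frequency: in how many documents does each term occur?
        int[] df = new int[terms.size()];
        for (List<String> doc : docs) {
            for (String term : new HashSet<>(doc)) {
                df[column.get(term)]++;
            }
        }

        // Each occurrence adds 1 + ln(N / df), so terms that are frequent
        // across the collection contribute less per occurrence.
        float[][] vectors = new float[docs.size()][terms.size()];
        for (int d = 0; d < docs.size(); d++) {
            for (String term : docs.get(d)) {
                int c = column.get(term);
                vectors[d][c] += (float) (1.0 + Math.log((double) docs.size() / df[c]));
            }
        }
        return vectors;
    }

    private static List<String> flatten(List<List<String>> docs) {
        List<String> all = new ArrayList<>();
        for (List<String> doc : docs) {
            all.addAll(doc);
        }
        return all;
    }
}
```

Vectors like these could then be handed to whatever distance measure
and clustering code you write yourself.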

One difficulty I see here is that terms don't seem to have any unique
identifiers, so I guess you'd have to manage those yourself...
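
Managing the identifiers yourself could look roughly like the
following. Again this is only a sketch under my own assumptions:
TermIds and its methods are hypothetical names, nothing here comes
from Lucene, and the ids are only stable for as long as you keep this
one dictionary around.

```java
import java.util.*;

// Hypothetical term dictionary: hands out consecutive int ids on first sight.
class TermIds {
    private final Map<String, Integer> ids = new HashMap<>();
    private final List<String> terms = new ArrayList<>();

    /** Returns the id for a term, assigning the next free id if the term is new. */
    int idOf(String term) {
        Integer id = ids.get(term);
        if (id == null) {
            id = terms.size();
            ids.put(term, id);
            terms.add(term);
        }
        return id;
    }

    /** Reverse lookup, e.g. for labelling clusters afterwards. */
    String termOf(int id) {
        return terms.get(id);
    }

    /** Encodes the set of terms in one document as a BitSet over term ids. */
    BitSet toBitSet(Iterable<String> docTerms) {
        BitSet bits = new BitSet();
        for (String term : docTerms) {
            bits.set(idOf(term));
        }
        return bits;
    }
}
```

The BitSet form only records presence or absence, which is enough for
set-based distances; for weighted distances you would index a float
array by the same ids.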

Eric Jain
