lucene-java-user mailing list archives

From Owen Densmore <o...@backspaces.net>
Subject Term Weights and Clustering
Date Wed, 23 Feb 2005 15:31:49 GMT
I'm building a TDM (Term Document Matrix) from my Lucene index.  As 
part of this, it would be useful to have the document term weights (the 
TF*IDF-weight) if they are already available.  Naturally I can compute 
them, but I suspect they are lurking behind an API I've not discovered 
yet.  Is there an API for getting them?
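
For reference, computing the weights by hand from the TDM is short.  The 
following is only a minimal sketch under stated assumptions, not a Lucene 
API: the class name TfIdfSketch, the counts[term][doc] layout and the 
particular variant (raw term frequency times ln(N/df) + 1) are all 
illustrative choices.  Lucene's own scoring may use a different formula, 
so these numbers need not match its internal weights.

```java
import java.util.Arrays;

// Hypothetical helper, not part of Lucene: turns raw term counts into
// TF*IDF weights using one common textbook variant.
final class TfIdfSketch {

    /**
     * counts[t][d] is the raw frequency of term t in document d.
     * Returns weights[t][d] = counts[t][d] * (ln(N / df(t)) + 1),
     * where N is the number of documents and df(t) is the number of
     * documents that contain term t at least once.
     */
    static double[][] weights(int[][] counts) {
        int numTerms = counts.length;
        int numDocs = numTerms == 0 ? 0 : counts[0].length;
        double[][] w = new double[numTerms][numDocs];
        for (int t = 0; t < numTerms; t++) {
            // Document frequency: how many documents contain term t.
            int df = 0;
            for (int d = 0; d < numDocs; d++) {
                if (counts[t][d] > 0) {
                    df++;
                }
            }
            if (df == 0) {
                continue; // term occurs nowhere; its row stays all zeros
            }
            double idf = Math.log((double) numDocs / df) + 1.0;
            for (int d = 0; d < numDocs; d++) {
                w[t][d] = counts[t][d] * idf;
            }
        }
        return w;
    }

    public static void main(String[] args) {
        // Two terms, two documents: term 0 is rare, term 1 is everywhere.
        int[][] counts = { {1, 0}, {2, 2} };
        for (double[] row : weights(counts)) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```

The rare term gets a larger weight per occurrence than the term that 
appears in every document, which is the property that matters for 
picking discriminating cluster labels.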

I'm doing this as a first step in discovering a good set of clustering 
labels.  My data collection is 1200 research papers, all of which have 
good metadata: titles, authors, abstracts, keyphrases and so on.

One source for how to do this is the thesis of Stanislaw Osinski and 
others like it:
     http://www.dcs.shef.ac.uk/teaching/eproj/msc2004/abs/m3so.htm
Another is the Carrot2 project, which uses similar techniques:
     http://www.cs.put.poznan.pl/dweiss/carrot/

My problem is simple: I need a fairly clear discussion on exactly how 
to generate the labels, and to assign documents to them.  The thesis is 
quite good, but I'm not sure I can reduce it to practice in the 2-3 
days I have to evaluate it!  Lucene has made the TDM easy to calculate, 
but I basically don't know what to do next!

Can anyone comment on whether or not this will work, and if so, suggest 
a quick way to get a demo on the air?  For example, I don't seem to be 
able to ask Carrot2 to do a Google "site" search.  If I could, I could 
simply aim Carrot2 at my collection with a very general search and see 
what clusters it discovers.  This may be a gross misuse of Carrot2's 
clustering anyway, so could easily be a blind alley.

Or is there a different stunt with Lucene that might work?  For 
example, use Lucene to cluster the docs using a batch search where the 
queries are Library of Congress descriptions!  Batch searching is 
*really fast* in Lucene -- I've been able to search the data collection 
against each distinct keyphrase in seconds!
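
To make the "fixed labels as queries" idea concrete, here is a rough 
sketch of the assignment step done directly on weight vectors rather 
than through Lucene searches.  Everything in it is an assumption for 
illustration: the class name LabelAssignSketch, the choice of cosine 
similarity, the threshold, and the layout (one vector per document and 
one per label, over the same term order).  It is not the method from 
the thesis, not Carrot2's algorithm and not Lucene's scoring.

```java
// Hypothetical helper: assigns each document to the most similar label,
// or to no label when nothing is similar enough.
final class LabelAssignSketch {

    /** Cosine similarity of two equal-length vectors; 0 if either is all zeros. */
    static double cosine(double[] a, double[] b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / Math.sqrt(normA * normB);
    }

    /**
     * docs[d] and labels[l] are term-weight vectors over the same terms.
     * Returns, for each document, the index of the best-scoring label whose
     * similarity is at least the threshold, or -1 if no label qualifies.
     * Ties go to the label with the lower index.
     */
    static int[] assign(double[][] docs, double[][] labels, double threshold) {
        int[] best = new int[docs.length];
        for (int d = 0; d < docs.length; d++) {
            int bestLabel = -1;
            double bestScore = -1.0;
            for (int l = 0; l < labels.length; l++) {
                double score = cosine(docs[d], labels[l]);
                if (score >= threshold && score > bestScore) {
                    bestScore = score;
                    bestLabel = l;
                }
            }
            best[d] = bestLabel;
        }
        return best;
    }

    public static void main(String[] args) {
        double[][] docs = { {1, 0, 0}, {0, 1, 1}, {0, 0, 0} };
        double[][] labels = { {1, 0, 0}, {0, 1, 0} };
        System.out.println(java.util.Arrays.toString(assign(docs, labels, 0.5)));
    }
}
```

Documents that match no label well end up unassigned (-1) instead of 
being forced into a cluster; whether that is desirable depends on how 
complete the label set is.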

Owen

