lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From David Spencer <>
Subject Re: Term Weights and Clustering
Date Wed, 23 Feb 2005 18:43:41 GMT
I'm a little confused on exactly, exactly what you want but if your goal 
is to cluster your papers w/ carrot2 then I found these links helpful:

Only caveat is I found that carrot2 tends to not scale beyond 200 or so 
docs, though this probably depends on length of docs & the # of 
different tokens.

I was able to use the above to integ w/ a lucene search results page in 
just an hour or so.

Owen Densmore wrote:

> I'm building a TDM (Term Document Matrix) from my lucene index.  As part 
> of this, it would be useful to have the document term weights (the 
> TF*IDF-weight) if they are already available.  Naturally I can compute 
> them, but I suspect they are lurking behind an API I've not discovered 
> yet.  Is there an API for getting them?
> I'm doing this as a first step in discovering a good set of clustering 
> labels.  My data collection is 1200 research papers, all of which have 
> good meta data: titles, authors, abstracts, keyphrases and so on.
> One source for how to do this is the thesis of Stanislaw Osinski and 
> others like it:
> And the Carrot2 project which uses similar techniques.
> My problem is simple: I need a fairly clear discussion on exactly how to 
> generate the labels, and to assign documents to them.  The thesis is 
> quite good, but I'm not sure I can reduce it to practice in the 2-3 days 
> I have to evaluate it!  Lucene has made the TDM easy to calculate, but I 
> basically don't know what to do next!
> Can anyone comment on whether or not this will work, and if so, suggest 
> a quick way to get a demo on the air?  For example, I don't seem to be 
> able to ask Carrot2 to do a Google "site" search.  If I could, I could 
> simply aim Carrot2 at my collection with a very general search and see 
> what clusters it discovers.  This may be a gross misuse of Carrot2's 
> clustering anyway, so could easily be a blind alley.
> Or is there a different stunt with Lucene that might work?  For example, 
> use Lucene to cluster the docs using a batch search where the queries 
> are Library of Congress descriptions!  Batch searching is *really fast* 
> in Lucene -- I've been able to search the data collection against each 
> distinct keyphrase in seconds!
> Owen
> ---------------------------------------------------------------------
> To unsubscribe, e-mail:
> For additional commands, e-mail:

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message