lucene-java-user mailing list archives

From David Spencer <dave-lucene-u...@tropo.com>
Subject Re: Term Weights and Clustering
Date Wed, 23 Feb 2005 18:43:41 GMT
I'm a little confused about exactly what you want, but if your goal 
is to cluster your papers with Carrot2, then I found these links helpful:

http://www.newsarch.com/archive/mailinglist/jakarta/lucene/user/msg03928.html
http://www.cs.put.poznan.pl/dweiss/tmp/carrot2-lucene.zip

The only caveat is that, in my experience, Carrot2 tends not to scale 
beyond 200 or so docs, though this probably depends on the length of the 
docs and the number of distinct tokens.

I was able to use the above to integrate with a Lucene search results 
page in just an hour or so.


Owen Densmore wrote:

> I'm building a TDM (Term Document Matrix) from my lucene index.  As part 
> of this, it would be useful to have the document term weights (the 
> TF*IDF-weight) if they are already available.  Naturally I can compute 
> them, but I suspect they are lurking behind an API I've not discovered 
> yet.  Is there an API for getting them?
> 
> I'm doing this as a first step in discovering a good set of clustering 
> labels.  My data collection is 1200 research papers, all of which have 
> good metadata: titles, authors, abstracts, keyphrases and so on.
> 
> One source for how to do this is the thesis of Stanislaw Osinski and 
> others like it:
>     http://www.dcs.shef.ac.uk/teaching/eproj/msc2004/abs/m3so.htm
> And the Carrot2 project which uses similar techniques.
>     http://www.cs.put.poznan.pl/dweiss/carrot/
> 
> My problem is simple: I need a fairly clear discussion on exactly how to 
> generate the labels, and to assign documents to them.  The thesis is 
> quite good, but I'm not sure I can reduce it to practice in the 2-3 days 
> I have to evaluate it!  Lucene has made the TDM easy to calculate, but I 
> basically don't know what to do next!
> 
> Can anyone comment on whether or not this will work, and if so, suggest 
> a quick way to get a demo on the air?  For example, I don't seem to be 
> able to ask Carrot2 to do a Google "site" search.  If I could, I could 
> simply aim Carrot2 at my collection with a very general search and see 
> what clusters it discovers.  This may be a gross misuse of Carrot2's 
> clustering anyway, so could easily be a blind alley.
> 
> Or is there a different stunt with Lucene that might work?  For example, 
> use Lucene to cluster the docs using a batch search where the queries 
> are Library of Congress descriptions!  Batch searching is *really fast* 
> in Lucene -- I've been able to search the data collection against each 
> distinct keyphrase in seconds!
> 
> Owen
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-user-help@jakarta.apache.org
> 
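
On the TF*IDF question in the quoted message: below is a minimal Python
sketch of the textbook weighting, tf * log(N / df), computed from plain
token lists. It does not touch the Lucene API, and Lucene's own scoring may
use a different tf/idf variant, so treat the weights as illustrative only.
The function name tfidf_matrix and the sample documents are made up for
this example.

```python
import math
from collections import Counter


def tfidf_matrix(docs):
    """Build sparse TF*IDF weights for a term-document matrix.

    docs maps a document id to its list of tokens (hypothetical input;
    in practice the tokens would come from your own analysis step).
    Returns (terms, weights), where terms is the sorted vocabulary and
    weights maps doc id -> {term: weight}.
    """
    n_docs = len(docs)

    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for tokens in docs.values():
        df.update(set(tokens))

    terms = sorted(df)
    weights = {}
    for doc_id, tokens in docs.items():
        tf = Counter(tokens)
        # Textbook weighting: raw term frequency times log(N / df).
        weights[doc_id] = {t: tf[t] * math.log(n_docs / df[t]) for t in tf}
    return terms, weights


# Tiny made-up collection, just to show the call.
sample = {
    "paper1": ["lucene", "index", "index"],
    "paper2": ["lucene", "cluster"],
}
terms, weights = tfidf_matrix(sample)
for doc_id in sorted(weights):
    print(doc_id, weights[doc_id])
```

A term that occurs in every document gets weight 0 under this formula,
which is usually what you want when looking for cluster labels: such a
term cannot distinguish one group of papers from another.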



