lucene-java-user mailing list archives

From Dawid Weiss <dawid.we...@cs.put.poznan.pl>
Subject Re: Term Weights and Clustering
Date Thu, 24 Feb 2005 11:51:54 GMT

Hi Owen,

I'm from the Carrot2 project, so I feel called to the blackboard:

> One source for how to do this is the thesis of Stanislaw Osinski and 
> others like it:
>     http://www.dcs.shef.ac.uk/teaching/eproj/msc2004/abs/m3so.htm
> And the Carrot2 project which uses similar techniques.
>     http://www.cs.put.poznan.pl/dweiss/carrot/

Staszek Osinski is the author of Lingo, the best clustering algorithm 
available in Carrot2 -- we still work together on that project... In 
other words, Carrot2 doesn't use 'similar' techniques. It uses _the_ 
techniques described in the above thesis (and in various other papers; 
see my Web page).

> My problem is simple: I need a fairly clear discussion on exactly how to 
> generate the labels, and to assign documents to them.  The thesis is 
> quite good, but I'm not sure I can reduce it to practice in the 2-3 days 
> I have to evaluate it!  Lucene has made the TDM easy to calculate, but I 
> basically don't know what to do next!

You can use Carrot2 directly for that. There are a few options. One is 
to feed your input collection directly to the clustering component (it 
will take a while, but should work) -- you need to write a custom input 
component, but that is a very simple thing to do, and I'm sure if you 
write to the Carrot2 mailing list there will be somebody willing to 
help (like myself or Staszek ;).
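
For readers wondering what "assigning documents to labels" can look 
like once labels exist, here is a minimal sketch. It is not Carrot2 
code and not the Lingo implementation: it assumes the labels have 
already been found by some other step (label discovery is omitted 
entirely) and uses plain term-frequency vectors with cosine similarity. 
The function names and the 0.2 threshold are made up for illustration.

```python
import math
from collections import Counter


def tf_vector(text):
    """Bag-of-words term-frequency vector (lowercased, whitespace split)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def assign(labels, docs, threshold=0.2):
    """Attach each document to every label it is similar enough to.

    `labels` is a list of label strings, `docs` maps a document id to
    its text. The threshold is an arbitrary illustrative value. A
    document may land in several clusters, or in none.
    """
    clusters = {label: [] for label in labels}
    label_vectors = {label: tf_vector(label) for label in labels}
    for doc_id, text in docs.items():
        doc_vector = tf_vector(text)
        for label in labels:
            if cosine(label_vectors[label], doc_vector) >= threshold:
                clusters[label].append(doc_id)
    return clusters


if __name__ == "__main__":
    labels = ["term weights", "document clustering"]
    docs = {
        1: "lucene term weights and scoring",
        2: "clustering search results document clustering",
        3: "unrelated cooking recipe",
    }
    print(assign(labels, docs))
```

In practice the vectors would come from the term-document matrix you 
already have from Lucene rather than from whitespace splitting, and 
the threshold would need tuning on your own collection.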

Another option is to index your documents with Lucene and set up 
Carrot2 to use Lucene (described somewhere on this list; see David 
Spencer's message).

> a quick way to get a demo on the air?  For example, I don't seem to be 
> able to ask Carrot2 to do a Google "site" search.  

Yep, there is a problem with it. Please post a bug report to the 
Carrot2 Bugzilla. I'll investigate it when I have time.

> simply aim Carrot2 at my collection with a very general search and see 
> what clusters it discovers.  This may be a gross misuse of Carrot2's 
> clustering anyway, so could easily be a blind alley.

It kind of is, because the Carrot2 clustering components work primarily 
with _short_, sparse information sources, such as snippets. We don't 
intend to work on large collections of raw documents... Having said 
that, 1200 documents isn't that many and you should be able to get your 
clusters.
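
One way to bring full documents closer to the short inputs mentioned 
above is to cut each one down to a title plus its leading words before 
clustering. This is only an assumption about a workable preprocessing 
step, not something Carrot2 prescribes; the helper name and the 
40-word default are hypothetical.

```python
def make_snippet(title, body, max_words=40):
    """Reduce a document to its title plus the first `max_words` words.

    An ellipsis marks a truncated body. Both the helper and the default
    length are illustrative choices, not part of any library.
    """
    words = body.split()
    snippet = " ".join(words[:max_words])
    if len(words) > max_words:
        snippet += " ..."
    return f"{title}. {snippet}" if title else snippet


if __name__ == "__main__":
    print(make_snippet("Term Weights", "a b c d e", max_words=3))
```

Whether leading text is a good summary depends on the collection; for 
some documents a query-focused excerpt would represent the content 
better.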

D.
