lucene-java-user mailing list archives

From Dawid Weiss
Subject Re: Term Weights and Clustering
Date Thu, 24 Feb 2005 11:51:54 GMT

Hi Owen,

I'm from the Carrot2 project, so I feel called to the blackboard:

> One source for how to do this is the thesis of Stanislaw Osinski and 
> others like it:
> And the Carrot2 project which uses similar techniques.

Staszek Osinski is the author of Lingo, the best clustering algorithm 
available in Carrot2 -- we still work together on that project... In 
other words, Carrot2 doesn't use 'similar' techniques. It uses _the_ 
techniques described in the above thesis (and in various other papers, 
see my Web page).

> My problem is simple: I need a fairly clear discussion on exactly how to 
> generate the labels, and to assign documents to them.  The thesis is 
> quite good, but I'm not sure I can reduce it to practice in the 2-3 days 
> I have to evaluate it!  Lucene has made the TDM easy to calculate, but I 
> basically don't know what to do next!

You can use Carrot2 directly for that. There are a few options. One 
is to feed your input collection directly to the clustering component 
(it will take a while, but should work) -- you need to write a custom 
input component, but it is a very simple thing to do and I'm sure that 
if you write to the Carrot2 mailing list there will be somebody willing 
to help (like myself or Staszek ;).

Another option is: use Lucene to index your documents. Set up Carrot2 to 
use Lucene (described somewhere on this list, see David Spencer's message).

> a quick way to get a demo on the air?  For example, I don't seem to be 
> able to ask Carrot2 to do a Google "site" search.  

Yep, there is a problem with it. Please post a bug report to the 
Carrot2 Bugzilla. I'll investigate it when I have time.

> simply aim Carrot2 at my collection with a very general search and see 
> what clusters it discovers.  This may be a gross misuse of Carrot2's 
> clustering anyway, so could easily be a blind alley.

It kind of is, because Carrot2's clustering components work primarily 
with _short_, scarce information sources, such as snippets. We don't 
intend to work on large, raw document collections... Having said that, 
1200 documents isn't that many and you should be able to get your 
clusters.
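
Since the clustering components expect snippet-sized input, one way to 
prepare longer documents is to cut each one down to a short snippet 
first. Below is a minimal sketch in plain Java; it uses no Carrot2 or 
Lucene API at all, and the names (Snippets, snippet, snippets) and the 
length limits are made up purely for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Carrot2 or Lucene.
final class Snippets {

    /**
     * Collapses runs of whitespace and cuts the text to at most maxChars
     * characters, preferring to cut at the last space before the limit.
     */
    static String snippet(String text, int maxChars) {
        String flat = text.trim().replaceAll("\\s+", " ");
        if (flat.length() <= maxChars) {
            return flat;
        }
        int cut = flat.lastIndexOf(' ', maxChars);
        if (cut <= 0) {
            cut = maxChars; // no usable word boundary, cut hard at the limit
        }
        return flat.substring(0, cut);
    }

    /** Applies snippet() to every document in the list. */
    static List<String> snippets(List<String> documents, int maxChars) {
        List<String> result = new ArrayList<>();
        for (String doc : documents) {
            result.add(snippet(doc, maxChars));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> docs = new ArrayList<>();
        docs.add("Lucene makes the term-document matrix easy to calculate");
        docs.add("Short   text\nwith odd   spacing");
        // An assumed limit of 20 characters, just to make the cut visible.
        for (String s : snippets(docs, 20)) {
            System.out.println(s);
        }
    }
}
```

The snippets produced this way would then be what the custom input 
component hands to the clustering component, instead of the raw 
documents. Whether the first N characters are a good summary of your 
documents is something you would have to check on your own collection.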

