Mailing-List: contact lucene-user-help@jakarta.apache.org; run by ezmlm
Precedence: bulk
Reply-To: "Lucene Users List" <lucene-user@jakarta.apache.org>
Received-SPF: pass (hermes.apache.org: local policy)
Message-ID: <421CCEDD.5020003@tropo.com>
Date: Wed, 23 Feb 2005 10:43:41 -0800
From: David Spencer <dave-lucene-user@tropo.com>
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US;
 rv:1.7.3) Gecko/20040910
MIME-Version: 1.0
To: Lucene Users List <lucene-user@jakarta.apache.org>
Subject: Re: Term Weights and Clustering
References: <0907B3BF-85B0-11D9-B227-000A95973046@backspaces.net>
In-Reply-To: <0907B3BF-85B0-11D9-B227-000A95973046@backspaces.net>
Content-Type: text/plain; charset=us-ascii; format=flowed
Content-Transfer-Encoding: 7bit

I'm a little confused on exactly, exactly what you want but if your goal 
is to cluster your papers w/ carrot2 then I found these links helpful:

http://www.newsarch.com/archive/mailinglist/jakarta/lucene/user/msg03928.html
http://www.cs.put.poznan.pl/dweiss/tmp/carrot2-lucene.zip

Only caveat is I found that carrot2 tends to not scale beyond 200 or so 
docs, though this probably depends on length of docs & the # of 
different tokens.

I was able to use the above to integ w/ a lucene search results page in 
just an hour or so.


Owen Densmore wrote:

> I'm building a TDM (Term Document Matrix) from my lucene index.  As part 
> of this, it would be useful to have the document term weights (the 
> TF*IDF-weight) if they are already available.  Naturally I can compute 
> them, but I suspect they are lurking behind an API I've not discovered 
> yet.  Is there an API for getting them?
> 
> I'm doing this as a first step in discovering a good set of clustering 
> labels.  My data collection is 1200 research papers, all of which have 
> good meta data: titles, authors, abstracts, keyphrases and so on.
> 
> One source for how to do this is the thesis of Stanislaw Osinski and 
> others like it:
>     http://www.dcs.shef.ac.uk/teaching/eproj/msc2004/abs/m3so.htm
> And the Carrot2 project which uses similar techniques.
>     http://www.cs.put.poznan.pl/dweiss/carrot/
> 
> My problem is simple: I need a fairly clear discussion on exactly how to 
> generate the labels, and to assign documents to them.  The thesis is 
> quite good, but I'm not sure I can reduce it to practice in the 2-3 days 
> I have to evaluate it!  Lucene has made the TDM easy to calculate, but I 
> basically don't know what to do next!
> 
> Can anyone comment on whether or not this will work, and if so, suggest 
> a quick way to get a demo on the air?  For example, I don't seem to be 
> able to ask Carrot2 to do a Google "site" search.  If I could, I could 
> simply aim Carrot2 at my collection with a very general search and see 
> what clusters it discovers.  This may be a gross misuse of Carrot2's 
> clustering anyway, so could easily be a blind alley.
> 
> Or is there a different stunt with Lucene that might work?  For example, 
> use Lucene to cluster the docs using a batch search where the queries 
> are Library of Congress descriptions!  Batch searching is *really fast* 
> in Lucene -- I've been able to search the data collection against each 
> distinct keyphrase in seconds!
> 
> Owen
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-user-help@jakarta.apache.org
> 


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org