From: David Spencer
To: Lucene Users List
Date: Wed, 23 Feb 2005 10:43:41 -0800
Subject: Re: Term Weights and Clustering

I'm a little confused about exactly what you want, but if your goal is to
cluster your papers w/ carrot2 then I found these links helpful:

http://www.newsarch.com/archive/mailinglist/jakarta/lucene/user/msg03928.html
http://www.cs.put.poznan.pl/dweiss/tmp/carrot2-lucene.zip

The only caveat is that I found carrot2 tends not to scale beyond 200 or so
docs, though this probably depends on the length of the docs & the # of
different tokens.

I was able to use the above to integrate w/ a lucene search results page in
just an hour or so.

Owen Densmore wrote:

> I'm building a TDM (Term Document Matrix) from my lucene index. As part
> of this, it would be useful to have the document term weights (the
> TF*IDF-weight) if they are already available. Naturally I can compute
> them, but I suspect they are lurking behind an API I've not discovered
> yet. Is there an API for getting them?
>
> I'm doing this as a first step in discovering a good set of clustering
> labels. My data collection is 1200 research papers, all of which have
> good meta data: titles, authors, abstracts, keyphrases and so on.
>
> One source for how to do this is the thesis of Stanislaw Osinski and
> others like it:
> http://www.dcs.shef.ac.uk/teaching/eproj/msc2004/abs/m3so.htm
> And the Carrot2 project which uses similar techniques.
> http://www.cs.put.poznan.pl/dweiss/carrot/
>
> My problem is simple: I need a fairly clear discussion on exactly how to
> generate the labels, and to assign documents to them. The thesis is
> quite good, but I'm not sure I can reduce it to practice in the 2-3 days
> I have to evaluate it! Lucene has made the TDM easy to calculate, but I
> basically don't know what to do next!
>
> Can anyone comment on whether or not this will work, and if so, suggest
> a quick way to get a demo on the air? For example, I don't seem to be
> able to ask Carrot2 to do a Google "site" search. If I could, I could
> simply aim Carrot2 at my collection with a very general search and see
> what clusters it discovers. This may be a gross misuse of Carrot2's
> clustering anyway, so could easily be a blind alley.
>
> Or is there a different stunt with Lucene that might work? For example,
> use Lucene to cluster the docs using a batch search where the queries
> are Library of Congress descriptions! Batch searching is *really fast*
> in Lucene -- I've been able to search the data collection against each
> distinct keyphrase in seconds!
>
> Owen
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-user-help@jakarta.apache.org
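
On the term-weight question in the quoted message: here is a minimal sketch, in plain Python rather than against the Lucene API, of how TF*IDF weights for a term-document matrix can be computed once per-document token lists are available. The function name tfidf_matrix, the toy documents, and the weighting variant (raw term count times ln(N/df)) are all assumptions for illustration. Lucene's scoring applies its own variant of these factors, so the numbers below should not be expected to match Lucene scores.

```python
import math
from collections import Counter


def tfidf_matrix(docs):
    """Build a sparse TF*IDF term-document matrix.

    docs: dict mapping a document id to its list of tokens (hypothetical input;
          in practice the tokens would come from the index or an analyzer).
    Returns (terms, weights), where terms is the sorted vocabulary and
    weights[doc_id][term] is the term count times ln(N / document frequency).
    """
    n_docs = len(docs)

    # Term frequency: raw count of each term within each document.
    term_counts = {doc_id: Counter(tokens) for doc_id, tokens in docs.items()}

    # Document frequency: number of documents containing each term.
    doc_freq = Counter()
    for counts in term_counts.values():
        doc_freq.update(counts.keys())

    # Inverse document frequency; a term present in every document gets 0.
    idf = {term: math.log(n_docs / df) for term, df in doc_freq.items()}

    weights = {
        doc_id: {term: tf * idf[term] for term, tf in counts.items()}
        for doc_id, counts in term_counts.items()
    }
    return sorted(doc_freq), weights


if __name__ == "__main__":
    toy_docs = {
        "paper1": ["lucene", "index", "lucene"],
        "paper2": ["index", "cluster"],
    }
    terms, weights = tfidf_matrix(toy_docs)
    for doc_id in sorted(weights):
        # One matrix row per document, columns in vocabulary order.
        print(doc_id, [round(weights[doc_id].get(t, 0.0), 3) for t in terms])
```

Terms that occur in every document end up with weight zero under this variant, which is usually what you want when the weights feed a clustering step, since such terms do not help separate documents.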