From: Grant Ingersoll
Subject: Re: MoreLikeThis for multiple documents
Date: Thu, 26 Jul 2007 11:23:00 -0400
To: java-user@lucene.apache.org
In-Reply-To: <46A79EBF.4070500@grivolla.net>

I have some sample code for doing relevance feedback across multiple documents at
http://www.cnlp.org/apachecon2005

It could be modified to provide more of the MoreLikeThis functionality (i.e. determining important terms via tf/idf); for now it just takes the top X terms.

-Grant

On Jul 25, 2007, at 3:04 PM, Jens Grivolla wrote:

> Hello,
>
> I'm looking to extract significant terms characterizing a set of
> documents (which in turn relate to a topic).
>
> This basically comes down to functionality similar to determining
> the terms with the greatest offer weight (as used for blind
> relevance feedback), or maximizing tf.idf (as is done in
> MoreLikeThis).
>
> Is there anything like this already implemented, or do I need to
> iterate through all documents in the set "manually", re-tokenize
> each one (or maybe use TermVectors), and then calculate the weight
> for each term?
>
> Thanks,
> Jens
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
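
As a rough illustration of the "manual" approach described in the question above (tokenize each document in the set, sum term frequencies, weight by idf), here is a minimal sketch in plain Java. It does not use any Lucene API and is not the sample code from the link: the class name SignificantTerms, the methods tokenize and topTerms, the regex tokenizer, and the smoothed idf of the form log(N / (df + 1)) + 1 are all assumptions made for the example. In a real index you would presumably read frequencies from the index or from term vectors instead of re-tokenizing strings.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of Lucene: ranks the terms of a document set by tf * idf.
final class SignificantTerms {

    // Lowercases and splits on non-alphanumerics; a stand-in for a real analyzer.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase().split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Returns up to n terms from topicDocs, ordered by descending weight, where
    // weight = (term frequency summed over topicDocs) * (idf over corpus).
    // Ties are broken alphabetically so the result is deterministic.
    static List<String> topTerms(List<String> topicDocs, List<String> corpus, int n) {
        // Document frequency: in how many corpus documents each term occurs.
        Map<String, Integer> df = new HashMap<>();
        for (String doc : corpus) {
            Set<String> unique = new HashSet<>(tokenize(doc));
            for (String term : unique) {
                df.merge(term, 1, Integer::sum);
            }
        }

        // Term frequency summed across the whole topic set.
        Map<String, Integer> tf = new HashMap<>();
        for (String doc : topicDocs) {
            for (String term : tokenize(doc)) {
                tf.merge(term, 1, Integer::sum);
            }
        }

        // Assumed weighting: tf * (log(N / (df + 1)) + 1).
        Map<String, Double> weight = new HashMap<>();
        for (Map.Entry<String, Integer> e : tf.entrySet()) {
            int docFreq = df.getOrDefault(e.getKey(), 0);
            double idf = Math.log((double) corpus.size() / (docFreq + 1)) + 1.0;
            weight.put(e.getKey(), e.getValue() * idf);
        }

        List<String> terms = new ArrayList<>(weight.keySet());
        terms.sort(Comparator.comparingDouble((String t) -> weight.get(t))
                .reversed()
                .thenComparing(Comparator.<String>naturalOrder()));
        return new ArrayList<>(terms.subList(0, Math.min(n, terms.size())));
    }

    public static void main(String[] args) {
        List<String> topic = Arrays.asList("the lucene index", "the lucene search");
        List<String> corpus = Arrays.asList(
                "the lucene index", "the lucene search",
                "the cat sat", "the dog ran", "the bird flew", "the fish");
        System.out.println(topTerms(topic, corpus, 3));
    }
}
```

In this toy corpus "the" occurs in every document, so its idf is low and it ranks below the topic-specific terms even though its raw frequency in the set matches that of "lucene". Taking the top X terms by raw frequency alone, with no idf factor, would not make that distinction.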