From: Otis Gospodnetic
To: Lucene Users List
Subject: Re: similarity matrix - more clear
Date: Tue, 30 Nov 2004 09:14:29 -0800 (PST)
Message-ID: <20041130171429.21139.qmail@web12702.mail.yahoo.com>
In-Reply-To: <41AC7E50.9040905@attentio.com>

Hello,

I don't think Lucene can spit out the similarity matrix for you, but
perhaps you can use Lucene's Term Vector support to help you build the
matrix yourself:

http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/index/TermFreqVector.html

The other relevant sections of the Lucene API to look at are:

http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/index/IndexReader.html#getTermFreqVectors(int)
http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/document/Field.html#Text(java.lang.String,%20java.io.Reader,%20boolean)
...

This should let you tell Lucene to compute and store term vectors
during indexing, and then you will be able to retrieve a Term Vector
for each Document in the index/collection. Armed with this data you
should be able to compute similarities between Documents with TV dot
products/cosines, which should be enough for you to build your
similarity matrix.

This sounds like something that would be nice to have in the Lucene
Sandbox, so if you end up with some code that you are allowed to share,
please contribute it back to Lucene.
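To make the dot product/cosine step concrete, here is a minimal sketch in
plain Java. It assumes you have already pulled, for each document, the
term array and the parallel frequency array out of its term vector (the
shapes returned by TermFreqVector.getTerms() and
TermFreqVector.getTermFrequencies()). The class and method names below are
hypothetical and not part of Lucene, and the weighting is raw term
frequency only (no IDF), which is an assumption you may well want to
change.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, NOT part of Lucene: works on plain arrays so it
// can be fed from TermFreqVector.getTerms()/getTermFrequencies().
final class TermVectorSimilarity {

    private TermVectorSimilarity() {}

    /**
     * Cosine similarity between two raw term-frequency vectors.
     * Assumes each term appears at most once per array, as in a term vector.
     */
    static double cosine(String[] termsA, int[] freqsA,
                         String[] termsB, int[] freqsB) {
        // Index document A by term for constant-time lookups.
        Map<String, Integer> a = new HashMap<>();
        double normA = 0.0;
        for (int i = 0; i < termsA.length; i++) {
            a.put(termsA[i], freqsA[i]);
            normA += (double) freqsA[i] * freqsA[i];
        }

        // Walk document B, accumulating the dot product over shared terms.
        double dot = 0.0;
        double normB = 0.0;
        for (int j = 0; j < termsB.length; j++) {
            Integer fa = a.get(termsB[j]);
            if (fa != null) {
                dot += (double) fa * freqsB[j];
            }
            normB += (double) freqsB[j] * freqsB[j];
        }

        // An empty document has no direction; define its similarity as 0.
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    /**
     * Builds the symmetric similarity matrix. Each pair is computed once
     * and mirrored, which avoids the redundancy of running every
     * document as a query against all the others.
     */
    static double[][] similarityMatrix(String[][] terms, int[][] freqs) {
        int n = terms.length;
        double[][] sim = new double[n][n];
        for (int i = 0; i < n; i++) {
            for (int j = i; j < n; j++) {
                double s = cosine(terms[i], freqs[i], terms[j], freqs[j]);
                sim[i][j] = s;
                sim[j][i] = s;
            }
        }
        return sim;
    }
}
```

The matrix is dense, so memory grows with the square of the number of
documents; for a large collection you would probably keep only the top
few neighbours per document instead.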
Otis

--- Roxana Angheluta wrote:

> Dear all,
>
> Yesterday I asked a question about getting the similarity matrix of a
> collection of documents from an index, but I got only one answer, so
> perhaps my question was not very clear.
>
> I will try to reformulate:
>
> I want to use Lucene to have efficient access to an index of a
> collection of documents. My final purpose is to cluster documents.
> Therefore I need to have, for each pair of documents, a number
> signifying the similarity between them.
> A possible solution would be to use each document in turn as a query,
> do a search using an IndexSearcher, and take from the search result
> the similarity between the query (which is in fact a document) and all
> the other documents. This is highly redundant, because the similarity
> between a pair of documents is computed multiple times.
>
> I was wondering whether there is a simpler way to do it, since the
> index file contains all the information needed. Can anyone help me
> here?
>
> Thanks,
> roxana
>
> PS I know about the project Carrot2, which deals with document
> clustering, but I think it is not appropriate for me for 2 reasons:
> 1) I need to keep the index on the disk for further reuse
> 2) I need to be able to search efficiently in the index
> I thought Lucene could help me here, am I wrong?
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-user-help@jakarta.apache.org