From: Roxana Angheluta
Date: Tue, 30 Nov 2004 15:06:08 +0100
To: lucene-user@jakarta.apache.org
Subject: similarity matrix - more clear
Message-ID: <41AC7E50.9040905@attentio.com>

Dear all,

Yesterday I asked a question about getting the similarity matrix of a collection of documents from an index, but I got only one answer, so perhaps my question was not very clear. I will try to reformulate it.

I want to use Lucene to have efficient access to an index of a collection of documents. My final goal is to cluster the documents, so for each pair of documents I need a number expressing the similarity between them.

A possible solution would be to use each document in turn as a query, run a search with an IndexSearcher, and take from the search results the similarity between the query (which is in fact a document) and all the other documents. This is highly redundant, because the similarity between each pair of documents is computed more than once.

I was wondering whether there is a simpler way to do it, since the index already contains all the information needed. Can anyone help me here?

Thanks,
roxana

PS: I know about the Carrot2 project, which deals with document clustering, but I think it is not appropriate for me, for two reasons:

1) I need to keep the index on disk for further reuse
2) I need to be able to search the index efficiently

I thought Lucene could help me here. Am I wrong?
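
To make the redundancy point concrete, here is a minimal sketch of computing each pairwise similarity only once and mirroring it into a symmetric matrix. It is an illustration under assumptions, not Lucene's own scoring: it assumes the per-document term frequencies have already been read out of the index (for instance from stored term vectors, if the index was built with them) into plain maps, and it uses cosine similarity on raw term counts as a stand-in measure. The class name SimilarityMatrix and its methods are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SimilarityMatrix {

    // Euclidean length of a sparse term-frequency vector.
    public static double norm(Map<String, Integer> v) {
        double sum = 0.0;
        for (int f : v.values()) {
            sum += (double) f * f;
        }
        return Math.sqrt(sum);
    }

    // Cosine similarity between two sparse term-frequency vectors.
    // Returns 0 when either vector is empty.
    public static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0.0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            Integer other = b.get(e.getKey());
            if (other != null) {
                dot += (double) e.getValue() * other;
            }
        }
        double na = norm(a);
        double nb = norm(b);
        return (na == 0.0 || nb == 0.0) ? 0.0 : dot / (na * nb);
    }

    // Builds the full matrix while visiting each unordered pair (i, j) once.
    public static double[][] similarityMatrix(List<Map<String, Integer>> docs) {
        int n = docs.size();
        double[][] sim = new double[n][n];
        for (int i = 0; i < n; i++) {
            sim[i][i] = 1.0; // assumption: a document is fully similar to itself
            for (int j = i + 1; j < n; j++) {
                double s = cosine(docs.get(i), docs.get(j));
                sim[i][j] = s;
                sim[j][i] = s; // mirror instead of recomputing
            }
        }
        return sim;
    }

    public static void main(String[] args) {
        // Toy term-frequency vectors standing in for data read from an index.
        Map<String, Integer> d0 = new HashMap<>();
        d0.put("lucene", 1);
        d0.put("index", 1);
        Map<String, Integer> d1 = new HashMap<>();
        d1.put("lucene", 2);
        d1.put("cluster", 1);
        Map<String, Integer> d2 = new HashMap<>();
        d2.put("cluster", 3);

        List<Map<String, Integer>> docs = new ArrayList<>();
        docs.add(d0);
        docs.add(d1);
        docs.add(d2);

        double[][] sim = similarityMatrix(docs);
        for (double[] row : sim) {
            StringBuilder line = new StringBuilder();
            for (double s : row) {
                line.append(String.format("%.3f ", s));
            }
            System.out.println(line.toString().trim());
        }
    }
}
```

For n documents this performs n(n-1)/2 similarity computations instead of the n(n-1) that result from running every document as a query against all the others. The measure itself can be swapped for any symmetric one without changing the loop structure.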