Message-ID: <20041130171429.21139.qmail@web12702.mail.yahoo.com>
Date: Tue, 30 Nov 2004 09:14:29 -0800 (PST)
From: Otis Gospodnetic <otis_gospodnetic@yahoo.com>
Subject: Re: similarity matrix - more clear
To: Lucene Users List <lucene-user@jakarta.apache.org>
In-Reply-To: <41AC7E50.9040905@attentio.com>

Hello,

I don't think Lucene can spit out the similarity matrix for you, but
perhaps you can use Lucene's Term Vector support to help you build the
matrix yourself:
http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/index/TermFreqVector.html

The other relevant sections of the Lucene API to look at are:
http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/index/IndexReader.html#getTermFreqVectors(int)
http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/document/Field.html#Text(java.lang.String,%20java.io.Reader,%20boolean)
...

This should let you tell Lucene to compute and store term vectors
during indexing, and then you will be able to retrieve a Term Vector
for each Document in the index/collection.  Armed with this data you
should be able to compute the similarity between any two Documents as
the dot product or cosine of their term vectors, which should be enough
for you to build your similarity matrix.
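
A minimal sketch of that last step, in plain Java with no Lucene
dependency. It assumes you have already pulled the terms and their
frequencies out of each document's term vector (TermFreqVector exposes
them through getTerms() and getTermFrequencies()). The class and method
names below are made up for illustration, and weighting by raw term
frequency is an assumption; you may well prefer tf-idf weights. Since
the cosine is symmetric, only the upper triangle is computed and then
mirrored, so each pair of documents is compared once:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper class, not part of Lucene.
public class TermVectorSimilarity {

    // Converts parallel term/frequency arrays (the shape a term vector
    // hands back) into a term -> frequency map.
    public static Map<String, Integer> toMap(String[] terms, int[] freqs) {
        Map<String, Integer> vector = new HashMap<>();
        for (int i = 0; i < terms.length; i++) {
            vector.put(terms[i], freqs[i]);
        }
        return vector;
    }

    // Euclidean length of a frequency vector.
    public static double norm(Map<String, Integer> v) {
        double sumOfSquares = 0.0;
        for (int f : v.values()) {
            sumOfSquares += (double) f * f;
        }
        return Math.sqrt(sumOfSquares);
    }

    // Cosine of the angle between two raw term-frequency vectors.
    // Returns 0.0 if either vector is empty.
    public static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0.0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            Integer other = b.get(e.getKey());
            if (other != null) {
                dot += (double) e.getValue() * other;
            }
        }
        double lengths = norm(a) * norm(b);
        return lengths == 0.0 ? 0.0 : dot / lengths;
    }

    // Builds the full symmetric matrix, computing each pair only once.
    public static double[][] similarityMatrix(List<Map<String, Integer>> docs) {
        int n = docs.size();
        double[][] sim = new double[n][n];
        for (int i = 0; i < n; i++) {
            // A non-empty document is identical to itself.
            sim[i][i] = docs.get(i).isEmpty() ? 0.0 : 1.0;
            for (int j = i + 1; j < n; j++) {
                double s = cosine(docs.get(i), docs.get(j));
                sim[i][j] = s;
                sim[j][i] = s;
            }
        }
        return sim;
    }

    public static void main(String[] args) {
        // Toy data standing in for three documents' term vectors.
        List<Map<String, Integer>> docs = new ArrayList<>();
        docs.add(toMap(new String[] {"index", "lucene"}, new int[] {1, 2}));
        docs.add(toMap(new String[] {"lucene", "search"}, new int[] {1, 1}));
        docs.add(toMap(new String[] {"cluster"}, new int[] {3}));

        for (double[] row : similarityMatrix(docs)) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```

For a large collection the n x n matrix itself becomes the bottleneck,
so you may want to write rows out as you go, or keep only the pairs
above some similarity threshold.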

This sounds like something that would be nice to have in the Lucene
Sandbox, so if you end up with some code that you are allowed to share,
please contribute it back to Lucene.

Otis

--- Roxana Angheluta <roxana@attentio.com> wrote:

> Dear all,
> 
> Yesterday I asked a question about getting the similarity matrix of a
> collection of documents from an index, but I got only one answer, so
> perhaps my question was not very clear.
> 
> I will try to reformulate:
> 
> I want to use Lucene to have efficient access to an index of a
> collection of documents. My final purpose is to cluster documents.
> Therefore I need, for each pair of documents, a number signifying
> the similarity between them.
> A possible solution would be to use each document in turn as a
> query, run a search with an IndexSearcher, and take from the search
> result the similarity between the query (which is in fact a document)
> and all the other documents. This is highly redundant, because the
> similarity between a pair of documents is computed multiple times.
> 
> I was wondering whether there is a simpler way to do it, since the
> index file contains all the information needed. Can anyone help me here?
> 
> Thanks,
> roxana
> 
> PS I know about the Carrot2 project, which deals with document
> clustering, but I think it is not appropriate for me, for 2 reasons:
> 1) I need to keep the index on disk for further reuse
> 2) I need to be able to search the index efficiently
> I thought Lucene could help me here; am I wrong?
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-user-help@jakarta.apache.org
> 
> 

