lucene-java-user mailing list archives

From Paul Libbrecht <p...@hoplahup.net>
Subject Re: Document Term matrix
Date Tue, 11 Nov 2014 21:41:10 GMT
The SemanticVectors project might do what you are looking for.
paul


On 11 nov. 2014, at 22:37, parnab kumar <parnab.2007@gmail.com> wrote:

> Hi,
> 
> While indexing the documents, store term vectors for the content
> field. Each document then has an array of terms and their
> corresponding frequencies in that document. You can retrieve these
> term vectors with the IndexReader. The similarity between two documents
> can be computed as the similarity of their term vectors. Since tf-idf is
> the best-known weighting and tends to give a better sense of similarity,
> multiply each term's frequency by its idf to re-weight the vectors.
> 
> Thanks,
> Parnab
> 
> On Tue, Nov 11, 2014 at 8:36 PM, Elshaimaa Ali <elshaimaa.ali@hotmail.com>
> wrote:
> 
>> Hi All,
>> I have a Lucene index built with Lucene 4.9 for 584 text documents. I need
>> to extract a document-term matrix and a document-document similarity
>> matrix in order to cluster the documents. My questions:
>> 1- How can I extract the matrix and compute the similarity between
>> documents in Lucene?
>> 2- Is there any Java-based code that can cluster the documents from a
>> Lucene index?
>> Regards,
>> Shaimaa
>> 
>> 
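
The re-weighting described in Parnab's reply can be sketched in plain Java. This is only an illustration under stated assumptions, not code from the thread: each document is represented as a map from term to raw frequency (in Lucene 4.x these counts would be read from the stored term vectors, for example via IndexReader.getTermVector(docId, field)), the idf variant 1 + ln(N / df) is one common choice among several, and cosine similarity is assumed as the vector similarity because the reply does not name one. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper; in a real setup the term-frequency maps would be
// filled from Lucene term vectors instead of being built by hand.
final class TfIdfSimilarity {

    /** Counts, for every term, how many documents contain it. */
    static Map<String, Integer> documentFrequencies(List<Map<String, Integer>> docs) {
        Map<String, Integer> df = new HashMap<>();
        for (Map<String, Integer> doc : docs) {
            for (String term : doc.keySet()) {
                df.merge(term, 1, Integer::sum);
            }
        }
        return df;
    }

    /**
     * Re-weights raw term frequencies with idf = 1 + ln(numDocs / df).
     * Assumes every term of the document has an entry in df.
     */
    static Map<String, Double> tfIdf(Map<String, Integer> termFreqs,
                                     Map<String, Integer> df,
                                     int numDocs) {
        Map<String, Double> weighted = new HashMap<>();
        for (Map.Entry<String, Integer> e : termFreqs.entrySet()) {
            double idf = 1.0 + Math.log((double) numDocs / df.get(e.getKey()));
            weighted.put(e.getKey(), e.getValue() * idf);
        }
        return weighted;
    }

    /** Cosine similarity of two sparse vectors; 0.0 if either is empty. */
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            normA += e.getValue() * e.getValue();
            Double other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        for (double v : b.values()) {
            normB += v * v;
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    /** Builds the symmetric document-document similarity matrix. */
    static double[][] similarityMatrix(List<Map<String, Integer>> docs) {
        Map<String, Integer> df = documentFrequencies(docs);
        List<Map<String, Double>> vectors = new ArrayList<>();
        for (Map<String, Integer> doc : docs) {
            vectors.add(tfIdf(doc, df, docs.size()));
        }
        int n = docs.size();
        double[][] sim = new double[n][n];
        for (int i = 0; i < n; i++) {
            for (int j = i; j < n; j++) {
                double s = cosine(vectors.get(i), vectors.get(j));
                sim[i][j] = s;
                sim[j][i] = s;
            }
        }
        return sim;
    }
}
```

For an index of 584 documents the resulting 584 x 584 matrix is small enough to hold in memory and could be handed to a clustering routine of your choice.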



