lucene-java-user mailing list archives

From Paul Libbrecht <>
Subject Re: Document Term matrix
Date Tue, 11 Nov 2014 21:41:10 GMT
The semanticvectors project might do what you are looking for.

On 11 nov. 2014, at 22:37, parnab kumar <> wrote:

> Hi,
> While indexing the documents, store the term vectors for the content
> field. Each document then has an array of terms and their corresponding
> frequencies in that document. You can retrieve these term vectors
> through the IndexReader. The similarity between two documents can be
> computed as the similarity of their term vectors. Since tf-idf is the best
> known weighting and tends to give a better sense of similarity, simply
> multiply each term's frequency by its idf to reweight the vectors.
> Thanks,
> Parnab
> On Tue, Nov 11, 2014 at 8:36 PM, Elshaimaa Ali <> wrote:
>> Hi all,
>> I have a Lucene index, built with Lucene 4.9, of 584 text documents. I need
>> to extract a document-term matrix and a document-document similarity matrix
>> in order to cluster the documents. My questions:
>> 1- How can I extract the matrix and compute the similarity between
>> documents in Lucene?
>> 2- Is there any Java-based code that can cluster the documents from a
>> Lucene index?
>> Regards,
>> Shaimaa
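
For reference, the tf-idf reweighting and term-vector similarity described in the quoted reply can be sketched in plain Java. This is only a minimal sketch under stated assumptions, not code from the thread: each document is assumed to be already available as a map from term to raw frequency (in Lucene 4.x such a map could presumably be filled by iterating the Terms returned by IndexReader.getTermVector(docId, field)). The class name TfIdfSimilarity, its method names, the idf variant ln(N / df) + 1, and the choice of cosine as the similarity measure are all illustrative assumptions.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: tf-idf weighting and cosine similarity over term-frequency maps. */
final class TfIdfSimilarity {

    /** Counts, for each term, how many documents contain it. */
    static Map<String, Integer> documentFrequencies(List<Map<String, Integer>> docs) {
        Map<String, Integer> df = new HashMap<>();
        for (Map<String, Integer> doc : docs) {
            for (String term : doc.keySet()) {
                df.merge(term, 1, Integer::sum);
            }
        }
        return df;
    }

    /** Reweights raw frequencies as tf * idf, with idf = ln(N / df) + 1 (an assumed variant). */
    static Map<String, Double> tfIdf(Map<String, Integer> termFreqs,
                                     Map<String, Integer> df, int numDocs) {
        Map<String, Double> weighted = new HashMap<>();
        for (Map.Entry<String, Integer> e : termFreqs.entrySet()) {
            // Terms missing from df are treated as occurring in one document.
            int docFreq = df.getOrDefault(e.getKey(), 1);
            double idf = Math.log((double) numDocs / docFreq) + 1.0;
            weighted.put(e.getKey(), e.getValue() * idf);
        }
        return weighted;
    }

    /** Cosine similarity of two sparse vectors; returns 0 if either vector is empty. */
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            normA += e.getValue() * e.getValue();
            Double other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        for (double v : b.values()) {
            normB += v * v;
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    /** Builds the symmetric document-document similarity matrix. */
    static double[][] similarityMatrix(List<Map<String, Integer>> docs) {
        Map<String, Integer> df = documentFrequencies(docs);
        List<Map<String, Double>> vectors = new ArrayList<>();
        for (Map<String, Integer> doc : docs) {
            vectors.add(tfIdf(doc, df, docs.size()));
        }
        int n = docs.size();
        double[][] sim = new double[n][n];
        for (int i = 0; i < n; i++) {
            for (int j = i; j < n; j++) {
                double s = cosine(vectors.get(i), vectors.get(j));
                sim[i][j] = s;
                sim[j][i] = s;
            }
        }
        return sim;
    }
}
```

The list of weighted maps is itself a sparse document-term matrix, one row per document. For 584 documents the full pairwise matrix has only 584 x 584 entries, so computing it in memory like this should be feasible before handing it to a clustering library.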
