lucene-java-user mailing list archives

From parnab kumar <parnab.2...@gmail.com>
Subject Re: Document Term matrix
Date Tue, 11 Nov 2014 21:37:51 GMT
Hi,

While indexing the documents, store term vectors for the content
field. Each document then has an array of terms and their
corresponding frequencies in that document. You can retrieve these term
vectors through the IndexReader. The similarity between two documents can
then be computed as the similarity of their term vectors (cosine
similarity, for example). Since tf-idf is the best-known weighting and
usually gives a better sense of similarity than raw counts, multiply each
term's frequency by the idf of that term to reweight the vectors.
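
A minimal sketch of the reweighting and comparison step, in plain Java
with no Lucene dependency. It assumes the per-document term frequencies
and the per-term document frequencies have already been read out of the
index (in Lucene 4.x, probably via IndexReader.getTermVector(docId, field)
and IndexReader.docFreq(term)) and copied into ordinary maps. The class
name TfIdfCosine, its method names, and the smoothed idf formula
1 + ln(N / (df + 1)) are choices made for this illustration; any other idf
variant would work the same way.

```java
import java.util.HashMap;
import java.util.Map;

final class TfIdfCosine {

    // Smoothed inverse document frequency: 1 + ln(N / (df + 1)).
    // This is one common variant, picked here as an assumption.
    static double idf(int docFreq, int numDocs) {
        return 1.0 + Math.log((double) numDocs / (docFreq + 1));
    }

    // Reweights one document's raw term frequencies by idf.
    // termFreqs: term -> frequency within this document
    // docFreqs:  term -> number of documents containing the term
    static Map<String, Double> weight(Map<String, Integer> termFreqs,
                                      Map<String, Integer> docFreqs,
                                      int numDocs) {
        Map<String, Double> weighted = new HashMap<>();
        for (Map.Entry<String, Integer> e : termFreqs.entrySet()) {
            int df = docFreqs.getOrDefault(e.getKey(), 0);
            weighted.put(e.getKey(), e.getValue() * idf(df, numDocs));
        }
        return weighted;
    }

    // Cosine similarity between two sparse weighted vectors.
    // Returns 0 when either vector is empty or all zeros.
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            normA += e.getValue() * e.getValue();
            Double other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        for (double v : b.values()) {
            normB += v * v;
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}
```

Calling weight(...) once per document gives the rows of a sparse
document-term matrix, and calling cosine(...) on every pair of rows gives
the document-document similarity matrix. For 584 documents that is about
170,000 pairs, which is small enough to compute directly.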

Thanks,
Parnab

On Tue, Nov 11, 2014 at 8:36 PM, Elshaimaa Ali <elshaimaa.ali@hotmail.com>
wrote:

> Hi All,
> I have a Lucene index built with Lucene 4.9 for 584 text documents. I need
> to extract a document-term matrix and a document-document similarity
> matrix in order to cluster the documents. My questions:
> 1- How can I extract the matrix and compute the similarity between
> documents in Lucene?
> 2- Is there any Java-based code that can cluster the documents from a
> Lucene index?
> Regards,
> Shaimaa
>
>
