lucene-java-user mailing list archives

From Grant Ingersoll <gsing...@apache.org>
Subject Re: A simple Vector Space Model and TFIDF usage
Date Tue, 30 Jun 2009 16:13:21 GMT

On Jun 29, 2009, at 3:10 PM, Amir Hossein Jadidinejad wrote:

> Hi,
> This is my first experiment with Lucene. Please help me.
> I'm going to index a set of documents and create a feature vector
> for each of them. Each vector contains all the terms that belong to
> the document, weighted using TF-IDF.
> After that I want to compute the cosine similarity between all
> pairs of documents and produce a doc-doc similarity matrix. My
> document set is large, so it's important to have a scalable
> implementation.


See Mahout (http://lucene.apache.org/mahout).  The utils module has a
class called LuceneIterable that the o.a.mahout.utils.vectors.Driver
program can use to convert a Lucene index into Mahout's Vector
representation, which can then be used to create a doc-doc similarity
matrix.  Mahout uses Hadoop, so you can go as big as you want.

See http://cwiki.apache.org/confluence/display/MAHOUT/Creating+Vectors+from+Text
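
To make the computation in the question concrete, here is a minimal
in-memory sketch of TF-IDF weighting and a doc-doc cosine similarity
matrix. It is plain Python and does not use the Lucene or Mahout APIs;
the function names, the toy documents, and the weighting (raw term
frequency times log(N/df)) are assumptions chosen for illustration, and
Lucene's own scoring uses a different formula. It will not scale to a
large collection, which is the part Mahout on Hadoop is meant to handle.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict: term -> weight) per tokenized document."""
    n = len(docs)
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Assumed weighting: raw term frequency times log(N / df).
        vectors.append({t: f * math.log(n / df[t]) for t, f in tf.items()})
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        # A vector with no weighted terms is treated as similar to nothing.
        return 0.0
    return dot / (norm_u * norm_v)


def similarity_matrix(docs):
    """Compare every document with every other one (quadratic in the number of docs)."""
    vecs = tfidf_vectors(docs)
    return [[cosine(a, b) for b in vecs] for a in vecs]


# Hypothetical toy corpus, already tokenized on whitespace.
docs = [
    "lucene index search".split(),
    "lucene index vector".split(),
    "hadoop cluster job".split(),
]
matrix = similarity_matrix(docs)
```

Here matrix[i][j] holds the similarity between documents i and j. The
matrix is symmetric, so a larger implementation would compute only one
triangle of it.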

-Grant

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
using Solr/Lucene:
http://www.lucidimagination.com/search

