lucene-java-user mailing list archives

From Lorenzo Viscanti <>
Subject Re: Regarding Lucene and LSI
Date Fri, 07 Oct 2005 08:47:05 GMT
I use my own LSI implementation based on Lucene for text clustering.
I've done some tests, but I do believe that integrating LSI into the Lucene
search subsystem (i.e. creating something like LSISimilarity) is not easy.

I start by analyzing the documents with Lucene, and then extract tf-idf values
(again with Lucene) in order to build a term-document matrix. Then I run
an LSI/SVD implementation on that matrix.
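
As a rough illustration of the matrix-building step, here is a minimal sketch in plain Python rather than Lucene code. The function name `build_tfidf_matrix`, the whitespace tokenizer (a stand-in for a Lucene Analyzer), and the weighting tf * log(N/df) are all assumptions made for the example; the weights Lucene reports may be computed differently.

```python
import math
from collections import Counter


def build_tfidf_matrix(docs):
    """Return (terms, matrix); matrix[i][j] is the weight of terms[i] in docs[j].

    Hypothetical helper: rows are terms, columns are documents.
    """
    # Stand-in for real analysis: lowercase and split on whitespace.
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents each term occurs.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    terms = sorted(df)
    counts = [Counter(tokens) for tokens in tokenized]

    matrix = []
    for term in terms:
        idf = math.log(n_docs / df[term])  # assumed idf variant
        matrix.append([counts[j][term] * idf for j in range(n_docs)])
    return terms, matrix
```

The resulting matrix is what would then be handed to an SVD routine from a linear algebra library.
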
At this point I think that assigning the scores back to Lucene documents
is very difficult, but I'm trying to read the modified scores from the
matrix in my LSISimilarity.
Clustering search results this way, on the other hand, is not too difficult:
I just apply the algorithm (mostly HAC-like) to the modified matrix.
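
To make the clustering step concrete, here is a small sketch of hierarchical agglomerative clustering over document vectors taken from the reduced matrix. It is not the algorithm actually used above, which is only described as "mostly HAC-like"; the average-link criterion, the cosine measure, the stopping threshold, and the names `cosine` and `hac` are assumptions for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def hac(vectors, threshold):
    """Average-link agglomerative clustering.

    Repeatedly merges the two most similar clusters and stops once the
    best available similarity drops below `threshold`.
    Returns a list of clusters, each a list of indices into `vectors`.
    """
    clusters = [[i] for i in range(len(vectors))]

    def avg_sim(c1, c2):
        total = sum(cosine(vectors[i], vectors[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        best, pair = -2.0, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                s = avg_sim(clusters[a], clusters[b])
                if s > best:
                    best, pair = s, (a, b)
        if best < threshold:
            break
        a, b = pair
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters
```

This brute-force version recomputes every pairwise similarity on each pass, which is acceptable for a page of search results but not for a whole index.
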
To search using LSI you must choose a small subset of the collection,
apply LSI/SVD to it, and then extend the matrix by 'folding in' new
documents. But how should the initial subset be chosen? Maybe just by
searching the index and using the first n documents retrieved.
Any ideas?
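
For reference, folding in is usually written as d_hat = S_k^-1 U_k^T d, where U_k holds the first k left singular vectors (one row per term) and S_k the first k singular values of the initial subset's matrix. The sketch below assumes those two pieces already come from an SVD routine; the name `fold_in` and the list-of-lists layout are illustrative only.

```python
def fold_in(doc_vector, u_k, s_k):
    """Project a term-space document vector into the k-dimensional LSI space.

    doc_vector: weights of the new document, one entry per term
    u_k:        len(doc_vector) rows by k columns (left singular vectors)
    s_k:        the k largest singular values, all non-zero
    Computes d_hat = S_k^-1 * U_k^T * d without re-running the SVD.
    """
    n_terms = len(doc_vector)
    return [
        sum(u_k[t][c] * doc_vector[t] for t in range(n_terms)) / s_k[c]
        for c in range(len(s_k))
    ]
```

Folded-in documents do not change U_k or S_k, so the quality of the projection depends on how representative the initial subset was.
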

On 10/7/05, Paul Libbrecht <> wrote:
> I've met other people with such needs, and we would also be interested.
> Unfortunately, this does not seem to be available.
> A clear issue might be that LSI, in its original form at least, is
> covered by a US patent. But maybe someone will find another form which is
> not.
> paul
> On 5 Oct 05, at 14:59, <> wrote:
> > I am looking for an LSI implementation in Lucene. Is it available? I
> > couldn't find it on the website. I searched the archives but found no
> > help. Could someone tell me whether it is available or not?
> >
> > Could you also tell me where I can look for any language
> > processing tools for indexing and retrieval?