lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Marvin Humphrey <>
Subject Re: Lucene & LSA
Date Thu, 14 Dec 2006 20:10:53 GMT

On Dec 14, 2006, at 11:16 AM, Soeren Pekrul wrote:

>> it is possible to extract the matrix from the indexing file?
> I don’t know any API to extract the matrix from the index file  
> directly.

How could we make it work to write an open source decomposed vector  
model search engine a la LSA without running afoul of the LSA  
patents?  Maybe use an algorithm other than SVD for the decomposition?

I'm only superficially familiar with LSA, but I'm always looking for  
ways to improve relevance.  In theory it would be nice to factor in a  
decomposed similarity measure, so that on a search for 'napoleonic  
war', documents which contained a lot of words which were similar to  
either 'napoleon' and 'war' would score higher than documents which  
had only a passing mention.

Personally, I'm less interested in "more like this" queries, because  
the precision of search results based solely on on similar document  
vectors is so poor -- proper names and other rare tokens unrelated to  
the original query wreak havok on the relevance scores.  But maybe  
there's a way in the original keyword search to juice up the scores  
of documents which not only contain the original terms, but also a  
lot of terms which are similar to them.

I dunno if it would be worth the computational effort, though.  A  
decomposed matrix is going to be inherently expensive to generate,  
because you have to start from a complete matrix.  That doesn't jibe  
well with incremental indexing.

Also, it's not clear to me how much of a gain we'd get in relevance.   
My hunch is that shorter, tightly focused documents would benefit  
some and that longer more diffuse documents -- which might contain  
passages which were just as useful as those in a shorter document --  
would lose.  That wouldn't be helpful for a common case in naive web  
search, where impossible-to-exclude navigational and advertising text  
could end up diluting the scores of perfectly good material.

Marvin Humphrey
Rectangular Research

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message