lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Miles Efron <>
Subject Re: Lucene & LSA
Date Thu, 14 Dec 2006 20:23:59 GMT
U of Tennessee professor Michael Berry maintains a good site regarding 
software for computing SVD on large, sparse matrices:

The site also points to the LSI patent.

FWIW it's very easy to extract term-doc counts from a lucene index and 
format them for software such as SVDPACK


Of course using the resulting matrices isn't so trivial, if you want to 
stay within Lucene.  Also, they are dense, so even at relatively low 
dimensionality, you're still storing a lot of data.


On Thu, 14 Dec 2006, Marvin Humphrey wrote:

> On Dec 14, 2006, at 11:16 AM, Soeren Pekrul wrote:
>>> it is possible to extract the matrix from the indexing file?
>> I don?t know any API to extract the matrix from the index file directly.
> How could we make it work to write an open source decomposed vector model 
> search engine a la LSA without running afoul of the LSA patents?  Maybe use 
> an algorithm other than SVD for the decomposition?
> I'm only superficially familiar with LSA, but I'm always looking for ways to 
> improve relevance.  In theory it would be nice to factor in a decomposed 
> similarity measure, so that on a search for 'napoleonic war', documents which 
> contained a lot of words which were similar to either 'napoleon' and 'war' 
> would score higher than documents which had only a passing mention.
> Personally, I'm less interested in "more like this" queries, because the 
> precision of search results based solely on on similar document vectors is so 
> poor -- proper names and other rare tokens unrelated to the original query 
> wreak havok on the relevance scores.  But maybe there's a way in the original 
> keyword search to juice up the scores of documents which not only contain the 
> original terms, but also a lot of terms which are similar to them.
> I dunno if it would be worth the computational effort, though.  A decomposed 
> matrix is going to be inherently expensive to generate, because you have to 
> start from a complete matrix.  That doesn't jibe well with incremental 
> indexing.
> Also, it's not clear to me how much of a gain we'd get in relevance.  My 
> hunch is that shorter, tightly focused documents would benefit some and that 
> longer more diffuse documents -- which might contain passages which were just 
> as useful as those in a shorter document -- would lose.  That wouldn't be 
> helpful for a common case in naive web search, where impossible-to-exclude 
> navigational and advertising text could end up diluting the scores of 
> perfectly good material.
> Marvin Humphrey
> Rectangular Research
> ---------------------------------------------------------------------
> To unsubscribe, e-mail:
> For additional commands, e-mail:

Miles Efron

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message