lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Miles Efron <mef...@metalab.unc.edu>
Subject Re: Lucene & LSA
Date Thu, 14 Dec 2006 20:23:59 GMT
U of Tennessee professor Michael Berry maintains a good site regarding 
software for computing SVD on large, sparse matrices:

 	http://www.cs.utk.edu/~lsi/

The site also points to the LSI patent.

FWIW it's very easy to extract term-doc counts from a lucene index and 
format them for software such as SVDPACK

 	http://www.netlib.org/svdpack/index.html

or

 	http://tedlab.mit.edu:16080/~dr/SVDLIBC/

Of course using the resulting matrices isn't so trivial, if you want to 
stay within Lucene.  Also, they are dense, so even at relatively low 
dimensionality, you're still storing a lot of data.

-Miles

On Thu, 14 Dec 2006, Marvin Humphrey wrote:

>
> On Dec 14, 2006, at 11:16 AM, Soeren Pekrul wrote:
>
>>> it is possible to extract the matrix from the indexing file?
>> 
>> I don?t know any API to extract the matrix from the index file directly.
>
> How could we make it work to write an open source decomposed vector model 
> search engine a la LSA without running afoul of the LSA patents?  Maybe use 
> an algorithm other than SVD for the decomposition?
>
> I'm only superficially familiar with LSA, but I'm always looking for ways to 
> improve relevance.  In theory it would be nice to factor in a decomposed 
> similarity measure, so that on a search for 'napoleonic war', documents which 
> contained a lot of words which were similar to either 'napoleon' and 'war' 
> would score higher than documents which had only a passing mention.
>
> Personally, I'm less interested in "more like this" queries, because the 
> precision of search results based solely on on similar document vectors is so 
> poor -- proper names and other rare tokens unrelated to the original query 
> wreak havok on the relevance scores.  But maybe there's a way in the original 
> keyword search to juice up the scores of documents which not only contain the 
> original terms, but also a lot of terms which are similar to them.
>
> I dunno if it would be worth the computational effort, though.  A decomposed 
> matrix is going to be inherently expensive to generate, because you have to 
> start from a complete matrix.  That doesn't jibe well with incremental 
> indexing.
>
> Also, it's not clear to me how much of a gain we'd get in relevance.  My 
> hunch is that shorter, tightly focused documents would benefit some and that 
> longer more diffuse documents -- which might contain passages which were just 
> as useful as those in a shorter document -- would lose.  That wouldn't be 
> helpful for a common case in naive web search, where impossible-to-exclude 
> navigational and advertising text could end up diluting the scores of 
> perfectly good material.
>
> Marvin Humphrey
> Rectangular Research
> http://www.rectangular.com/
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org

__________________________________
Miles Efron
http://www.ibiblio.org/mefron
mefron@ibiblio.org

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message