mahout-user mailing list archives

From Jake Mannix <jake.man...@gmail.com>
Subject Re: lsi
Date Mon, 14 Nov 2011 06:40:30 GMT
On Sun, Nov 13, 2011 at 10:31 PM, Ted Dunning <ted.dunning@gmail.com> wrote:
>
> I have done this with Lucene (some time ago) and had a hell of a time
> getting decent performance if I wanted to rescore a thousand documents from
> a disk based index.  That implies a memory based system again.  The cost of
> a thousand or so rescores is probably about a millisecond or so.  Since
> each vector is roughly a few cache lines in size, the achievable memory
> bandwidth should be significant.
>

Yeah, I guess I tend to make the assumption that everyone is all in memory
like I've been the past 4 years or so.  I have no idea what the current
Lucene cost of looking up additional binary payloads from disk is while in
the inner loop.  I could totally believe it's prohibitive.
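
For concreteness, the memory-based rescoring step discussed above might look
roughly like the sketch below.  It assumes the dense LSI vectors for all
documents are already held in an in-memory dict, and that a first-pass search
has produced the candidate ids.  The names `rescore`, `doc_vectors` and
`candidates` are hypothetical, not Lucene or Mahout APIs, and cosine
similarity is just one plausible choice of score.

```python
import heapq
import math


def cosine(a, b):
    """Cosine similarity of two equal-length dense vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rescore(query_vec, candidates, doc_vectors, top_k=10):
    """Rescore first-pass candidates against the query in the projected space.

    candidates  -- doc ids from the first-pass retrieval (hypothetical input)
    doc_vectors -- dict of doc id -> dense vector, held entirely in memory
    Returns the top_k (score, doc_id) pairs, best first.
    """
    scored = ((cosine(query_vec, doc_vectors[doc_id]), doc_id)
              for doc_id in candidates)
    return heapq.nlargest(top_k, scored)
```

Nothing in the loop touches disk, which is the whole point: each candidate
costs one pass over a short vector.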


> > Alternatively, to improve recall, at index-time, supplement each document
> > by terms in a new field "lsi_expanded" which are the terms closest in the
> > SVD projected space to the document, but aren't already in it.  Then at
> > query time, add an "... OR lsi_expanded:<query>" clause onto your query.
> > Instant query-expansion for recall enhancement.
> >
>
> This actually is pretty tricky to make well.
>

I never said it was necessarily a *good* idea to use LSI in this way (or, in
fact, to use LSI at all), just that if you *do* have a good scoring model
(like some kind of strongly predictive static prior, like PageRank), then
doing even fairly dumb recall-enhancing techniques can improve things quite
nicely, and "discretized" LSI like this is a "not completely dumb" way to
enhance recall.
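
A minimal sketch of the "lsi_expanded" idea quoted above, assuming the term
and document vectors in the SVD-projected space have already been computed
elsewhere.  The function names, the choice of cosine similarity, and the
cutoff `k` are illustrative assumptions, and the query string just mirrors
the "... OR lsi_expanded:<query>" form from the quoted text.

```python
import heapq
import math


def cosine(a, b):
    """Cosine similarity of two equal-length dense vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def expansion_terms(doc_terms, doc_vec, term_vectors, k=5):
    """Pick the k terms closest to the document in the projected space,
    skipping terms the document already contains.

    doc_terms    -- set of terms already present in the document
    doc_vec      -- the document's vector in the projected space
    term_vectors -- dict of term -> vector in the same space
    """
    scored = ((cosine(doc_vec, vec), term)
              for term, vec in term_vectors.items()
              if term not in doc_terms)
    return [term for _, term in heapq.nlargest(k, scored)]


def expanded_query(query, field="lsi_expanded"):
    """Add the OR clause against the expansion field at query time."""
    return f"({query}) OR {field}:({query})"
```

At index time the output of `expansion_terms` would be written into the
"lsi_expanded" field of each document; at query time `expanded_query("car")`
gives `(car) OR lsi_expanded:(car)`.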

  -jake
