From: Jake Mannix <jake.mannix@gmail.com>
Date: Tue, 8 Jun 2010 09:26:52 -0700
Subject: Re: Generating a Document Similarity Matrix
To: user@mahout.apache.org

Hi Kris,

If you generate a full document-document similarity matrix offline and then
sparsify the rows (trim off all similarities below a threshold, or only take
the top N for each row, etc.), then encoding these values directly in the
index would indeed allow for *superfast* MoreLikeThis functionality, because
you've already computed all of the similar results offline. The only downside
is that it won't apply to newly indexed documents. If your indexing setup is
such that you don't fold in new documents live, but do so in batch, then this
should be fine.
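Purely as an illustration of that sparsification step (a sketch, not code
from this thread), keeping only the top-N above-threshold similarities in
one row of the matrix could look like this in plain Java:

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.Comparator;
    import java.util.List;
    import java.util.PriorityQueue;

    /**
     * Sketch of the offline sparsification step: for each row of the
     * document-document similarity matrix, keep only the top-N entries
     * above a threshold so the stored neighbour lists stay small.
     * Callers would typically skip the document's own id as well.
     */
    public class RowSparsifier {

      /** Returns the ids of the (at most) n most similar documents for one row. */
      public static List<Integer> topNeighbours(final double[] row, int n, double threshold) {
        // min-heap on similarity: the weakest neighbour kept so far sits on top
        PriorityQueue<Integer> heap = new PriorityQueue<Integer>(n,
            new Comparator<Integer>() {
              public int compare(Integer a, Integer b) {
                return Double.compare(row[a], row[b]);
              }
            });
        for (int doc = 0; doc < row.length; doc++) {
          if (row[doc] < threshold) {
            continue;                    // trim everything below the threshold
          }
          if (heap.size() < n) {
            heap.add(doc);
          } else if (row[doc] > row[heap.peek()]) {
            heap.poll();                 // evict the current weakest neighbour
            heap.add(doc);
          }
        }
        List<Integer> neighbours = new ArrayList<Integer>(heap);
        Collections.sort(neighbours, new Comparator<Integer>() {
          public int compare(Integer a, Integer b) {
            return Double.compare(row[b], row[a]);   // strongest first
          }
        });
        return neighbours;
      }
    }

The surviving neighbour ids (and their scores) can then be written into a
dedicated field of each document at batch-indexing time, so a MoreLikeThis
lookup reduces to a single stored-field read.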
An alternative is to use something like a Locality-Sensitive Hash (one of my
co-workers is writing up a nice implementation of this now, and I'm going to
get him to contribute it once it's fully tested) to reduce the search space
(as a Lucene Filter) and speed up the query.

  -jake

On Tue, Jun 8, 2010 at 8:11 AM, Kris Jack wrote:

> Hi Olivier,
>
> Thanks for your suggestions. I have over 10 million documents and they have
> quite a lot of meta-data associated with them, including rather large text
> fields. It is possible to tweak the moreLikeThis function from solr. I have
> tried changing the parameters (http://wiki.apache.org/solr/MoreLikeThis)
> but am not managing to get results in under 300ms without sacrificing the
> quality of the results too much.
>
> I suspect that there would be gains to be made from reducing the
> dimensionality of the feature vectors before indexing with lucene, so I may
> give that a try. I'll keep you posted if I come up with other solutions.
>
> Thanks,
> Kris
>
>
> 2010/6/8 Olivier Grisel
>
> > 2010/6/8 Kris Jack:
> > > Hi everyone,
> > >
> > > I currently use lucene's moreLikeThis function through solr to find
> > > documents that are related to one another. A single call, however,
> > > takes around 4 seconds to complete and I would like to reduce this.
> > > I got to thinking that I might be able to use Mahout to generate a
> > > document similarity matrix offline that could then be looked up in
> > > real time for serving. Is this a reasonable use of Mahout? If so,
> > > what functions will generate a document similarity matrix? Also, I
> > > would like to keep the text-processing advantages provided through
> > > lucene, so it would help if I could still use my lucene index. If
> > > not, then could you recommend any alternative solutions please?
> >
> > How many documents do you have in your index? Have you tried to tweak
> > the MoreLikeThis parameters? (I don't know if it's possible using the
> > solr interface; I use it directly through the lucene java API.)
> >
> > For instance, you can trade off recall for speed by decreasing the
> > number of terms to use in the query, and trade recall for precision
> > and speed by increasing the percentage of terms that should match.
> >
> > You could also use Mahout's implementation of SVD to build
> > low-dimensional semantic vectors representing your documents (a.k.a.
> > Latent Semantic Indexing) and then index those transformed frequency
> > vectors in a dedicated lucene index (or document field, provided you
> > name the resulting terms with something that does not match real-life
> > terms present in other fields). However, standard SVD will probably
> > produce dense (as opposed to sparse) low-dimensional semantic vectors,
> > and I don't think lucene's lookup performance is good with dense
> > frequency vectors, even though the number of terms is greatly reduced
> > by SVD. Hence it would probably be better to either keep only the top
> > 100 absolute values of each semantic vector before indexing (probably
> > the simpler solution) or use a sparsifying, penalty-constrained
> > variant of SVD / LSI. You should have a look at the literature on
> > sparse coding or sparse dictionary learning, Sparse PCA and, more
> > generally, L1-penalty regression methods such as the Lasso and LARS.
> > I don't know of any library for sparse semantic coding of documents
> > that works automatically with lucene; probably some non-trivial
> > coding is needed there.
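For reference, the MoreLikeThis tuning described above, done directly
against the Lucene Java API, might look roughly like the sketch below; the
index path and field names are placeholders, and the exact package
locations and method names vary between Lucene versions:

    import java.io.File;

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.search.similar.MoreLikeThis;
    import org.apache.lucene.store.FSDirectory;

    /** Sketch: trade recall for speed by shrinking the generated MLT query. */
    public class TunedMoreLikeThis {
      public static void main(String[] args) throws Exception {
        IndexReader reader = IndexReader.open(FSDirectory.open(new File("/path/to/index")));
        IndexSearcher searcher = new IndexSearcher(reader);

        MoreLikeThis mlt = new MoreLikeThis(reader);
        mlt.setFieldNames(new String[] {"title", "abstract"}); // placeholder field names
        mlt.setMinTermFreq(2);     // ignore terms that are rare within the source doc
        mlt.setMinDocFreq(5);      // ignore terms that are rare in the whole corpus
        mlt.setMaxQueryTerms(15);  // fewer query terms => faster queries, less recall
        // if the fields don't store term vectors, also call mlt.setAnalyzer(...)

        int sourceDoc = 42;        // internal id of the document we want neighbours for
        Query query = mlt.like(sourceDoc);
        TopDocs hits = searcher.search(query, 10);
        System.out.println("found " + hits.totalHits + " similar documents");

        searcher.close();
        reader.close();
      }
    }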
> >
> > Another alternative is finding low-dimensional (64 or 32 components)
> > dense codes, binary-thresholding them, storing the integer code in the
> > DB or the lucene index, and then building smart exact-match queries to
> > find all documents lying in the Hamming ball of size 1 or 2 around the
> > reference document's binary code. But I think this approach, while
> > promising for web-scale document collections, is even more experimental
> > and requires very good low-dimensional encoders (I don't think linear
> > models such as SVD are good enough for reducing sparse 10e6-component
> > vectors to dense 64-component vectors; non-linear encoders such as
> > stacked Restricted Boltzmann Machines are probably a better choice).
> >
> > In any case, let us know about your results; I am really interested in
> > practical yet scalable solutions to this problem.
> >
> > --
> > Olivier
> > http://twitter.com/ogrisel - http://github.com/ogrisel
>
>
> --
> Dr Kris Jack,
> http://www.mendeley.com/profiles/kris-jack/
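A small sketch of the Hamming-ball lookup Olivier outlines above, assuming
64-bit binary codes stored as exact-match terms (an assumption, not part of
the original thread): the enumeration produces every candidate code within
distance 1 or 2 of a reference code, and each candidate can then be ORed
into an exact-match query against the index or DB.

    import java.util.ArrayList;
    import java.util.List;

    /**
     * Sketch: enumerate all 64-bit codes within Hamming distance <= radius
     * of a reference code, so documents with near-identical binary codes
     * can be retrieved with exact-match lookups only.
     */
    public class HammingBall {

      public static List<Long> ball(long code, int radius) {
        List<Long> candidates = new ArrayList<Long>();
        candidates.add(code);                        // distance 0: the code itself
        if (radius >= 1) {
          for (int i = 0; i < 64; i++) {
            candidates.add(code ^ (1L << i));        // flip one bit
          }
        }
        if (radius >= 2) {
          for (int i = 0; i < 64; i++) {
            for (int j = i + 1; j < 64; j++) {
              candidates.add(code ^ (1L << i) ^ (1L << j));  // flip two bits
            }
          }
        }
        return candidates;
      }
    }

Radius 2 already means a couple of thousand exact-match lookups per query
(1 + 64 + 2016 = 2081 candidates for 64-bit codes), which is why this only
stays practical when the codes themselves are short.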