From: Jake Mannix <jake.mannix@gmail.com>
Date: Tue, 8 Jun 2010 09:26:52 -0700
Subject: Re: Generating a Document Similarity Matrix
To: user@mahout.apache.org

Hi Kris,

If you generate a full document-document similarity matrix offline and then
sparsify the rows (trim off all similarities below a threshold, or only take
the top N for each row, etc.), then encoding these values directly in the
index would indeed allow for *superfast* MoreLikeThis functionality, because
you've already computed all of the similar results offline. The only downside
is that it won't apply to newly indexed documents. If your indexing setup is
such that you don't fold in new documents live, but do so in batch, then this
should be fine.
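Purely as an illustration of that sparsification step (a sketch, not code
from this thread), keeping only the top-N above-threshold similarities in
one row of the matrix could look like this in plain Java:

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.Comparator;
    import java.util.List;
    import java.util.PriorityQueue;

    /**
     * Sketch of the offline sparsification step: for each row of the
     * document-document similarity matrix, keep only the top-N entries
     * above a threshold so the stored neighbour lists stay small.
     * Callers would typically skip the document's own id as well.
     */
    public class RowSparsifier {

      /** Returns the ids of the (at most) n most similar documents for one row. */
      public static List<Integer> topNeighbours(final double[] row, int n, double threshold) {
        // min-heap on similarity: the weakest neighbour kept so far sits on top
        PriorityQueue<Integer> heap = new PriorityQueue<Integer>(n,
            new Comparator<Integer>() {
              public int compare(Integer a, Integer b) {
                return Double.compare(row[a], row[b]);
              }
            });
        for (int doc = 0; doc < row.length; doc++) {
          if (row[doc] < threshold) {
            continue;                    // trim everything below the threshold
          }
          if (heap.size() < n) {
            heap.add(doc);
          } else if (row[doc] > row[heap.peek()]) {
            heap.poll();                 // evict the current weakest neighbour
            heap.add(doc);
          }
        }
        List<Integer> neighbours = new ArrayList<Integer>(heap);
        Collections.sort(neighbours, new Comparator<Integer>() {
          public int compare(Integer a, Integer b) {
            return Double.compare(row[b], row[a]);   // strongest first
          }
        });
        return neighbours;
      }
    }

The surviving neighbour ids (and their scores) can then be written into a
dedicated field of each document at batch-indexing time, so a MoreLikeThis
lookup reduces to a single stored-field read.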
An alternative is to use something like a Locality-Sensitive Hash (one of my
co-workers is writing up a nice implementation of this now, and I'm going to
get him to contribute it once it's fully tested) to reduce the search space
(as a Lucene Filter) and speed up the query.

  -jake

On Tue, Jun 8, 2010 at 8:11 AM, Kris Jack wrote:

> Hi Olivier,
>
> Thanks for your suggestions. I have over 10 million documents and they have
> quite a lot of meta-data associated with them, including rather large text
> fields. It is possible to tweak the moreLikeThis function from solr. I have
> tried changing the parameters (http://wiki.apache.org/solr/MoreLikeThis)
> but am not managing to get results in under 300ms without sacrificing the
> quality of the results too much.
>
> I suspect that there would be gains to be made from reducing the
> dimensionality of the feature vectors before indexing with lucene, so I may
> give that a try. I'll keep you posted if I come up with other solutions.
>
> Thanks,
> Kris
>
>
> 2010/6/8 Olivier Grisel
>
> > 2010/6/8 Kris Jack:
> > > Hi everyone,
> > >
> > > I currently use lucene's moreLikeThis function through solr to find
> > > documents that are related to one another. A single call, however,
> > > takes around 4 seconds to complete and I would like to reduce this.
> > > I got to thinking that I might be able to use Mahout to generate a
> > > document similarity matrix offline that could then be looked up in
> > > real time for serving. Is this a reasonable use of Mahout? If so,
> > > what functions will generate a document similarity matrix? Also, I
> > > would like to keep the text-processing advantages provided through
> > > lucene, so it would help if I could still use my lucene index. If
> > > not, then could you recommend any alternative solutions please?
> >
> > How many documents do you have in your index? Have you tried to tweak
> > the MoreLikeThis parameters? (I don't know if it's possible using the
> > solr interface; I use it directly through the lucene java API.)
> >
> > For instance, you can trade off recall for speed by decreasing the
> > number of terms to use in the query, and trade recall for precision
> > and speed by increasing the percentage of terms that should match.
> >
> > You could also use Mahout's implementation of SVD to build
> > low-dimensional semantic vectors representing your documents (a.k.a.
> > Latent Semantic Indexing) and then index those transformed frequency
> > vectors in a dedicated lucene index (or document field, provided you
> > name the resulting terms with something that does not match real-life
> > terms present in other fields). However, standard SVD will probably
> > produce dense (as opposed to sparse) low-dimensional semantic vectors,
> > and I don't think lucene's lookup performance is good with dense
> > frequency vectors, even though the number of terms is greatly reduced
> > by SVD. Hence it would probably be better to either keep only the top
> > 100 absolute values of each semantic vector before indexing (probably
> > the simpler solution) or use a sparsifying, penalty-constrained
> > variant of SVD / LSI. You should have a look at the literature on
> > sparse coding or sparse dictionary learning, Sparse PCA and, more
> > generally, L1-penalty regression methods such as the Lasso and LARS.
> > I don't know of any library for sparse semantic coding of documents
> > that works automatically with lucene; probably some non-trivial
> > coding is needed there.
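For reference, the MoreLikeThis tuning described above, done directly
against the Lucene Java API, might look roughly like the sketch below; the
index path and field names are placeholders, and the exact package
locations and method names vary between Lucene versions:

    import java.io.File;

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.search.similar.MoreLikeThis;
    import org.apache.lucene.store.FSDirectory;

    /** Sketch: trade recall for speed by shrinking the generated MLT query. */
    public class TunedMoreLikeThis {
      public static void main(String[] args) throws Exception {
        IndexReader reader = IndexReader.open(FSDirectory.open(new File("/path/to/index")));
        IndexSearcher searcher = new IndexSearcher(reader);

        MoreLikeThis mlt = new MoreLikeThis(reader);
        mlt.setFieldNames(new String[] {"title", "abstract"}); // placeholder field names
        mlt.setMinTermFreq(2);     // ignore terms that are rare within the source doc
        mlt.setMinDocFreq(5);      // ignore terms that are rare in the whole corpus
        mlt.setMaxQueryTerms(15);  // fewer query terms => faster queries, less recall
        // if the fields don't store term vectors, also call mlt.setAnalyzer(...)

        int sourceDoc = 42;        // internal id of the document we want neighbours for
        Query query = mlt.like(sourceDoc);
        TopDocs hits = searcher.search(query, 10);
        System.out.println("found " + hits.totalHits + " similar documents");

        searcher.close();
        reader.close();
      }
    }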
> >
> > Another alternative is finding low-dimensional (64 or 32 components)
> > dense codes, binary-thresholding them, storing the integer code in the
> > DB or the lucene index, and then building smart exact-match queries to
> > find all documents lying in the Hamming ball of size 1 or 2 around the
> > reference document's binary code. But I think this approach, while
> > promising for web-scale document collections, is even more experimental
> > and requires very good low-dimensional encoders (I don't think linear
> > models such as SVD are good enough for reducing sparse 10e6-component
> > vectors to dense 64-component vectors; non-linear encoders such as
> > stacked Restricted Boltzmann Machines are probably a better choice).
> >
> > In any case, let us know about your results; I am really interested in
> > practical yet scalable solutions to this problem.
> >
> > --
> > Olivier
> > http://twitter.com/ogrisel - http://github.com/ogrisel
>
>
> --
> Dr Kris Jack,
> http://www.mendeley.com/profiles/kris-jack/
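A small sketch of the Hamming-ball lookup Olivier outlines above, assuming
64-bit binary codes stored as exact-match terms (an assumption, not part of
the original thread): the enumeration produces every candidate code within
distance 1 or 2 of a reference code, and each candidate can then be ORed
into an exact-match query against the index or DB.

    import java.util.ArrayList;
    import java.util.List;

    /**
     * Sketch: enumerate all 64-bit codes within Hamming distance <= radius
     * of a reference code, so documents with near-identical binary codes
     * can be retrieved with exact-match lookups only.
     */
    public class HammingBall {

      public static List<Long> ball(long code, int radius) {
        List<Long> candidates = new ArrayList<Long>();
        candidates.add(code);                        // distance 0: the code itself
        if (radius >= 1) {
          for (int i = 0; i < 64; i++) {
            candidates.add(code ^ (1L << i));        // flip one bit
          }
        }
        if (radius >= 2) {
          for (int i = 0; i < 64; i++) {
            for (int j = i + 1; j < 64; j++) {
              candidates.add(code ^ (1L << i) ^ (1L << j));  // flip two bits
            }
          }
        }
        return candidates;
      }
    }

Radius 2 already means a couple of thousand exact-match lookups per query
(1 + 64 + 2016 = 2081 candidates for 64-bit codes), which is why this only
stays practical when the codes themselves are short.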