Return-Path: Delivered-To: apmail-mahout-user-archive@www.apache.org Received: (qmail 52077 invoked from network); 8 Jun 2010 23:17:04 -0000 Received: from unknown (HELO mail.apache.org) (140.211.11.3) by 140.211.11.9 with SMTP; 8 Jun 2010 23:17:04 -0000 Received: (qmail 3803 invoked by uid 500); 8 Jun 2010 23:17:04 -0000 Delivered-To: apmail-mahout-user-archive@mahout.apache.org Received: (qmail 3737 invoked by uid 500); 8 Jun 2010 23:17:04 -0000 Mailing-List: contact user-help@mahout.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: user@mahout.apache.org Delivered-To: mailing list user@mahout.apache.org Received: (qmail 3729 invoked by uid 99); 8 Jun 2010 23:17:04 -0000 Received: from nike.apache.org (HELO nike.apache.org) (192.87.106.230) by apache.org (qpsmtpd/0.29) with ESMTP; Tue, 08 Jun 2010 23:17:04 +0000 X-ASF-Spam-Status: No, hits=2.2 required=10.0 tests=FREEMAIL_FROM,HTML_MESSAGE,RCVD_IN_DNSWL_NONE,SPF_PASS,T_TO_NO_BRKTS_FREEMAIL X-Spam-Check-By: apache.org Received-SPF: pass (nike.apache.org: domain of jake.mannix@gmail.com designates 209.85.212.170 as permitted sender) Received: from [209.85.212.170] (HELO mail-px0-f170.google.com) (209.85.212.170) by apache.org (qpsmtpd/0.29) with ESMTP; Tue, 08 Jun 2010 23:16:57 +0000 Received: by pxi6 with SMTP id 6so2324906pxi.1 for ; Tue, 08 Jun 2010 16:16:36 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=domainkey-signature:received:mime-version:received:in-reply-to :references:from:date:message-id:subject:to:content-type; bh=1/kEzEWzmAmAcvl8VJFdkdtJNM9SrrJq+pxQdklR/IM=; b=XMKQ9yucgADlQJI9QUVvx4/k1NmLZvwsnRWv3Mf0yN+o66APhmKXaEwyvlEK+jI6Q/ ydUhNTmmKdkC63IQehZ4rnbp9WSHjYUbx2UJckKLfzpDZj7ztGpANFnVtsq+L7iOy8hl GqFsblPIsKONdAPTdKJ2XYMhmkdXU3qwC8CtY= DomainKey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=gamma; h=mime-version:in-reply-to:references:from:date:message-id:subject:to :content-type; b=hebrLBYNIyKogJukuuoCqnW04LUoBTO4bLbhf0OjO9HvKxJrUEr1hcbTjQ28y1Pd5F OBa0KFCosQzCklM28rG2UqqRvuEMCO7lbmmPZ3D/t+SJW2It7KRrYQ9mQJBpTL0VyYZu VP17CuGJsR/p4lqWQ/W8Hab2FK+jnG0cjVmFw= Received: by 10.142.74.1 with SMTP id w1mr117902wfa.258.1276038996134; Tue, 08 Jun 2010 16:16:36 -0700 (PDT) MIME-Version: 1.0 Received: by 10.115.32.8 with HTTP; Tue, 8 Jun 2010 16:16:16 -0700 (PDT) In-Reply-To: References: From: Jake Mannix Date: Tue, 8 Jun 2010 16:16:16 -0700 Message-ID: Subject: Re: Generating a Document Similarity Matrix To: user@mahout.apache.org Content-Type: multipart/alternative; boundary=001636e1fbfd158c8c04888cfb23 X-Virus-Checked: Checked by ClamAV on apache.org --001636e1fbfd158c8c04888cfb23 Content-Type: text/plain; charset=ISO-8859-1 On Tue, Jun 8, 2010 at 3:56 PM, Olivier Grisel wrote: > 2010/6/8 Jake Mannix : > > Hi Kris, > > > > If you generate a full document-document similarity matrix offline, and > > then make sure to sparsify the rows (trim off all similarities below a > > threshold, or only take the top N for each row, etc...). Then encoding > > these values directly in the index would indeed allow for *superfast* > > MoreLikeThis functionality, because you've already computed all > > of the similar results offline. > > For 10e6 documents if might not be reasonable to generate the complete > document-document similarity matrix: 1e12 components => a couple of > tera bytes of similarity values just to find the find the top N > afterwards: Nope, this isn't what happens in what I described: when you take a sparseDocumentMatrix.transpose().times(itself), the scaling does not go N^2*M, with N^2 outputs - the calculation is sparse, only computing the entries which are nonzero. If you pre-sparsify the documents a little (remove all terms which occur in more than 1% of all documents, or something like that), this sparse calculation is even faster - it scales as sum_{i=1...N}(k_i)^2, where k_i is the number of nonzero elements in document i. If all documents were the same length (k), then this scales as N*k^2, and the total number of nonzero entries in the output is far less than N^2 if k << N,M. Getting rid of the common terms (even *lots* of them) beforehand is still a very good idea. -jake --001636e1fbfd158c8c04888cfb23--