From: Dmitriy Lyubimov
To: dev@mahout.apache.org
Date: Mon, 13 Dec 2010 11:05:48 -0800
Subject: Re: Sequential access to VectorWritable content proposal.

Jake,

No, I was trying exactly what you were proposing some time ago on the list.

I am trying to keep long vectors from occupying a lot of memory. E.g., a dense vector with 1 million elements requires 8 MB just to load. What I am saying is that there are many sequential techniques for which a handler could inspect the vector element by element without having to preallocate those 8 MB.

For 1-million-element vectors this is not too scary, but it starts to become so under default Hadoop memory settings at around 50-100 MB (or 5-10 million non-zero elements). Stochastic SVD will survive that, but it leaves less memory for blocking, and the more blocks you have, the more CPU it requires. (CPU demand is only linear in the number of blocks, and only in a significantly smaller part of the computation, so only an insignificant part of the total CPU flops depends on the number of blocks; still, some part does.)

As I said, it would also address the case where rows don't fit in memory (hence no memory bound on n of A), but the most immediate benefit is to the speed, scalability, and memory requirements of SSVD in most practical LSI cases.

-Dmitriy

On Mon, Dec 13, 2010 at 10:24 AM, Jake Mannix wrote:

> Hey Dmitriy,
>
> I've also been playing around with a VectorWritable format which is backed
> by a SequenceFile, but I've been focussed on the case where it's essentially
> the entire matrix, and the rows don't fit into memory. This seems different
> than your current use case, however - you just want (relatively) small
> vectors to load faster, right?
>
> -jake
>
> On Mon, Dec 13, 2010 at 10:18 AM, Ted Dunning wrote:
>
> > Interesting idea.
> >
> > Would this introduce a new vector type that only allows iterating through
> > the elements once?
> >
> > On Mon, Dec 13, 2010 at 9:49 AM, Dmitriy Lyubimov wrote:
> >
> > > Hi all,
> > >
> > > I would like to submit a patch to VectorWritable that allows for
> > > streaming access to vector elements without having to prebuffer all of
> > > them first. (current code allows for the latter only).
> > >
> > > That patch would allow to strike down one of the memory usage issues in
> > > current Stochastic SVD implementation and effectively open memory bound
> > > for n of the SVD work. (The value i see is not to open up the bound
> > > though but just be more efficient in memory use, thus essentially
> > > speeding up the computation. )
> > >
> > > If it's ok, i would like to create a JIRA issue and provide a patch for
> > > it.
> > >
> > > Another issue is to provide an SSVD patch that depends on that patch for
> > > VectorWritable.
> > >
> > > Thank you.
> > > -Dmitriy
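
For illustration only, the element-by-element access discussed in this thread could look roughly like the sketch below. It is not the proposed patch and does not use Mahout's API: the ElementHandler interface, the StreamingVectorSketch class, and the on-wire layout (an int size followed by that many doubles) are all assumptions made up for this example. The point is only that a consumer such as a norm computation can visit every element while holding one double at a time, instead of preallocating 8 bytes per element (8 MB for 1 million elements).

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class StreamingVectorSketch {

    /** Hypothetical callback, invoked once per element; not part of Mahout. */
    public interface ElementHandler {
        void onElement(int index, double value);
    }

    /**
     * Serializes a dense vector as an int size followed by 'size' doubles.
     * This layout is an assumption for the sketch, not VectorWritable's format.
     */
    public static byte[] encode(double[] values) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        out.writeInt(values.length);
        for (double v : values) {
            out.writeDouble(v);
        }
        out.flush();
        return bytes.toByteArray();
    }

    /**
     * Reads the vector one element at a time and hands each element to the
     * handler, so no double[] of the full vector length is ever allocated.
     * Returns the number of elements visited.
     */
    public static int stream(DataInput in, ElementHandler handler) throws IOException {
        int size = in.readInt();
        for (int i = 0; i < size; i++) {
            handler.onElement(i, in.readDouble());
        }
        return size;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = encode(new double[] {3.0, 0.0, 4.0});
        double[] sumOfSquares = {0.0}; // single-slot accumulator captured by the lambda
        int n = stream(new DataInputStream(new ByteArrayInputStream(data)),
                (index, value) -> sumOfSquares[0] += value * value);
        System.out.println(n + " elements, squared norm = " + sumOfSquares[0]);
    }
}
```

A handler like this can only see the elements once and in order, which is the trade-off Ted's question points at: anything that needs random access or a second pass would still have to buffer the vector.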