Subject: Re: Which exact algorithm is used in the Mahout SGD?
From: Dmitriy Lyubimov
To: user@mahout.apache.org
Date: Tue, 31 May 2011 19:19:42 -0700

Interesting. I'd probably be interested to try it out.

On Thu, Apr 28, 2011 at 11:31 PM, Stanley Xu wrote:
> Thanks Ted and Lance. And sorry for the jargon.
>
> For the delay Ted mentioned, we have already considered that; still thanks a
> lot for all the detailed ideas, they were pretty helpful.
> For parallelized SGD, I just found a new paper about using DSGD in matrix
> factorization. It is different from logistic regression, but it might be
> helpful as well. The title is "Large-Scale Matrix Factorization with
> Distributed Stochastic Gradient Descent", if anyone is interested.
>
> Best wishes,
> Stanley Xu
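For concreteness, here is a minimal sketch of training Mahout's SGD classifier (OnlineLogisticRegression), assuming the 0.5-era API; the feature indices and hyperparameters are illustrative only, not taken from this thread.

import org.apache.mahout.classifier.sgd.L1;
import org.apache.mahout.classifier.sgd.OnlineLogisticRegression;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;

public class CtrSgdSketch {
  public static void main(String[] args) {
    int numFeatures = 1000;  // length of the hashed feature vector

    // Two categories (click / no-click), L1 prior for sparsity.
    OnlineLogisticRegression learner =
        new OnlineLogisticRegression(2, numFeatures, new L1())
            .learningRate(0.1)   // illustrative hyperparameters
            .lambda(1.0e-6);

    // One made-up training example: only non-zero features are set,
    // which is why training cost tracks the non-zero values rather
    // than the full vector length.
    Vector x = new RandomAccessSparseVector(numFeatures);
    x.setQuick(17, 1.0);
    x.setQuick(523, 1.0);

    learner.train(1, x);                        // 1 = clicked
    double pClick = learner.classifyScalar(x);  // P(click) for a 2-class model
    System.out.println("p(click) = " + pClick);
  }
}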
> On Fri, Apr 29, 2011 at 2:08 PM, Ted Dunning wrote:
>
>> Yes.
>>
>> Apologies for jargon and TLAs
>> <http://en.wikipedia.org/wiki/Three-letter_acronym>.
>>
>> On Thu, Apr 28, 2011 at 7:04 PM, Lance Norskog wrote:
>>
>> > CTR == Clickthrough Rate
>> >
>> > On Thu, Apr 28, 2011 at 12:06 PM, Ted Dunning wrote:
>> > > On Tue, Apr 26, 2011 at 8:00 PM, Stanley Xu wrote:
>> > >
>> > >> ... As I understood the algorithm, the training time relies only on
>> > >> the non-zero records, but per our test there is some overhead we
>> > >> could not ignore for those non-zero records, though the cost is
>> > >> sub-linear or logarithmic in the length of the hashed vector.
>> > >
>> > > This is pretty close if we say "non-zero values". A record usually
>> > > refers to an entire training example.
>> > >
>> > > The extra work refers mostly to deferred regularization that
>> > > eventually has to be applied. My guess is that it is even less than
>> > > log in the feature vector size.
>> > >
>> > >> And in CTR prediction, I am not quite sure it will converge very
>> > >> quickly.
>> > >
>> > > I was saying this purely based on the number of features.
>> > >
>> > >> Because we will very possibly see some records that have almost the
>> > >> same features but different results in display ads.
>> > >
>> > > The algorithm can still converge to an estimate of the probability
>> > > here.
>> > >
>> > >> But we will see the result in the future.
>> > >
>> > > You have to be *very* careful about this to avoid prejudicing the
>> > > model against recent impressions. If you have fast feedback into the
>> > > ad targeting system, you can get severe instability.
>> > >
>> > > The key thing you have to do to avoid these biases is to define a
>> > > maximum delay before click for the purposes of modeling. You need to
>> > > ignore all impressions younger than this delay (because they may
>> > > still get a click) and you need to ignore all clicks after this delay
>> > > (to avoid bias in favor of old impressions). For on-line ads you can
>> > > probably use a maximum delay of a few minutes because most clicks
>> > > will happen by then.
>> > >
>> > > To find a good value for the maximum delay, plot the CTR for a bunch
>> > > of ads versus delay. This will increase rapidly shortly after zero
>> > > delay, but then will level off. The ordering of ads by CTR is what
>> > > you care about, so you can follow the curves back and find the
>> > > shortest delay where the ordering is clearly preserved. Use that as
>> > > your maximum delay. Typically this is roughly where your CTR is at
>> > > about 80-90% of the final value.
>> > >
>> > >> (We are still working on creating a framework to dig all the
>> > >> features we need out of the logs; I would like to share our
>> > >> experience with Mahout SGD once we get our CTR prediction model
>> > >> released.)
>> > >>
>> > >> And for parallelized SGD, what do you mean by "help with sparse
>> > >> inputs that exhibit long-tail frequency distribution"? Would you
>> > >> like to share some of your ideas, Ted?
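A small sketch of the maximum-delay rule Ted describes above: impressions younger than the cutoff are skipped because a click may still arrive, and clicks later than the cutoff count as non-clicks. The Impression record and its field names are hypothetical, not part of Mahout.

import java.util.ArrayList;
import java.util.List;

public class DelayWindowLabeler {
  static class Impression {
    long shownAtMs;        // when the ad was shown
    Long clickedAtMs;      // click time, or null if no click seen yet
  }

  static class Labeled {
    Impression impression;
    int label;             // 1 = clicked within the window, 0 = did not
    Labeled(Impression i, int l) { impression = i; label = l; }
  }

  static List<Labeled> label(List<Impression> log, long nowMs, long maxDelayMs) {
    List<Labeled> out = new ArrayList<Labeled>();
    for (Impression imp : log) {
      if (nowMs - imp.shownAtMs < maxDelayMs) {
        continue;  // too young: a click may still arrive, so ignore it
      }
      // Clicks after the window are ignored to avoid biasing the model
      // in favor of old impressions.
      boolean clickedInWindow = imp.clickedAtMs != null
          && imp.clickedAtMs - imp.shownAtMs <= maxDelayMs;
      out.add(new Labeled(imp, clickedInWindow ? 1 : 0));
    }
    return out;
  }
}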
>> > >> Currently, what I can think of is to split the data randomly across
>> > >> multiple mappers, let every mapper learn from its local data, and
>> > >> then average the models, or let the multiple models vote on every
>> > >> feature's weight. A little like the idea of AdaBoost or RandomForest.
>> > >> But I am not a scientist or mathematician, so I have no idea if that
>> > >> is correct or not.
>> > >>
>> > >> Thanks so much.
>> > >> Stanley Xu
>> > >>
>> > >> On Tue, Apr 26, 2011 at 11:16 PM, Ted Dunning wrote:
>> > >>
>> > >> > On Mon, Apr 25, 2011 at 11:46 PM, Stanley Xu wrote:
>> > >> >
>> > >> > > 1 hour is acceptable, but I guess you misunderstood the data
>> > >> > > scale I meant here. The 900M records didn't mean 900M bytes, but
>> > >> > > 900M lines of training set (900M training examples). If every
>> > >> > > training example has 1000 dimensions, that means 900 million x
>> > >> > > 1000 x 16 B = 14TB. If we reduce the logs collected to 14 days,
>> > >> > > it would still be 2-3TB of data.
>> > >> >
>> > >> > Oops. Forgot that last multiplier.
>> > >> >
>> > >> > > Per our simple test, for 1000 dimensions and 10M lines of
>> > >> > > records, it takes about 1-2 hours to do the training, so 900M
>> > >> > > lines of data will cost at least 90 hours, is that correct?
>> > >> >
>> > >> > 10M x 1000 x 8 = 80 GB.
>> > >> >
>> > >> > 1-2 hours = (approx) 5000 seconds. So this is
>> > >> >
>> > >> > 80 GB / 5000 s = 80/5 MB/s = 16 MB/s
>> > >> >
>> > >> > Yes. This is a reasonable speed. I think you can get a small
>> > >> > factor faster than this with SGD. I have seen 100 million records
>> > >> > with more non-zero values than you describe train in 3 hours. I
>> > >> > would not expect even as much as a factor of 10 speedup here.
>> > >> >
>> > >> > > And from the slides you provided,
>> > >> > > http://www.slideshare.net/tdunning/sdforum-11042010, you said it
>> > >> > > would take less than an hour for 20M data records with mixed
>> > >> > > numeric/categorical dimensions. I am wondering, how many
>> > >> > > dimensions per record?
>> > >> >
>> > >> > These are sparse records with about a thousand non-zero elements
>> > >> > per record.
>> > >> >
>> > >> > But let's step back to your data for a moment. Where do these
>> > >> > thousand dimensions come from? Do you really have a thousand
>> > >> > hand-built features? Do you not have any sparse, text-like
>> > >> > features?
>> > >> >
>> > >> > If you really only have a thousand-dimensional problem, then I
>> > >> > think your model might exhibit early convergence.
>> > >> >
>> > >> > If not, it is quite possible to parallelize SGD, but this is only
>> > >> > likely to help with sparse inputs that exhibit a long-tail
>> > >> > frequency distribution.
>> >
>> > --
>> > Lance Norskog
>> > goksron@gmail.com
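A toy sketch of the shard-and-average idea Stanley floats above: run plain SGD for logistic regression independently on each shard, then average the weight vectors. Purely illustrative, and not how Mahout itself parallelizes anything.

public class AveragedSgd {
  // One plain SGD pass for logistic regression over a single shard.
  static double[] trainShard(double[][] x, int[] y, double rate) {
    double[] w = new double[x[0].length];
    for (int i = 0; i < x.length; i++) {
      double margin = 0.0;
      for (int j = 0; j < w.length; j++) margin += w[j] * x[i][j];
      double p = 1.0 / (1.0 + Math.exp(-margin));  // predicted P(y=1)
      for (int j = 0; j < w.length; j++) {
        w[j] += rate * (y[i] - p) * x[i][j];       // gradient step
      }
    }
    return w;
  }

  // Average the per-shard weight vectors into a single model; in a
  // MapReduce setting each trainShard call would run in its own mapper.
  static double[] average(double[][] shardWeights) {
    double[] avg = new double[shardWeights[0].length];
    for (double[] w : shardWeights) {
      for (int j = 0; j < avg.length; j++) {
        avg[j] += w[j] / shardWeights.length;
      }
    }
    return avg;
  }
}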