Subject: Re: MLlib - logistic regression with GD vs LBFGS, sparse vs dense benchmark result
From: David Hall <david.lw.hall@gmail.com>
Date: Wed, 23 Apr 2014 22:16:11 -0700
To: dev@spark.apache.org, dbtsai@dbtsai.com
Cc: shivaram@eecs.berkeley.edu, Xiangrui Meng

Was the weight vector sparse? The gradients? Or just the feature vectors?
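A quick sketch of why this matters (hand-rolled for illustration; this is not Breeze's or MLlib's actual kernel, and the helper name is made up): the sparse win comes from dot products like the one below touching only the non-zeros of the feature vector, which assumes the weight vector stays dense.

    // Dot product of a sparse feature vector x, stored as parallel
    // indices/values arrays (as in MLlib's SparseVector), against a
    // dense weight vector w. Only x's non-zeros are touched, which is
    // why sparse input should be cheaper when ~89% of entries are zero.
    def sparseDot(indices: Array[Int], values: Array[Double],
                  w: Array[Double]): Double = {
      var sum = 0.0
      var k = 0
      while (k < indices.length) {
        sum += values(k) * w(indices(k))
        k += 1
      }
      sum
    }

If the weights or the gradient are themselves sparse, updates like w += step * gradient can fall off this fast path, hence the question.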
On Wed, Apr 23, 2014 at 10:08 PM, DB Tsai wrote:

> The figure showing the log-likelihood vs. time can be found here:
>
> https://github.com/dbtsai/spark-lbfgs-benchmark/raw/fd703303fb1c16ef5714901739154728550becf4/result/a9a11M.pdf
>
> Let me know if you cannot open it.
>
> Sincerely,
>
> DB Tsai
> -------------------------------------------------------
> My Blog: https://www.dbtsai.com
> LinkedIn: https://www.linkedin.com/in/dbtsai
>
>
> On Wed, Apr 23, 2014 at 9:34 PM, Shivaram Venkataraman
> <shivaram@eecs.berkeley.edu> wrote:
>
>> I don't think the attachment came through on the list. Could you upload
>> the results somewhere and link to them?
>>
>>
>> On Wed, Apr 23, 2014 at 9:32 PM, DB Tsai wrote:
>>
>>> 123 features per row, and on average, 89% are zeros.
>>>
>>> On Apr 23, 2014 9:31 PM, "Evan Sparks" wrote:
>>>
>>>> What is the number of non-zeros per row (and the number of features)
>>>> in the sparse case? We've hit some issues with Breeze sparse support
>>>> in the past, but for sufficiently sparse data it's still pretty good.
>>>>
>>>>> On Apr 23, 2014, at 9:21 PM, DB Tsai wrote:
>>>>>
>>>>> Hi all,
>>>>>
>>>>> I'm benchmarking logistic regression in MLlib using the newly added
>>>>> optimizers, LBFGS and GD. I'm using the same dataset and the same
>>>>> methodology as in this paper: http://www.csie.ntu.edu.tw/~cjlin/papers/l1.pdf
>>>>>
>>>>> I want to know how Spark scales as workers are added, and how the
>>>>> optimizers and the input format (sparse or dense) impact performance.
>>>>>
>>>>> The benchmark code can be found here:
>>>>> https://github.com/dbtsai/spark-lbfgs-benchmark
>>>>>
>>>>> The first dataset I benchmarked is a9a, which is only 2.2MB. I
>>>>> duplicated the dataset to 762MB, giving 11M rows. This dataset has
>>>>> 123 features, and 11% of the entries are non-zero.
>>>>>
>>>>> In this benchmark, the entire dataset is cached in memory.
>>>>>
>>>>> As we expect, LBFGS converges faster than GD, and at some point, no
>>>>> matter how hard we push GD, it converges more and more slowly.
>>>>>
>>>>> However, it's surprising that the sparse format runs slower than the
>>>>> dense format. I did see that the sparse format takes a significantly
>>>>> smaller amount of memory when caching the RDD, but sparse is 40%
>>>>> slower than dense. I think sparse should be fast: when we compute
>>>>> x^T w, since x is sparse, we can skip the zero entries. I wonder if
>>>>> there is anything I'm doing wrong.
>>>>>
>>>>> The attachment is the benchmark result.
>>>>>
>>>>> Thanks.
>>>>>
>>>>> Sincerely,
>>>>>
>>>>> DB Tsai
>>>>> -------------------------------------------------------
>>>>> My Blog: https://www.dbtsai.com
>>>>> LinkedIn: https://www.linkedin.com/in/dbtsai
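For anyone wanting to reproduce this kind of comparison, here is a minimal driver sketch against the Spark 1.0-era MLlib optimization API. The dataset path, SparkContext settings, and parameter values are illustrative, not the benchmark's actual configuration; see https://github.com/dbtsai/spark-lbfgs-benchmark for the real code.

    import org.apache.spark.SparkContext
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.optimization.{GradientDescent, LBFGS, LogisticGradient, SquaredL2Updater}
    import org.apache.spark.mllib.util.MLUtils

    // Load a LibSVM-format dataset (e.g. a9a) and cache it, since both
    // optimizers make many passes over the data.
    val sc = new SparkContext("local[4]", "lbfgs-vs-gd")
    val data = MLUtils.loadLibSVMFile(sc, "data/a9a")
      .map(p => (p.label, p.features))
      .cache()

    val numFeatures = data.first()._2.size
    val initialWeights = Vectors.dense(new Array[Double](numFeatures))
    val gradient = new LogisticGradient()
    val updater = new SquaredL2Updater()

    // L-BFGS: quasi-Newton, typically converges in far fewer passes.
    val (wLbfgs, lossLbfgs) = LBFGS.runLBFGS(
      data, gradient, updater,
      numCorrections = 10, convergenceTol = 1e-9,
      maxNumIterations = 50, regParam = 0.0, initialWeights)

    // Batch gradient descent (miniBatchFraction = 1.0 uses the full data).
    val (wGd, lossGd) = GradientDescent.runMiniBatchSGD(
      data, gradient, updater,
      stepSize = 1.0, numIterations = 50, regParam = 0.0,
      miniBatchFraction = 1.0, initialWeights)

    // lossLbfgs and lossGd are the per-iteration loss histories, which is
    // the data behind a log-likelihood-vs-time plot.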