spark-dev mailing list archives

From DB Tsai <dbt...@stanford.edu>
Subject Re: MLlib - logistic regression with GD vs LBFGS, sparse vs dense benchmark result
Date Thu, 24 Apr 2014 05:35:14 GMT
The figure showing log-likelihood vs. time can be found here:

https://github.com/dbtsai/spark-lbfgs-benchmark/raw/fd703303fb1c16ef5714901739154728550becf4/result/a9a11M.pdf

Let me know if you cannot open it. Thanks.

Sincerely,

DB Tsai
-------------------------------------------------------
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai


On Wed, Apr 23, 2014 at 9:34 PM, Shivaram Venkataraman
<shivaram@eecs.berkeley.edu> wrote:
> I don't think the attachment came through on the list. Could you upload the
> results somewhere and link to them?
>
>
> On Wed, Apr 23, 2014 at 9:32 PM, DB Tsai <dbtsai@dbtsai.com> wrote:
>>
>> 123 features per row, and on average, 89% are zeros.
>> On Apr 23, 2014 9:31 PM, "Evan Sparks" <evan.sparks@gmail.com> wrote:
>>
>> > What is the number of non-zeros per row (and the number of features) in
>> > the sparse case? We've hit some issues with breeze sparse support in the
>> > past, but for sufficiently sparse data it's still pretty good.
>> >
>> > > On Apr 23, 2014, at 9:21 PM, DB Tsai <dbtsai@stanford.edu> wrote:
>> > >
>> > > Hi all,
>> > >
>> > > I'm benchmarking logistic regression in MLlib using the newly added
>> > > LBFGS optimizer and GD. I'm using the same dataset and the same
>> > > methodology as in this paper: http://www.csie.ntu.edu.tw/~cjlin/papers/l1.pdf
>> > >
>> > > I want to know how Spark scales as workers are added, and how the
>> > > optimizers and the input format (sparse or dense) impact performance.
>> > >
>> > > The benchmark code can be found here:
>> > > https://github.com/dbtsai/spark-lbfgs-benchmark
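>> > >
>> > > In spark-shell, the comparison boils down to roughly the following (a
>> > > minimal sketch against the Spark 1.0 MLlib API; the path and all the
>> > > parameter values below are placeholders, the settings actually used
>> > > are in the repo):
>> > >
>> > >   import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
>> > >   import org.apache.spark.mllib.linalg.Vectors
>> > >   import org.apache.spark.mllib.optimization.{LBFGS, LogisticGradient, SquaredL2Updater}
>> > >   import org.apache.spark.mllib.util.MLUtils
>> > >
>> > >   // a9a is in LIBSVM format; cache it so I/O stays out of the timing.
>> > >   val points = MLUtils.loadLibSVMFile(sc, "data/a9a").cache()
>> > >
>> > >   // GD: plain (mini-batch) gradient descent.
>> > >   val gdModel = LogisticRegressionWithSGD.train(
>> > >     points, 100 /* iterations */, 1.0 /* step size */, 1.0 /* mini-batch fraction */)
>> > >
>> > >   // LBFGS takes (label, features) pairs and returns the weights plus
>> > >   // the loss at every iteration, which is what the log-likelihood
>> > >   // vs. time plot is built from.
>> > >   val numFeatures = points.first().features.size
>> > >   val (weights, lossHistory) = LBFGS.runLBFGS(
>> > >     points.map(p => (p.label, p.features)),
>> > >     new LogisticGradient(),
>> > >     new SquaredL2Updater(),
>> > >     10 /* corrections */, 1e-4 /* tolerance */, 100 /* max iterations */,
>> > >     0.0 /* regParam */, Vectors.dense(new Array[Double](numFeatures)))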
>> > >
>> > > The first dataset I benchmarked is a9a, which is only 2.2MB. I
>> > > duplicated it to 762MB so that it has 11M rows. The dataset has 123
>> > > features, and 11% of the entries are non-zero.
>> > >
>> > > In this benchmark, the entire dataset is cached in memory.
>> > >
>> > > As expected, LBFGS converges faster than GD, and beyond a certain
>> > > point, no matter how we tune GD, its convergence keeps slowing down.
>> > >
>> > > However, it's surprising that the sparse format runs slower than the
>> > > dense format. The sparse format does take significantly less memory
>> > > when caching the RDD, but it is 40% slower than dense. I expected
>> > > sparse to be faster: when we compute the dot product x·w^T, the
>> > > sparsity of x lets us skip all the zero entries. I wonder if there is
>> > > anything I'm doing wrong.
>> > >
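>> > > To put numbers on that expectation: with 123 features and ~11%
>> > > non-zeros, each row carries only ~14 stored entries, so x·w^T should
>> > > cost ~14 multiply-adds instead of 123. The skip-the-zeros loop looks
>> > > roughly like this (a minimal sketch over plain index/value arrays,
>> > > which is also how breeze's SparseVector lays out its data):
>> > >
>> > >   // Touch only the stored non-zeros; zero entries contribute nothing
>> > >   // to the sum, so the cost is O(nnz) instead of O(numFeatures).
>> > >   def sparseDot(indices: Array[Int], values: Array[Double],
>> > >                 w: Array[Double]): Double = {
>> > >     var sum = 0.0
>> > >     var i = 0
>> > >     while (i < indices.length) {
>> > >       sum += values(i) * w(indices(i))
>> > >       i += 1
>> > >     }
>> > >     sum
>> > >   }
>> > >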
>> > > The attachment is the benchmark result.
>> > >
>> > > Thanks.
>> > >
>> > > Sincerely,
>> > >
>> > > DB Tsai
>> > > -------------------------------------------------------
>> > > My Blog: https://www.dbtsai.com
>> > > LinkedIn: https://www.linkedin.com/in/dbtsai
>> >
>
>
