spark-dev mailing list archives

From DB Tsai <dbt...@stanford.edu>
Subject Re: MLlib - logistic regression with GD vs LBFGS, sparse vs dense benchmark result
Date Fri, 25 Apr 2014 21:57:46 GMT
Another interesting benchmark.

*News20 dataset - 0.14M rows, 1,355,191 features, 0.034% non-zero elements.*

LBFGS converges in 70 seconds, while GD seems not to be making progress.

The dense feature vectors would be too big to fit in memory (back of the
envelope: 0.14M rows x 1,355,191 features x 8 bytes is roughly 1.5TB), so I
only conducted the sparse benchmark.
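
For anyone who wants to reproduce this, a minimal sketch of the sparse setup
(illustrative only; the path and app name are placeholders, and the actual
benchmark code lives in the spark-lbfgs-benchmark repo):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.mllib.util.MLUtils

    val sc = new SparkContext(new SparkConf().setAppName("lbfgs-benchmark"))
    // loadLibSVMFile parses the features into SparseVectors, so only the
    // non-zero elements are stored when the RDD is cached.
    val points = MLUtils.loadLibSVMFile(sc, "hdfs:///data/news20.binary").cache()
    points.count()  // materialize the cache before starting any timer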

I saw that the loss sometimes bumps up, which is weird to me. Since the cost
function of logistic regression is convex, the loss should be monotonically
decreasing. David, any suggestions?
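
One thing worth checking (a minimal diagnostic sketch; fullDataLogLoss is a
hypothetical helper, not something in the benchmark repo): recompute the loss
over the full cached dataset after each iteration. If the bumps disappear
there, they are an artifact of evaluating lossSum on the sampled mini-batch
rather than on the full data.

    import org.apache.spark.SparkContext._
    import org.apache.spark.rdd.RDD
    import org.apache.spark.mllib.linalg.{DenseVector, SparseVector}
    import org.apache.spark.mllib.regression.LabeledPoint

    // Average logistic loss over the FULL dataset, touching only the
    // non-zero entries of sparse feature vectors.
    def fullDataLogLoss(data: RDD[LabeledPoint], weights: Array[Double]): Double = {
      val n = data.count()
      val totalLoss = data.map { p =>
        val margin = p.features match {
          case sv: SparseVector =>
            var dot = 0.0
            var k = 0
            while (k < sv.indices.length) { dot += weights(sv.indices(k)) * sv.values(k); k += 1 }
            dot
          case dv: DenseVector =>
            var dot = 0.0
            var k = 0
            while (k < dv.size) { dot += weights(k) * dv.values(k); k += 1 }
            dot
        }
        val y = 2.0 * p.label - 1.0        // map {0, 1} labels to {-1, +1}
        math.log1p(math.exp(-y * margin))  // log(1 + exp(-y * (w . x)))
      }.sum()
      totalLoss / n
    }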

The detailed figure:
https://github.com/dbtsai/spark-lbfgs-benchmark/raw/0b774682e398b4f7e0ce01a69c44000eb0e73454/result/news20.pdf


*Rcv1 dataset - 6.8M rows, 677,399 features, 0.15% non-zero elements.*

LBFGS converges in 25 seconds, while GD again seems not to be making progress.

I only conducted the sparse benchmark, for the same reason. I also saw the
loss bump up, for reasons unknown.

The detailed figure:
https://github.com/dbtsai/spark-lbfgs-benchmark/raw/0b774682e398b4f7e0ce01a69c44000eb0e73454/result/rcv1.pdf


Sincerely,

DB Tsai
-------------------------------------------------------
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai


On Thu, Apr 24, 2014 at 2:36 PM, DB Tsai <dbtsai@stanford.edu> wrote:

> rcv1.binary is extremely sparse (0.15% non-zero elements), so the dense
> format will not run due to out of memory (back of the envelope: 6.8M rows x
> 677,399 features x 8 bytes is roughly 37TB dense). But the sparse format
> runs really well.
>
>
> Sincerely,
>
> DB Tsai
> -------------------------------------------------------
> My Blog: https://www.dbtsai.com
> LinkedIn: https://www.linkedin.com/in/dbtsai
>
>
> On Thu, Apr 24, 2014 at 1:54 PM, DB Tsai <dbtsai@stanford.edu> wrote:
>
>> I put the timer in runMiniBatchSGD right after val numExamples =
>> data.count(). See the code below. I'm running the rcv1 dataset now and
>> will update soon.
>>
>>     val startTime = System.nanoTime()
>>     for (i <- 1 to numIterations) {
>>       // Sample a subset (fraction miniBatchFraction) of the total data
>>       // compute and sum up the subgradients on this subset (this is one map-reduce)
>>       val (gradientSum, lossSum) = data.sample(false, miniBatchFraction, 42 + i)
>>         .aggregate((BDV.zeros[Double](weights.size), 0.0))(
>>           seqOp = (c, v) => (c, v) match { case ((grad, loss), (label, features)) =>
>>             val l = gradient.compute(features, label, weights, Vectors.fromBreeze(grad))
>>             (grad, loss + l)
>>           },
>>           combOp = (c1, c2) => (c1, c2) match { case ((grad1, loss1), (grad2, loss2)) =>
>>             (grad1 += grad2, loss1 + loss2)
>>           })
>>
>>       /**
>>        * NOTE(Xinghao): lossSum is computed using the weights from the previous
>>        * iteration, and regVal is the regularization value computed in the
>>        * previous iteration as well.
>>        */
>>       stochasticLossHistory.append(lossSum / miniBatchSize + regVal)
>>       val update = updater.compute(
>>         weights, Vectors.fromBreeze(gradientSum / miniBatchSize),
>>         stepSize, i, regParam)
>>       weights = update._1
>>       regVal = update._2
>>       timeStamp.append(System.nanoTime() - startTime)
>>     }
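>>
>> For reference, the LBFGS side of a benchmark like this goes through the
>> LBFGS optimizer that recently went into MLlib. A sketch with illustrative
>> parameter values (not necessarily the ones the benchmark uses; data and
>> numFeatures are assumed inputs):
>>
>>     import org.apache.spark.mllib.linalg.Vectors
>>     import org.apache.spark.mllib.optimization.{LBFGS, LogisticGradient, SquaredL2Updater}
>>
>>     // data: RDD[(Double, Vector)] of (label, features), already cached;
>>     // numFeatures is the feature dimension of the dataset.
>>     val initialWeights = Vectors.dense(new Array[Double](numFeatures))
>>     val (lbfgsWeights, lossHistory) = LBFGS.runLBFGS(
>>       data,
>>       new LogisticGradient(),
>>       new SquaredL2Updater(),
>>       10,    // numCorrections: history size for the two-loop recursion
>>       1e-4,  // convergenceTol
>>       100,   // maxNumIterations
>>       0.0,   // regParam
>>       initialWeights)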
>>
>> Sincerely,
>>
>> DB Tsai
>> -------------------------------------------------------
>> My Blog: https://www.dbtsai.com
>> LinkedIn: https://www.linkedin.com/in/dbtsai
>>
>>
>> On Thu, Apr 24, 2014 at 1:44 PM, Xiangrui Meng <mengxr@gmail.com> wrote:
>>
>>> I don't understand why sparse falls behind dense so much at the very
>>> first iteration. I didn't see count() being called in
>>> https://github.com/dbtsai/spark-lbfgs-benchmark/blob/master/src/main/scala/org/apache/spark/mllib/benchmark/BinaryLogisticRegression.scala
>>> Maybe you have local uncommitted changes.
>>>
>>> Best,
>>> Xiangrui
>>>
>>> On Thu, Apr 24, 2014 at 11:26 AM, DB Tsai <dbtsai@stanford.edu> wrote:
>>> > Hi Xiangrui,
>>> >
>>> > Yes, I'm using yarn-cluster mode, and I did check that the number of
>>> > executors I specified matches the number actually running.
>>> >
>>> > For caching and materialization, I start the timer in the optimizer
>>> > after calling count(); as a result, the time spent materializing the
>>> > cache isn't included in the benchmark.
>>> >
>>> > The difference you saw actually comes from the dense versus sparse
>>> > feature vectors: with dense features, you can see that the first
>>> > iteration takes the same time for both LBFGS and GD.
>>> >
>>> > I'm going to run rcv1.binary, which has only 0.15% non-zero elements,
>>> > to verify the hypothesis.
>>> >
>>> >
>>> > Sincerely,
>>> >
>>> > DB Tsai
>>> > -------------------------------------------------------
>>> > My Blog: https://www.dbtsai.com
>>> > LinkedIn: https://www.linkedin.com/in/dbtsai
>>> >
>>> >
>>> > On Thu, Apr 24, 2014 at 1:09 AM, Xiangrui Meng <mengxr@gmail.com> wrote:
>>> >>
>>> >> Hi DB,
>>> >>
>>> >> I saw you are using yarn-cluster mode for the benchmark. I tested the
>>> >> yarn-cluster mode and found that YARN does not always give you the
>>> >> exact number of executors requested. Just want to confirm that you've
>>> >> checked the number of executors.
>>> >>
>>> >> The second thing to check: in the benchmark code, after you call
>>> >> cache(), you should also call count() to materialize the RDD. I saw in
>>> >> the results that the real difference is actually at the first step.
>>> >> Adding an intercept is not a cheap operation for sparse vectors.
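>>> >>
>>> >> To illustrate that cost (a hypothetical sketch, not the actual MLlib
>>> >> code): appending a bias term to a SparseVector means rebuilding both
>>> >> the index and value arrays, so it is O(nnz) per row:
>>> >>
>>> >>     import org.apache.spark.mllib.linalg.SparseVector
>>> >>
>>> >>     def appendBias(sv: SparseVector): SparseVector = {
>>> >>       val indices = new Array[Int](sv.indices.length + 1)
>>> >>       val values = new Array[Double](sv.values.length + 1)
>>> >>       System.arraycopy(sv.indices, 0, indices, 0, sv.indices.length)
>>> >>       System.arraycopy(sv.values, 0, values, 0, sv.values.length)
>>> >>       indices(sv.indices.length) = sv.size  // the bias gets the new last index
>>> >>       values(sv.values.length) = 1.0
>>> >>       new SparseVector(sv.size + 1, indices, values)
>>> >>     }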
>>> >>
>>> >> Best,
>>> >> Xiangrui
>>> >>
>>> >> On Thu, Apr 24, 2014 at 12:53 AM, Xiangrui Meng <mengxr@gmail.com> wrote:
>>> >> > I don't think it is easy to make sparse faster than dense with this
>>> >> > sparsity and feature dimension. You can try rcv1.binary, which should
>>> >> > show the difference easily.
>>> >> >
>>> >> > David, the breeze operators used here are
>>> >> >
>>> >> > 1. DenseVector dot SparseVector
>>> >> > 2. axpy DenseVector SparseVector
>>> >> >
>>> >> > However, the SparseVector is passed in as Vector[Double] instead of
>>> >> > SparseVector[Double]. It might use the axpy impl of [DenseVector,
>>> >> > Vector] and call activeIterator. I didn't check whether you used
>>> >> > multimethods on axpy.
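>>> >> >
>>> >> > A small sketch of the dispatch question (breeze-version dependent, so
>>> >> > treat it as an illustration rather than a confirmed diagnosis):
>>> >> >
>>> >> >     import breeze.linalg.{axpy, DenseVector, SparseVector, Vector}
>>> >> >
>>> >> >     val dense = DenseVector.zeros[Double](5)
>>> >> >     val sparse = SparseVector(5)(1 -> 2.0, 3 -> 4.0)
>>> >> >
>>> >> >     // The static type decides which implementation breeze selects:
>>> >> >     val asVector: Vector[Double] = sparse
>>> >> >     axpy(1.0, asVector, dense)  // may fall back to a generic [Vector, DenseVector] impl
>>> >> >     axpy(1.0, sparse, dense)    // hits the specialized [SparseVector, DenseVector] impl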
>>> >> >
>>> >> > Best,
>>> >> > Xiangrui
>>> >> >
>>> >> >> On Wed, Apr 23, 2014 at 10:35 PM, DB Tsai <dbtsai@stanford.edu> wrote:
>>> >> >> The figure showing the Log-Likelihood vs Time can be found here:
>>> >> >>
>>> >> >> https://github.com/dbtsai/spark-lbfgs-benchmark/raw/fd703303fb1c16ef5714901739154728550becf4/result/a9a11M.pdf
>>> >> >>
>>> >> >> Let me know if you cannot open it. Thanks.
>>> >> >>
>>> >> >> Sincerely,
>>> >> >>
>>> >> >> DB Tsai
>>> >> >> -------------------------------------------------------
>>> >> >> My Blog: https://www.dbtsai.com
>>> >> >> LinkedIn: https://www.linkedin.com/in/dbtsai
>>> >> >>
>>> >> >>
>>> >> >> On Wed, Apr 23, 2014 at 9:34 PM, Shivaram Venkataraman
>>> >> >> <shivaram@eecs.berkeley.edu> wrote:
>>> >> >>> I don't think the attachment came through on the list. Could you
>>> >> >>> upload the results somewhere and link to them?
>>> >> >>>
>>> >> >>>
>>> >> >>>> On Wed, Apr 23, 2014 at 9:32 PM, DB Tsai <dbtsai@dbtsai.com> wrote:
>>> >> >>>>
>>> >> >>>> 123 features per row, and on average, 89% are zeros.
>>> >> >>>> On Apr 23, 2014 9:31 PM, "Evan Sparks" <evan.sparks@gmail.com> wrote:
>>> >> >>>>
>>> >> >>>> > What is the number of non-zeros per row (and the number of
>>> >> >>>> > features) in the sparse case? We've hit some issues with breeze
>>> >> >>>> > sparse support in the past, but for sufficiently sparse data it's
>>> >> >>>> > still pretty good.
>>> >> >>>> >
>>> >> >>>> > > On Apr 23, 2014, at 9:21 PM, DB Tsai <dbtsai@stanford.edu> wrote:
>>> >> >>>> > >
>>> >> >>>> > > Hi all,
>>> >> >>>> > >
>>> >> >>>> > > I'm benchmarking Logistic Regression in MLlib using the newly
>>> >> >>>> > > added optimizers LBFGS and GD. I'm using the same dataset and
>>> >> >>>> > > the same methodology as in this paper:
>>> >> >>>> > > http://www.csie.ntu.edu.tw/~cjlin/papers/l1.pdf
>>> >> >>>> > >
>>> >> >>>> > > I want to know how Spark scales as workers are added, and how
>>> >> >>>> > > the optimizers and input format (sparse or dense) impact
>>> >> >>>> > > performance.
>>> >> >>>> > >
>>> >> >>>> > > The benchmark code can be found here:
>>> >> >>>> > > https://github.com/dbtsai/spark-lbfgs-benchmark
>>> >> >>>> > >
>>> >> >>>> > > The first dataset I benchmarked is a9a, which is only 2.2MB. I
>>> >> >>>> > > duplicated the dataset to make it 762MB, with 11M rows. This
>>> >> >>>> > > dataset has 123 features, and 11% of the elements are non-zero.
>>> >> >>>> > >
>>> >> >>>> > > In this benchmark, the entire dataset is cached in memory.
>>> >> >>>> > >
>>> >> >>>> > > As we expected, LBFGS converges faster than GD, and past some
>>> >> >>>> > > point, no matter how hard we push GD, it converges more and
>>> >> >>>> > > more slowly.
>>> >> >>>> > >
>>> >> >>>> > > However, it's surprising that the sparse format runs slower
>>> >> >>>> > > than the dense format. I did see that the sparse format takes a
>>> >> >>>> > > significantly smaller amount of memory when caching the RDD,
>>> >> >>>> > > but sparse is 40% slower than dense. I think sparse should be
>>> >> >>>> > > fast: when we compute x . w^T, since x is sparse, the dot
>>> >> >>>> > > product only needs to touch the non-zero elements. I wonder if
>>> >> >>>> > > there is anything I'm doing wrong.
>>> >> >>>> > >
>>> >> >>>> > > The attachment is the benchmark result.
>>> >> >>>> > >
>>> >> >>>> > > Thanks.
>>> >> >>>> > >
>>> >> >>>> > > Sincerely,
>>> >> >>>> > >
>>> >> >>>> > > DB Tsai
>>> >> >>>> > > -------------------------------------------------------
>>> >> >>>> > > My Blog: https://www.dbtsai.com
>>> >> >>>> > > LinkedIn: https://www.linkedin.com/in/dbtsai
>>> >> >>>> >
>>> >> >>>
>>> >> >>>
>>> >
>>> >
>>>
>>
>>
>
