spark-dev mailing list archives

From DB Tsai <dbt...@stanford.edu>
Subject Re: MLlib - logistic regression with GD vs LBFGS, sparse vs dense benchmark result
Date Thu, 24 Apr 2014 21:36:59 GMT
rcv1.binary is too sparse (0.15% non-zero elements), so the dense format will
not run due to out-of-memory errors, but the sparse format runs really well.
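A back-of-the-envelope calculation shows why: at 0.15% density, dense storage needs hundreds of times the memory of sparse storage. A sketch, where the row count and per-element byte costs (8 bytes per Double value, 4 bytes per Int index) are assumptions for illustration, not measured JVM footprints:

```scala
// Rough memory estimate for rcv1.binary-style data (47,236 features,
// ~0.15% non-zero). Numbers are illustrative, not measured.
object SparsityMemory {
  // Dense storage: every feature of every row is materialized.
  def denseBytes(numRows: Long, numFeatures: Long): Long =
    numRows * numFeatures * 8L

  // Sparse storage: one (index, value) pair per non-zero entry.
  def sparseBytes(numRows: Long, numFeatures: Long, nnzFraction: Double): Long = {
    val nnzPerRow = (numFeatures * nnzFraction).toLong
    numRows * nnzPerRow * (8L + 4L)
  }

  def main(args: Array[String]): Unit = {
    val rows = 677399L // assumed row count, for illustration only
    val features = 47236L
    println(f"dense:  ${denseBytes(rows, features) / 1e9}%.0f GB")
    println(f"sparse: ${sparseBytes(rows, features, 0.0015) / 1e9}%.2f GB")
  }
}
```

Under these assumptions the dense layout needs hundreds of GB while the sparse layout fits in well under 1 GB, which is consistent with dense running out of memory.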


Sincerely,

DB Tsai
-------------------------------------------------------
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai


On Thu, Apr 24, 2014 at 1:54 PM, DB Tsai <dbtsai@stanford.edu> wrote:

> I start the timer in runMiniBatchSGD after val numExamples = data.count()
>
> See the following. I'm running the rcv1 dataset now and will update soon.
>
>     val startTime = System.nanoTime()
>     for (i <- 1 to numIterations) {
>       // Sample a subset (fraction miniBatchFraction) of the total data and
>       // compute and sum up the subgradients on this subset (this is one
>       // map-reduce).
>       val (gradientSum, lossSum) = data.sample(false, miniBatchFraction, 42 + i)
>         .aggregate((BDV.zeros[Double](weights.size), 0.0))(
>           seqOp = (c, v) => (c, v) match { case ((grad, loss), (label, features)) =>
>             val l = gradient.compute(features, label, weights, Vectors.fromBreeze(grad))
>             (grad, loss + l)
>           },
>           combOp = (c1, c2) => (c1, c2) match { case ((grad1, loss1), (grad2, loss2)) =>
>             (grad1 += grad2, loss1 + loss2)
>           })
>
>       /**
>        * NOTE(Xinghao): lossSum is computed using the weights from the previous
>        * iteration, and regVal is the regularization value computed in the
>        * previous iteration as well.
>        */
>       stochasticLossHistory.append(lossSum / miniBatchSize + regVal)
>       val update = updater.compute(
>         weights, Vectors.fromBreeze(gradientSum / miniBatchSize), stepSize, i, regParam)
>       weights = update._1
>       regVal = update._2
>       timeStamp.append(System.nanoTime() - startTime)
>     }
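For readers following along, the sample-then-aggregate step above can be mimicked locally without Spark. This sketch uses plain Scala collections and a squared-loss gradient as a stand-in for gradient.compute; the names are illustrative, not MLlib's API:

```scala
import scala.util.Random

// Local analogue of one mini-batch step: sample a fraction of the data,
// then fold per-example gradients and losses, mirroring RDD.aggregate's
// seqOp (accumulate one example into the running gradient/loss pair).
object MiniBatchSketch {
  type Example = (Double, Array[Double]) // (label, features)

  def dot(a: Array[Double], b: Array[Double]): Double = {
    var s = 0.0; var i = 0
    while (i < a.length) { s += a(i) * b(i); i += 1 }
    s
  }

  // Gradient and loss of 0.5 * (w.x - y)^2 for one example.
  def gradAndLoss(w: Array[Double], ex: Example): (Array[Double], Double) = {
    val (y, x) = ex
    val margin = dot(w, x) - y
    (x.map(_ * margin), 0.5 * margin * margin)
  }

  def miniBatchStep(data: Seq[Example], w: Array[Double],
                    fraction: Double, seed: Long): (Array[Double], Double, Int) = {
    val rng = new Random(seed)
    val batch = data.filter(_ => rng.nextDouble() < fraction) // like data.sample
    val zero = (Array.fill(w.length)(0.0), 0.0)
    val (gradSum, lossSum) = batch.foldLeft(zero) { case ((g, l), ex) =>
      val (gi, li) = gradAndLoss(w, ex)
      var i = 0
      while (i < g.length) { g(i) += gi(i); i += 1 } // accumulate in place
      (g, l + li)
    }
    (gradSum, lossSum, batch.size)
  }
}
```

As in the Spark code, the loss is evaluated with the weights from before the update, so the first recorded loss reflects the initial weights.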
>
>
>
>
>
>
> Sincerely,
>
> DB Tsai
> -------------------------------------------------------
> My Blog: https://www.dbtsai.com
> LinkedIn: https://www.linkedin.com/in/dbtsai
>
>
> On Thu, Apr 24, 2014 at 1:44 PM, Xiangrui Meng <mengxr@gmail.com> wrote:
>
>> I don't understand why sparse falls behind dense so much at the very
>> first iteration. I didn't see count() being called in
>> https://github.com/dbtsai/spark-lbfgs-benchmark/blob/master/src/main/scala/org/apache/spark/mllib/benchmark/BinaryLogisticRegression.scala
>> Maybe you have local uncommitted changes.
>>
>> Best,
>> Xiangrui
>>
>> On Thu, Apr 24, 2014 at 11:26 AM, DB Tsai <dbtsai@stanford.edu> wrote:
>> > Hi Xiangrui,
>> >
>> > Yes, I'm using yarn-cluster mode, and I did check that the number of
>> > executors I specified matches the actual number of running executors.
>> >
>> > For caching and materialization, I have the timer in the optimizer after
>> > calling count(), so the time for materializing the cache isn't included
>> > in the benchmark.
>> >
>> > The difference you saw actually comes from dense vs. sparse feature
>> > vectors. For LBFGS with dense features, you can see the first iteration
>> > takes the same time; the same is true for GD.
>> >
>> > I'm going to run rcv1.binary, which has only 0.15% non-zero elements, to
>> > verify the hypothesis.
>> >
>> >
>> > Sincerely,
>> >
>> > DB Tsai
>> > -------------------------------------------------------
>> > My Blog: https://www.dbtsai.com
>> > LinkedIn: https://www.linkedin.com/in/dbtsai
>> >
>> >
>> > On Thu, Apr 24, 2014 at 1:09 AM, Xiangrui Meng <mengxr@gmail.com>
>> wrote:
>> >>
>> >> Hi DB,
>> >>
>> >> I saw you are using yarn-cluster mode for the benchmark. I tested the
>> >> yarn-cluster mode and found that YARN does not always give you the
>> >> exact number of executors requested. Just want to confirm that you've
>> >> checked the number of executors.
>> >>
>> >> The second thing to check is that in the benchmark code, after you
>> >> call cache(), you should also call count() to materialize the RDD. I saw
>> >> in the results that the real difference is actually at the first step.
>> >> Adding an intercept is not a cheap operation for sparse vectors.
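The cost of adding an intercept comes from having to copy the whole index and value arrays of every sparse example just to append one constant feature. A sketch under an assumed simple (indices, values) representation, not MLlib's actual helper:

```scala
// Appending a constant intercept feature to a sparse vector: both backing
// arrays must be reallocated and copied, plus one entry added, for every
// example on every pass over the data.
object InterceptSketch {
  def appendBias(indices: Array[Int], values: Array[Double], size: Int)
      : (Array[Int], Array[Double], Int) = {
    val ni = java.util.Arrays.copyOf(indices, indices.length + 1)
    val nv = java.util.Arrays.copyOf(values, values.length + 1)
    ni(indices.length) = size // intercept goes in a new last slot
    nv(values.length) = 1.0
    (ni, nv, size + 1)
  }
}
```

For a dense vector the same operation is a single array copy with no index bookkeeping, which is why the relative cost shows up on the sparse side.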
>> >>
>> >> Best,
>> >> Xiangrui
>> >>
>> >> On Thu, Apr 24, 2014 at 12:53 AM, Xiangrui Meng <mengxr@gmail.com>
>> wrote:
>> >> > I don't think it is easy to make sparse faster than dense with this
>> >> > sparsity and feature dimension. You can try rcv1.binary, which should
>> >> > show the difference easily.
>> >> >
>> >> > David, the breeze operators used here are
>> >> >
>> >> > 1. DenseVector dot SparseVector
>> >> > 2. axpy DenseVector SparseVector
>> >> >
>> >> > However, the SparseVector is passed in as Vector[Double] instead of
>> >> > SparseVector[Double]. It might use the axpy implementation for
>> >> > [DenseVector, Vector] and call activeIterator. I didn't check whether
>> >> > you used multimethods on axpy.
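The dispatch concern can be illustrated in plain Scala: overload resolution happens at compile time on the declared type, so a sparse vector held as its supertype selects the generic overload unless something like multimethods re-dispatches at runtime. This is an illustrative sketch, not breeze's actual code:

```scala
// Overloads resolve on the *static* type: the same SparseVec picks the
// specialized overload when typed as SparseVec, but the generic one when
// typed as Vec. Runtime re-dispatch (as with multimethods) is needed to
// recover the fast path in the second case.
sealed trait Vec
final case class DenseVec(values: Array[Double]) extends Vec
final case class SparseVec(indices: Array[Int], values: Array[Double], size: Int)
  extends Vec

object Axpy {
  // Generic path, taking any Vec.
  def axpy(a: Double, x: Vec, y: Array[Double]): String = x match {
    case DenseVec(v) =>
      var i = 0; while (i < v.length) { y(i) += a * v(i); i += 1 }
      "generic"
    case SparseVec(idx, v, _) =>
      var k = 0; while (k < idx.length) { y(idx(k)) += a * v(k); k += 1 }
      "generic"
  }

  // Specialized path, touching only active elements.
  def axpy(a: Double, x: SparseVec, y: Array[Double]): String = {
    var k = 0
    while (k < x.indices.length) { y(x.indices(k)) += a * x.values(k); k += 1 }
    "specialized"
  }
}
```

Calling Axpy.axpy with an argument statically typed SparseVec hits the specialized overload; passing the same object typed as Vec hits the generic one, even though the runtime object is identical.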
>> >> >
>> >> > Best,
>> >> > Xiangrui
>> >> >
>> >> > On Wed, Apr 23, 2014 at 10:35 PM, DB Tsai <dbtsai@stanford.edu>
>> wrote:
>> >> >> The figure showing the Log-Likelihood vs Time can be found here.
>> >> >>
>> >> >>
>> >> >>
>> https://github.com/dbtsai/spark-lbfgs-benchmark/raw/fd703303fb1c16ef5714901739154728550becf4/result/a9a11M.pdf
>> >> >>
>> >> >> Let me know if you cannot open it. Thanks.
>> >> >>
>> >> >> Sincerely,
>> >> >>
>> >> >> DB Tsai
>> >> >> -------------------------------------------------------
>> >> >> My Blog: https://www.dbtsai.com
>> >> >> LinkedIn: https://www.linkedin.com/in/dbtsai
>> >> >>
>> >> >>
>> >> >> On Wed, Apr 23, 2014 at 9:34 PM, Shivaram Venkataraman
>> >> >> <shivaram@eecs.berkeley.edu> wrote:
>> >> >>> I don't think the attachment came through on the list. Could you
>> >> >>> upload the results somewhere and link to them?
>> >> >>>
>> >> >>>
>> >> >>> On Wed, Apr 23, 2014 at 9:32 PM, DB Tsai <dbtsai@dbtsai.com>
>> wrote:
>> >> >>>>
>> >> >>>> 123 features per row, and on average, 89% are zeros.
>> >> >>>> On Apr 23, 2014 9:31 PM, "Evan Sparks" <evan.sparks@gmail.com>
>> wrote:
>> >> >>>>
>> >> >>>> > What is the number of non-zeros per row (and the number of
>> >> >>>> > features) in the sparse case? We've hit some issues with breeze
>> >> >>>> > sparse support in the past, but for sufficiently sparse data it's
>> >> >>>> > still pretty good.
>> >> >>>> >
>> >> >>>> > > On Apr 23, 2014, at 9:21 PM, DB Tsai <dbtsai@stanford.edu> wrote:
>> >> >>>> > >
>> >> >>>> > > Hi all,
>> >> >>>> > >
>> >> >>>> > > I'm benchmarking Logistic Regression in MLlib using the newly
>> >> >>>> > > added optimizers LBFGS and GD. I'm using the same dataset and the
>> >> >>>> > > same methodology as in this paper,
>> >> >>>> > > http://www.csie.ntu.edu.tw/~cjlin/papers/l1.pdf
>> >> >>>> > >
>> >> >>>> > > I want to know how Spark scales when adding workers, and how the
>> >> >>>> > > optimizers and the input format (sparse or dense) impact
>> >> >>>> > > performance.
>> >> >>>> > >
>> >> >>>> > > The benchmark code can be found here,
>> >> >>>> > > https://github.com/dbtsai/spark-lbfgs-benchmark
>> >> >>>> > >
>> >> >>>> > > The first dataset I benchmarked is a9a, which is only 2.2MB. I
>> >> >>>> > > duplicated the dataset and made it 762MB, with 11M rows. This
>> >> >>>> > > dataset has 123 features, and 11% of the entries are non-zero
>> >> >>>> > > elements.
>> >> >>>> > >
>> >> >>>> > > In this benchmark, the whole dataset is cached in memory.
>> >> >>>> > >
>> >> >>>> > > As we expect, LBFGS converges faster than GD, and at some point,
>> >> >>>> > > no matter how hard we push GD, it converges more and more slowly.
>> >> >>>> > >
>> >> >>>> > > However, it's surprising that the sparse format runs slower than
>> >> >>>> > > the dense format. I did see that the sparse format takes a
>> >> >>>> > > significantly smaller amount of memory when caching the RDD, but
>> >> >>>> > > sparse is 40% slower than dense. I think sparse should be fast:
>> >> >>>> > > when we compute w^T x, since x is sparse, we can do it faster. I
>> >> >>>> > > wonder if there is anything I'm doing wrong.
>> >> >>>> > >
>> >> >>>> > > The attachment is the benchmark result.
>> >> >>>> > >
>> >> >>>> > > Thanks.
>> >> >>>> > >
>> >> >>>> > > Sincerely,
>> >> >>>> > >
>> >> >>>> > > DB Tsai
>> >> >>>> > > -------------------------------------------------------
>> >> >>>> > > My Blog: https://www.dbtsai.com
>> >> >>>> > > LinkedIn: https://www.linkedin.com/in/dbtsai
>> >> >>>> >
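DB's intuition above, that computing w^T x with a sparse x only needs to touch the non-zero entries, can be sketched as follows (illustrative, not MLlib's implementation):

```scala
// Dot product with a sparse x touches only its nnz entries: O(nnz) work
// instead of O(numFeatures).
object SparseDot {
  def dotDense(w: Array[Double], x: Array[Double]): Double = {
    var s = 0.0; var i = 0
    while (i < x.length) { s += w(i) * x(i); i += 1 }
    s
  }

  // x given as parallel (indices, values) arrays of its non-zeros.
  def dotSparse(w: Array[Double], indices: Array[Int], values: Array[Double]): Double = {
    var s = 0.0; var k = 0
    while (k < indices.length) { s += w(indices(k)) * values(k); k += 1 }
    s
  }
}
```

Whether the O(nnz) advantage wins in practice depends on the per-element overhead of the sparse access path, which is exactly where a generic activeIterator-style code path can eat the theoretical gain.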
>> >> >>>
>> >> >>>
>> >
>> >
>>
>
>
