spark-dev mailing list archives

From DB Tsai <dbt...@stanford.edu>
Subject Re: MLlib - logistic regression with GD vs LBFGS, sparse vs dense benchmark result
Date Mon, 28 Apr 2014 21:36:25 GMT
Hi David,

I got most of it working, and the loss is now monotonically decreasing when
I collect the history from the iterator of states.

However, in the costFun I need to know which iteration miniBatch is
currently on; within one iteration, if the optimizer calls costFun several
times for line search, it should pass the same iteration number into
costFun each time. So I pass the lbfgs optimizer into costFun as in the
following code, and try to read the current iteration from the lbfgs
object. Unfortunately, it seems that the current iteration is not
available in this object.

Any idea how to get this inside costFun? Originally, I had a counter
inside costFun which gives the # of iterations. However, it's not what I
want now since it also counts the line-search evaluations.

val lbfgs = new BreezeLBFGS[BDV[Double]](maxNumIterations, numCorrections,
  convergenceTol)

val costFun = new CostFun(data, gradient, updater, miniBatchFraction, lbfgs,
  miniBatchSize)

val states = lbfgs.iterations(new CachedDiffFunction(costFun),
  initialWeights.toBreeze.toDenseVector)
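For what it's worth to later readers: since `iterations` returns a lazy iterator of states, one workaround is to hold the counter outside costFun and bump it once per emitted state, so the repeated line-search evaluations inside a single iteration all see the same value. A plain-Scala sketch (the lazy `iterations` stand-in and the trivial cost function are made up for illustration; this is not the fix that was actually adopted):

```scala
// Sketch: a stand-in for lbfgs.iterations, which lazily evaluates the cost
// function (possibly several times per iteration, for line search).
object IterationCounterSketch {
  final case class State(iter: Int, value: Double)

  // Each of `numIters` iterations calls costFun `evalsPerIter` times
  // (imitating line search) and lazily emits one State.
  def iterations(costFun: () => Double,
                 numIters: Int,
                 evalsPerIter: Int): Iterator[State] =
    (1 to numIters).iterator.map { i =>
      val values = (1 to evalsPerIter).map(_ => costFun())
      State(i, values.last)
    }

  def run(): (Int, Seq[Int]) = {
    var currentIteration = 0 // shared counter, visible inside the cost function
    val seen = scala.collection.mutable.ArrayBuffer.empty[Int]

    val costFun = () => {
      seen += currentIteration // all line-search calls in one iteration agree
      0.0
    }

    iterations(costFun, numIters = 3, evalsPerIter = 2).foreach { _ =>
      currentIteration += 1 // advances exactly once per emitted state
    }
    (currentIteration, seen.toSeq)
  }
}
```

Note the counter advances after each state is emitted, so the first iteration's evaluations see 0; how the real optimizer numbers its iterations is a separate choice.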


Thanks.


Sincerely,

DB Tsai
-------------------------------------------------------
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai


On Mon, Apr 28, 2014 at 8:55 AM, David Hall <dlwh@cs.berkeley.edu> wrote:

> That's right.
>
> FWIW, caching should be automatic now, but it might be that the version
> of Breeze you're using doesn't do that yet.
>
> Also, in breeze.util._ there's an implicit that adds a tee method to
> iterators, and also a last method. Both are useful for things like this.
>
> -- David
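For readers following along, the `tee`/`last` pattern David mentions can be imitated without Breeze (the real implicits live in `breeze.util._`; the two helpers below are hand-rolled stand-ins, not Breeze's implementation): `tee` runs a side effect on each element as it streams past, and `last` drains the iterator to its final element, which together give a loss history plus the final state in one pass.

```scala
object IteratorTeeSketch {
  // Hand-rolled imitation of a tee: apply `f` to each element as a side
  // effect while passing the element through unchanged.
  def tee[A](it: Iterator[A])(f: A => Unit): Iterator[A] =
    it.map { a => f(a); a }

  // Drain the iterator and return its final element.
  def last[A](it: Iterator[A]): A = {
    var cur = it.next()
    while (it.hasNext) cur = it.next()
    cur
  }

  def run(): (Seq[Double], Double) = {
    val lossHistory = scala.collection.mutable.ArrayBuffer.empty[Double]
    val stateValues = Iterator(3.0, 2.0, 1.5) // stand-in for LBFGS state values
    val finalLoss = last(tee(stateValues)(lossHistory += _))
    (lossHistory.toSeq, finalLoss)
  }
}
```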
>
>
> On Sun, Apr 27, 2014 at 11:53 PM, DB Tsai <dbtsai@stanford.edu> wrote:
>
>> I think I figured it out. Instead of calling minimize and recording the
>> loss in the DiffFunction, I should do the following.
>>
>> val states = lbfgs.iterations(new CachedDiffFunction(costFun),
>> initialWeights.toBreeze.toDenseVector)
>> states.foreach(state => lossHistory.append(state.value))
>>
>> All the losses in states should be decreasing now. Am I right?
>>
>>
>>
>> Sincerely,
>>
>> DB Tsai
>> -------------------------------------------------------
>> My Blog: https://www.dbtsai.com
>> LinkedIn: https://www.linkedin.com/in/dbtsai
>>
>>
>> On Sun, Apr 27, 2014 at 11:31 PM, DB Tsai <dbtsai@stanford.edu> wrote:
>>
>>> Also, how many rejected steps will terminate the optimization
>>> process? How is that related to "numberOfImprovementFailures"?
>>>
>>> Thanks.
>>>
>>>
>>> Sincerely,
>>>
>>> DB Tsai
>>> -------------------------------------------------------
>>> My Blog: https://www.dbtsai.com
>>> LinkedIn: https://www.linkedin.com/in/dbtsai
>>>
>>>
>>> On Sun, Apr 27, 2014 at 11:28 PM, DB Tsai <dbtsai@stanford.edu> wrote:
>>>
>>>> Hi David,
>>>>
>>>> I'm recording the loss history in the DiffFunction implementation, and
>>>> that's why the rejected steps are also recorded in my loss history.
>>>>
>>>> Is there any API in Breeze LBFGS to get the history that already
>>>> excludes the rejected steps? Or should I just call the "iterations"
>>>> method and check "iteratingShouldStop" instead?
>>>>
>>>> Thanks.
>>>>
>>>>
>>>> Sincerely,
>>>>
>>>> DB Tsai
>>>> -------------------------------------------------------
>>>> My Blog: https://www.dbtsai.com
>>>> LinkedIn: https://www.linkedin.com/in/dbtsai
>>>>
>>>>
>>>> On Fri, Apr 25, 2014 at 3:10 PM, David Hall <dlwh@cs.berkeley.edu> wrote:
>>>>
>>>>> LBFGS will not take a step that sends the objective value up. It might
>>>>> try a step that is "too big" and reject it, so if you're just logging
>>>>> everything that gets tried by LBFGS, you could see that. The "iterations"
>>>>> method of the minimizer should never return an increasing objective value.
>>>>> If you're regularizing, are you including the regularizer in the objective
>>>>> value computation?
>>>>>
>>>>> GD is almost never worth your time.
>>>>>
>>>>> -- David
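To make the regularizer point concrete (a sketch with hypothetical numbers, assuming an L2 penalty; this is not code from the benchmark): the value the minimizer sees, and therefore the value guaranteed not to increase across accepted iterations, is the full objective dataLoss + (λ/2)·‖w‖², so logging dataLoss alone can look non-monotonic even when the optimizer is behaving.

```scala
object RegularizedObjectiveSketch {
  // Full L2-regularized objective: dataLoss + (lambda / 2) * ||w||^2.
  // This is the quantity to log, not dataLoss on its own.
  def objective(dataLoss: Double, w: Array[Double], lambda: Double): Double =
    dataLoss + 0.5 * lambda * w.map(x => x * x).sum
}
```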
>>>>>
>>>>> On Fri, Apr 25, 2014 at 2:57 PM, DB Tsai <dbtsai@stanford.edu> wrote:
>>>>>
>>>>>> Another interesting benchmark.
>>>>>>
>>>>>> *News20 dataset - 0.14M rows, 1,355,191 features, 0.034% non-zero
>>>>>> elements.*
>>>>>>
>>>>>> LBFGS converges in 70 seconds, while GD seems not to be progressing.
>>>>>>
>>>>>> The dense feature vectors would be too big to fit in memory, so I
>>>>>> only conducted the sparse benchmark.
>>>>>>
>>>>>> I saw that sometimes the loss bumps up, which is weird to me. Since
>>>>>> the cost function of logistic regression is convex, it should be
>>>>>> monotonically decreasing.  David, any suggestion?
>>>>>>
>>>>>> The detailed figure:
>>>>>>
>>>>>> https://github.com/dbtsai/spark-lbfgs-benchmark/raw/0b774682e398b4f7e0ce01a69c44000eb0e73454/result/news20.pdf
>>>>>>
>>>>>>
>>>>>> *Rcv1 dataset - 6.8M rows, 677,399 features, 0.15% non-zero elements.*
>>>>>>
>>>>>> LBFGS converges in 25 seconds, while GD also seems not to be
>>>>>> progressing.
>>>>>>
>>>>>> I only conducted the sparse benchmark, for the same reason. I also
>>>>>> saw the loss bump up, for unknown reasons.
>>>>>>
>>>>>> The detailed figure:
>>>>>>
>>>>>> https://github.com/dbtsai/spark-lbfgs-benchmark/raw/0b774682e398b4f7e0ce01a69c44000eb0e73454/result/rcv1.pdf
>>>>>>
>>>>>>
>>>>>> Sincerely,
>>>>>>
>>>>>> DB Tsai
>>>>>> -------------------------------------------------------
>>>>>> My Blog: https://www.dbtsai.com
>>>>>> LinkedIn: https://www.linkedin.com/in/dbtsai
>>>>>>
>>>>>>
>>>>>> On Thu, Apr 24, 2014 at 2:36 PM, DB Tsai <dbtsai@stanford.edu> wrote:
>>>>>>
>>>>>>> rcv1.binary is very sparse (0.15% non-zero elements), so the dense
>>>>>>> format will not run because it runs out of memory. But the sparse
>>>>>>> format runs really well.
>>>>>>>
>>>>>>>
>>>>>>> Sincerely,
>>>>>>>
>>>>>>> DB Tsai
>>>>>>> -------------------------------------------------------
>>>>>>> My Blog: https://www.dbtsai.com
>>>>>>> LinkedIn: https://www.linkedin.com/in/dbtsai
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Apr 24, 2014 at 1:54 PM, DB Tsai <dbtsai@stanford.edu> wrote:
>>>>>>>
>>>>>>>> I'm starting the timer in runMiniBatchSGD after val numExamples =
>>>>>>>> data.count()
>>>>>>>>
>>>>>>>> See the following. Running the rcv1 dataset now, and will update
>>>>>>>> soon.
>>>>>>>>
>>>>>>>>     val startTime = System.nanoTime()
>>>>>>>>     for (i <- 1 to numIterations) {
>>>>>>>>       // Sample a subset (fraction miniBatchFraction) of the total data
>>>>>>>>       // compute and sum up the subgradients on this subset (this is one map-reduce)
>>>>>>>>       val (gradientSum, lossSum) = data.sample(false, miniBatchFraction, 42 + i)
>>>>>>>>         .aggregate((BDV.zeros[Double](weights.size), 0.0))(
>>>>>>>>           seqOp = (c, v) => (c, v) match { case ((grad, loss), (label, features)) =>
>>>>>>>>             val l = gradient.compute(features, label, weights, Vectors.fromBreeze(grad))
>>>>>>>>             (grad, loss + l)
>>>>>>>>           },
>>>>>>>>           combOp = (c1, c2) => (c1, c2) match { case ((grad1, loss1), (grad2, loss2)) =>
>>>>>>>>             (grad1 += grad2, loss1 + loss2)
>>>>>>>>           })
>>>>>>>>
>>>>>>>>       /**
>>>>>>>>        * NOTE(Xinghao): lossSum is computed using the weights from the previous iteration
>>>>>>>>        * and regVal is the regularization value computed in the previous iteration as well.
>>>>>>>>        */
>>>>>>>>       stochasticLossHistory.append(lossSum / miniBatchSize + regVal)
>>>>>>>>       val update = updater.compute(
>>>>>>>>         weights, Vectors.fromBreeze(gradientSum / miniBatchSize), stepSize, i, regParam)
>>>>>>>>       weights = update._1
>>>>>>>>       regVal = update._2
>>>>>>>>       timeStamp.append(System.nanoTime() - startTime)
>>>>>>>>     }
>>>>>>>>
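The sampled map-reduce in the loop above can be mimicked locally (a sketch with made-up data; Spark's RDD.aggregate and MLlib's gradient.compute are replaced by a foldLeft and a squared-loss gradient, both illustrative only): the seqOp side folds each (label, features) example into a running (gradientSum, lossSum) pair, and combOp would merge partial pairs from different partitions in the same element-wise way.

```scala
object MiniBatchAggregateSketch {
  // Squared-loss gradient for a linear model, standing in for gradient.compute.
  def gradAndLoss(features: Array[Double], label: Double, w: Array[Double])
      : (Array[Double], Double) = {
    val pred = features.zip(w).map { case (x, wi) => x * wi }.sum
    val err  = pred - label
    (features.map(_ * err), 0.5 * err * err)
  }

  def run(): (Seq[Double], Double) = {
    val w    = Array(1.0, -1.0)
    val data = Seq((1.0, Array(1.0, 0.0)), (0.0, Array(0.0, 1.0)))

    // The seqOp side of the aggregate: fold each example into the running sums.
    val (gradSum, lossSum) = data.foldLeft((Array(0.0, 0.0), 0.0)) {
      case ((grad, loss), (label, features)) =>
        val (g, l) = gradAndLoss(features, label, w)
        (grad.zip(g).map { case (a, b) => a + b }, loss + l)
    }
    (gradSum.toSeq, lossSum)
  }
}
```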
>>>>>>>>
>>>>>>>> Sincerely,
>>>>>>>>
>>>>>>>> DB Tsai
>>>>>>>> -------------------------------------------------------
>>>>>>>> My Blog: https://www.dbtsai.com
>>>>>>>> LinkedIn: https://www.linkedin.com/in/dbtsai
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, Apr 24, 2014 at 1:44 PM, Xiangrui Meng <mengxr@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> I don't understand why sparse falls behind dense so much at the
>>>>>>>>> very first iteration. I didn't see count() called in
>>>>>>>>> https://github.com/dbtsai/spark-lbfgs-benchmark/blob/master/src/main/scala/org/apache/spark/mllib/benchmark/BinaryLogisticRegression.scala
>>>>>>>>> Maybe you have local uncommitted changes.
>>>>>>>>>
>>>>>>>>> Best,
>>>>>>>>> Xiangrui
>>>>>>>>>
>>>>>>>>> On Thu, Apr 24, 2014 at 11:26 AM, DB Tsai <dbtsai@stanford.edu> wrote:
>>>>>>>>> > Hi Xiangrui,
>>>>>>>>> >
>>>>>>>>> > Yes, I'm using yarn-cluster mode, and I did check that the # of
>>>>>>>>> > executors I specified is the same as the number actually running.
>>>>>>>>> >
>>>>>>>>> > For caching and materialization, I start the timer in the
>>>>>>>>> > optimizer after calling count(); as a result, the time for
>>>>>>>>> > materializing the cache isn't in the benchmark.
>>>>>>>>> >
>>>>>>>>> > The difference you saw is actually from the dense vs. sparse
>>>>>>>>> > feature vectors. For LBFGS with dense features, you can see that
>>>>>>>>> > the first iteration takes the same time; the same is true for GD.
>>>>>>>>> >
>>>>>>>>> > I'm going to run rcv1.binary, which only has 0.15% non-zero
>>>>>>>>> > elements, to verify the hypothesis.
>>>>>>>>> >
>>>>>>>>> >
>>>>>>>>> > Sincerely,
>>>>>>>>> >
>>>>>>>>> > DB Tsai
>>>>>>>>> > -------------------------------------------------------
>>>>>>>>> > My Blog: https://www.dbtsai.com
>>>>>>>>> > LinkedIn: https://www.linkedin.com/in/dbtsai
>>>>>>>>> >
>>>>>>>>> >
>>>>>>>>> > On Thu, Apr 24, 2014 at 1:09 AM, Xiangrui Meng <mengxr@gmail.com> wrote:
>>>>>>>>> >>
>>>>>>>>> >> Hi DB,
>>>>>>>>> >>
>>>>>>>>> >> I saw you are using yarn-cluster mode for the benchmark. I
>>>>>>>>> >> tested the yarn-cluster mode and found that YARN does not
>>>>>>>>> >> always give you the exact number of executors requested. Just
>>>>>>>>> >> want to confirm that you've checked the number of executors.
>>>>>>>>> >>
>>>>>>>>> >> The second thing to check is that in the benchmark code, after
>>>>>>>>> >> you call cache, you should also call count() to materialize the
>>>>>>>>> >> RDD. I saw in the result that the real difference is actually
>>>>>>>>> >> at the first step. Adding an intercept is not a cheap operation
>>>>>>>>> >> for sparse vectors.
>>>>>>>>> >>
>>>>>>>>> >> Best,
>>>>>>>>> >> Xiangrui
>>>>>>>>> >>
>>>>>>>>> >> On Thu, Apr 24, 2014 at 12:53 AM, Xiangrui Meng <mengxr@gmail.com> wrote:
>>>>>>>>> >> > I don't think it is easy to make sparse faster than dense
>>>>>>>>> >> > with this sparsity and feature dimension. You can try
>>>>>>>>> >> > rcv1.binary, which should show the difference easily.
>>>>>>>>> >> >
>>>>>>>>> >> > David, the breeze operators used here are
>>>>>>>>> >> >
>>>>>>>>> >> > 1. DenseVector dot SparseVector
>>>>>>>>> >> > 2. axpy DenseVector SparseVector
>>>>>>>>> >> >
>>>>>>>>> >> > However, the SparseVector is passed in as Vector[Double]
>>>>>>>>> >> > instead of SparseVector[Double]. It might use the axpy impl
>>>>>>>>> >> > of [DenseVector, Vector] and call activeIterator. I didn't
>>>>>>>>> >> > check whether you used multimethods on axpy.
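For context, the distinction Xiangrui is drawing can be sketched with a hand-rolled sparse representation (parallel index/value arrays; this is illustrative only and says nothing about Breeze's actual dispatch): an axpy specialized for a sparse x touches only x's active entries instead of walking the full length of y.

```scala
object SparseAxpySketch {
  // y += a * x where x is sparse: iterate only over x's active
  // (index, value) pairs instead of over every entry of y.
  def axpy(a: Double, xIndices: Array[Int], xValues: Array[Double],
           y: Array[Double]): Unit = {
    var k = 0
    while (k < xIndices.length) {
      y(xIndices(k)) += a * xValues(k)
      k += 1
    }
  }
}
```

Falling back to a generic [DenseVector, Vector] implementation that iterates over every position would forfeit exactly this saving.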
>>>>>>>>> >> >
>>>>>>>>> >> > Best,
>>>>>>>>> >> > Xiangrui
>>>>>>>>> >> >
>>>>>>>>> >> > On Wed, Apr 23, 2014 at 10:35 PM, DB Tsai <dbtsai@stanford.edu> wrote:
>>>>>>>>> >> >> The figure showing the Log-Likelihood vs Time can be found
>>>>>>>>> >> >> here.
>>>>>>>>> >> >>
>>>>>>>>> >> >> https://github.com/dbtsai/spark-lbfgs-benchmark/raw/fd703303fb1c16ef5714901739154728550becf4/result/a9a11M.pdf
>>>>>>>>> >> >>
>>>>>>>>> >> >> Let me know if you cannot open it. Thanks.
>>>>>>>>> >> >>
>>>>>>>>> >> >> Sincerely,
>>>>>>>>> >> >>
>>>>>>>>> >> >> DB Tsai
>>>>>>>>> >> >> -------------------------------------------------------
>>>>>>>>> >> >> My Blog: https://www.dbtsai.com
>>>>>>>>> >> >> LinkedIn: https://www.linkedin.com/in/dbtsai
>>>>>>>>> >> >>
>>>>>>>>> >> >>
>>>>>>>>> >> >> On Wed, Apr 23, 2014 at 9:34 PM, Shivaram Venkataraman <shivaram@eecs.berkeley.edu> wrote:
>>>>>>>>> >> >>> I don't think the attachment came through on the list.
>>>>>>>>> >> >>> Could you upload the results somewhere and link to them?
>>>>>>>>> >> >>>
>>>>>>>>> >> >>>
>>>>>>>>> >> >>> On Wed, Apr 23, 2014 at 9:32 PM, DB Tsai <dbtsai@dbtsai.com> wrote:
>>>>>>>>> >> >>>>
>>>>>>>>> >> >>>> 123 features per row, and on average, 89% are zeros.
>>>>>>>>> >> >>>> On Apr 23, 2014 9:31 PM, "Evan Sparks" <evan.sparks@gmail.com> wrote:
>>>>>>>>> >> >>>>
>>>>>>>>> >> >>>> > What is the number of non-zeros per row (and the number
>>>>>>>>> >> >>>> > of features) in the sparse case? We've hit some issues
>>>>>>>>> >> >>>> > with breeze sparse support in the past, but for
>>>>>>>>> >> >>>> > sufficiently sparse data it's still pretty good.
>>>>>>>>> >> >>>> >
>>>>>>>>> >> >>>> > > On Apr 23, 2014, at 9:21 PM, DB Tsai <dbtsai@stanford.edu> wrote:
>>>>>>>>> >> >>>> > >
>>>>>>>>> >> >>>> > > Hi all,
>>>>>>>>> >> >>>> > >
>>>>>>>>> >> >>>> > > I'm benchmarking Logistic Regression in MLlib using
>>>>>>>>> >> >>>> > > the newly added optimizers LBFGS and GD. I'm using the
>>>>>>>>> >> >>>> > > same dataset and the same methodology as in this paper,
>>>>>>>>> >> >>>> > > http://www.csie.ntu.edu.tw/~cjlin/papers/l1.pdf
>>>>>>>>> >> >>>> > >
>>>>>>>>> >> >>>> > > I want to know how Spark scales while adding workers,
>>>>>>>>> >> >>>> > > and how the optimizers and the input format (sparse or
>>>>>>>>> >> >>>> > > dense) impact performance.
>>>>>>>>> >> >>>> > >
>>>>>>>>> >> >>>> > > The benchmark code can be found here,
>>>>>>>>> >> >>>> > > https://github.com/dbtsai/spark-lbfgs-benchmark
>>>>>>>>> >> >>>> > >
>>>>>>>>> >> >>>> > > The first dataset I benchmarked is a9a, which is only
>>>>>>>>> >> >>>> > > 2.2MB. I duplicated the dataset and made it 762MB to
>>>>>>>>> >> >>>> > > have 11M rows. This dataset has 123 features, and 11%
>>>>>>>>> >> >>>> > > of the data are non-zero elements.
>>>>>>>>> >> >>>> > >
>>>>>>>>> >> >>>> > > In this benchmark, all of the dataset is cached in
>>>>>>>>> >> >>>> > > memory.
>>>>>>>>> >> >>>> > >
>>>>>>>>> >> >>>> > > As we expect, LBFGS converges faster than GD, and at
>>>>>>>>> >> >>>> > > some point, no matter how we push GD, it will converge
>>>>>>>>> >> >>>> > > slower and slower.
>>>>>>>>> >> >>>> > >
>>>>>>>>> >> >>>> > > However, it's surprising that the sparse format runs
>>>>>>>>> >> >>>> > > slower than the dense format. I did see that the
>>>>>>>>> >> >>>> > > sparse format takes a significantly smaller amount of
>>>>>>>>> >> >>>> > > memory when caching the RDD, but sparse is 40% slower
>>>>>>>>> >> >>>> > > than dense. I think sparse should be fast, since when
>>>>>>>>> >> >>>> > > we compute x wT, x is sparse, so we can do it faster.
>>>>>>>>> >> >>>> > > I wonder if there is anything I'm doing wrong.
>>>>>>>>> >> >>>> > >
>>>>>>>>> >> >>>> > > The attachment is the benchmark result.
>>>>>>>>> >> >>>> > >
>>>>>>>>> >> >>>> > > Thanks.
>>>>>>>>> >> >>>> > >
>>>>>>>>> >> >>>> > > Sincerely,
>>>>>>>>> >> >>>> > >
>>>>>>>>> >> >>>> > > DB Tsai
>>>>>>>>> >> >>>> > > -------------------------------------------------------
>>>>>>>>> >> >>>> > > My Blog: https://www.dbtsai.com
>>>>>>>>> >> >>>> > > LinkedIn: https://www.linkedin.com/in/dbtsai
>>>>>>>>> >> >>>> >
>>>>>>>>> >> >>>
>>>>>>>>> >> >>>
>>>>>>>>> >
>>>>>>>>> >
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
