spark-user mailing list archives

From Imran Rashid <>
Subject Re: what is the best way to implement mini batches?
Date Mon, 15 Dec 2014 20:02:22 GMT
I'm a little confused by some of the responses.  It seems like there are
two different issues being discussed here:

1.  How to turn a sequential algorithm into something that works on spark.
Eg deal with the fact that data is split into partitions which are
processed in parallel (though within a partition, data is processed
sequentially).  I'm guessing folks are particularly interested in online
machine learning algos, which often have a point update and a mini batch
update.

2.  How to take a one-point-at-a-time view of the data and convert it
into a mini-batch view of the data.

(2) is pretty straightforward, eg with iterator.grouped(batchSize), or by
manually putting data into your own buffer etc.  This works for creating mini
batches *within* one partition in the context of spark.
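
A minimal sketch of (2), in plain Scala with no Spark dependency.  The
iterator stands in for the records of one partition (in Spark it would be
the argument of the function passed to rdd.mapPartitions).  The names
MiniBatchSketch, miniBatches and batchMean are hypothetical, and the batch
mean is only a placeholder for whatever per-batch update you actually need:

```scala
object MiniBatchSketch {
  // Cut one partition's records into mini batches in a single pass.
  // The last batch may be smaller than batchSize.
  def miniBatches[T](partition: Iterator[T], batchSize: Int): Iterator[Seq[T]] =
    partition.grouped(batchSize)

  // Placeholder per-batch computation (assumption: a real algorithm would
  // compute e.g. a gradient step here instead of a mean).
  def batchMean(batch: Seq[Double]): Double = batch.sum / batch.size

  // Seven records, batch size 3: batches of sizes 3, 3 and 1.
  val means: List[Double] =
    miniBatches(Iterator(1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0), 3)
      .map(batchMean)
      .toList

  def main(args: Array[String]): Unit = println(means)
}
```

Note that this only batches within a single partition; it says nothing
about how results from different partitions get combined.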

But problem (1) is completely separate, and there is no general solution.
It really depends on the specifics of what you're trying to do.

Some of the suggestions on this thread seem like they are basically just
falling back to sequential data processing ... but realllllllly inefficient
sequential processing.  Eg, it doesn't make sense to do a full scan of
your data with spark, and ignore all the records but the few that are in
the next mini batch.
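
To put rough numbers on that, here is a back-of-the-envelope sketch that
counts record reads only.  The dataset size and batch size are made-up
illustration values, and ScanCostSketch and its methods are hypothetical
names, not anything from Spark:

```scala
object ScanCostSketch {
  // Records read when every mini batch is produced by its own full scan
  // that keeps only the records belonging to that batch.
  def readsWithFullScans(n: Long, batchSize: Long): Long = {
    val numBatches = (n + batchSize - 1) / batchSize // ceiling division
    numBatches * n
  }

  // Records read when all batches are cut from one pass over the data
  // (eg with iterator.grouped inside each partition).
  def readsWithSinglePass(n: Long): Long = n

  def main(args: Array[String]): Unit = {
    val n = 1000000L // assumed number of records
    val b = 100L     // assumed mini batch size
    println(s"full scan per batch: ${readsWithFullScans(n, b)} reads")
    println(s"single pass:         ${readsWithSinglePass(n)} reads")
  }
}
```

With these assumed values the scan-per-batch approach reads each record
once per batch, ie 10,000 times, instead of once.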

It's completely reasonable to just sequentially process all the data if
that works for you.  But then it doesn't make sense to use spark, since
you're not gaining anything from it.

Hope this helps, apologies if I just misunderstood the other suggested
On Dec 14, 2014 8:35 PM, "Earthson" <> wrote:

> I think it could be done like:
> 1. using mapPartition to randomly drop some partition
> 2. drop some elements randomly(for selected partition)
> 3. calculate gradient step for selected elements
> I don't think fixed step is needed, but fixed step could be done:
> 1. zipWithIndex
> 2. create ShuffleRDD based on the index(eg. using index/10 as key)
> 3. using mapPartition to calculate each batch
> I also have a question:
> Can mini batches run in parallel?
> I think running all batches in parallel is just like full batch GD in some cases.
