spark-user mailing list archives

From Gourav Sengupta <gourav.sengu...@gmail.com>
Subject Re: dataframe.foreach VS dataframe.collect().foreach
Date Tue, 26 Jul 2016 13:08:40 GMT
And Pedro has made sense of a world running amok, scared, and in a drunken
stupor.

Regards,
Gourav

On Tue, Jul 26, 2016 at 2:01 PM, Pedro Rodriguez <ski.rodriguez@gmail.com>
wrote:

> I am not 100% sure, as I haven't tried this out, but there is a huge
> difference between the two. Both foreach and collect are actions, regardless
> of whether or not the data frame is empty.
>
> Doing a collect will bring all the results back to the driver, possibly
> forcing it to run out of memory. Foreach will apply your function to each
> element of the DataFrame, but will do so across the cluster. This behavior
> is useful when you need to do something custom for each element
> (perhaps save to a db for which there is no driver, or make an HTTP
> request per element; be careful here, though, because of the overhead cost).
>
> In your example, I am going to assume that hrecords is something like a
> list buffer. The reason it will be empty is that each worker will get
> sent an empty list (it's captured in the closure for foreach) and append to
> it. The instance of the list at the driver doesn't know about what happened
> at the workers, so it's empty.
>
> I don't know why Chanh's comment applies here since I am guessing the df
> is not empty.
>
> On Tue, Jul 26, 2016 at 1:53 AM, kevin <kiss.kevin119@gmail.com> wrote:
>
>> thank you Chanh
>>
>> 2016-07-26 15:34 GMT+08:00 Chanh Le <giaosudau@gmail.com>:
>>
>>> Hi Ken,
>>>
>>> *blacklistDF -> just a DataFrame*
>>> Spark is lazy: only when you call something like *collect, take, write*
>>> will it execute the whole process (*such as the map or filter you do
>>> before you collect*).
>>> That means until you call collect, Spark *does nothing*, so your df would
>>> not have any data -> can’t call foreach.
>>> Calling collect executes the process -> get data -> foreach is ok.
>>>
>>>
>>> On Jul 26, 2016, at 2:30 PM, kevin <kiss.kevin119@gmail.com> wrote:
>>>
>>>  blacklistDF.collect()
>>>
>>>
>>>
>>
>
>
> --
> Pedro Rodriguez
> PhD Student in Distributed Machine Learning | CU Boulder
> UC Berkeley AMPLab Alumni
>
> ski.rodriguez@gmail.com | pedrorodriguez.io | 909-353-4423
> Github: github.com/EntilZha | LinkedIn:
> https://www.linkedin.com/in/pedrorodriguezscience
>
>
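
To illustrate Pedro's point about hrecords staying empty, here is a minimal sketch in plain Python. It does not use Spark at all: it only imitates the behaviour by giving each "worker" a pickled copy of the captured list, which is an assumption standing in for how a closure is shipped to executors. The names simulate_foreach, simulate_collect and the sample partitions are hypothetical and are not part of any Spark API.

```python
import pickle


def simulate_foreach(partitions, func, captured):
    """Imitate a distributed foreach: every "worker" receives its own
    deserialized copy of the captured state, never the driver's object."""
    worker_states = []
    for partition in partitions:
        # The pickle round trip stands in for shipping the closure to a worker.
        worker_copy = pickle.loads(pickle.dumps(captured))
        for row in partition:
            func(worker_copy, row)
        worker_states.append(worker_copy)
    return worker_states


def simulate_collect(partitions):
    """Imitate collect(): bring every row back to the driver as one list."""
    return [row for partition in partitions for row in partition]


# Hypothetical data, split across two "workers".
partitions = [["a", "b"], ["c"]]

# The list lives on the driver, like hrecords in the original question.
hrecords = []
worker_states = simulate_foreach(
    partitions, lambda buf, row: buf.append(row), hrecords
)

print(hrecords)       # [] : the driver-side list was never touched
print(worker_states)  # [['a', 'b'], ['c']] : each worker filled only its copy

# Collecting first and then looping runs the append on the driver itself.
for row in simulate_collect(partitions):
    hrecords.append(row)

print(hrecords)       # ['a', 'b', 'c']
```

The second half mirrors the collect().foreach variant: the rows are brought back to the driver before the loop runs, so the append really does change the driver's list, at the cost of holding every row in driver memory.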
