spark-user mailing list archives

From Sean Owen <so...@cloudera.com>
Subject Re: Which one is faster / consumes less memory: collect() or count()?
Date Thu, 26 Feb 2015 15:20:41 GMT
Yeah, we discussed this on the list a short while ago. The extra
overhead of count() is pretty minimal. Still, you could wrap this up as
a utility method. There was even a proposal to add some 'materialize'
method to RDD.
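The cost asymmetry behind the original question is easy to see even outside Spark: counting touches each element once and keeps only a counter, while collecting buffers every element before returning. A self-contained plain-Java sketch (no Spark; the two methods mirror the shapes of Utils.getIteratorSize and collect()'s iter.toArray, not Spark's actual code):

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.stream.IntStream;

public class MaterializeCost {
    // Shape of Utils.getIteratorSize: walk the iterator, keep only a counter.
    // O(1) extra memory regardless of input size.
    public static long countOf(Iterator<Integer> it) {
        long n = 0;
        while (it.hasNext()) {
            it.next();
            n++;
        }
        return n;
    }

    // Shape of collect(): buffer every element before returning.
    // O(n) extra memory -- the whole input ends up in one in-memory list.
    public static List<Integer> collectAll(Iterator<Integer> it) {
        List<Integer> out = new ArrayList<>();
        while (it.hasNext()) {
            out.add(it.next());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(countOf(IntStream.range(0, 1000).iterator()));
        System.out.println(collectAll(IntStream.range(0, 1000).iterator()).size());
    }
}
```

In Spark the difference is larger still, because collect() also ships every partition's elements to the driver over the network, while count() only ships one long per partition.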

PS you can make your Java a little less verbose by omitting "throws
Exception" and "return;"
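To make the PS concrete: with a Java 8 lambda the whole no-op function collapses to `i -> {}`. A self-contained sketch outside Spark (java.util.function.Consumer stands in here for Spark's VoidFunction, which has the same single-method shape):

```java
import java.util.Iterator;
import java.util.function.Consumer;

public class NoOpStyles {
    // Pre-Java-8 shape, as in the quoted code: anonymous class, explicit body.
    public static final Consumer<Iterator<String>> VERBOSE =
        new Consumer<Iterator<String>>() {
            @Override
            public void accept(Iterator<String> it) {
                return; // the redundant "return;" the PS suggests dropping
            }
        };

    // Java 8 shape: the same no-op as a lambda.
    public static final Consumer<Iterator<String>> TERSE = it -> {};

    public static void main(String[] args) {
        Iterator<String> it = java.util.Arrays.asList("a").iterator();
        VERBOSE.accept(it);
        TERSE.accept(it);
        // Neither consumer touched the iterator.
        System.out.println(it.hasNext());
    }
}
```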

On Thu, Feb 26, 2015 at 3:07 PM, Emre Sevinc <emre.sevinc@gmail.com> wrote:
> Hello Sean,
>
> Thank you for your advice. Based on your suggestion, I've modified the code
> into the following (and once again admired the easy (!) verbosity of Java
> compared to the 'complex and hard to understand' (!) brevity of Scala):
>
> javaDStream.foreachRDD(
>     new Function<JavaRDD<String>, Void>() {
>       @Override
>       public Void call(JavaRDD<String> stringJavaRDD) throws Exception {
>         stringJavaRDD.foreachPartition(
>             new VoidFunction<Iterator<String>>() {
>               @Override
>               public void call(Iterator<String> iteratorString) {
>                 return;
>               }
>             }
>         );
>
>         return null;
>       }
>     });
>
>
> I've tested the above in my application and also observed it with VisualVM,
> but could not see a dramatic difference in speed (and only a small difference
> in heap usage) compared to my initial version, which just calls .count() in a
> foreachRDD block.
>
> Nevertheless, I'll run more experiments to see whether any differences show
> up in terms of speed or memory.
>
> Kind regards,
>
> Emre Sevinç
> http://www.bigindustries.be/
>
>
>
>
>
> On Thu, Feb 26, 2015 at 2:34 PM, Sean Owen <sowen@cloudera.com> wrote:
>>
>> Those do quite different things. One counts the data; the other copies
>> all of the data to the driver.
>>
>> The fastest way to materialize an RDD that I know of is
>> foreachPartition(i => None)  (or equivalent no-op VoidFunction in
>> Java)
>>
>> On Thu, Feb 26, 2015 at 1:28 PM, Emre Sevinc <emre.sevinc@gmail.com>
>> wrote:
>> > Hello,
>> >
>> > I have a piece of code to force the materialization of RDDs in my Spark
>> > Streaming program, and I'm trying to understand which method is faster
>> > and
>> > has less memory consumption:
>> >
>> >   javaDStream.foreachRDD(new Function<JavaRDD<String>, Void>() {
>> >       @Override
>> >       public Void call(JavaRDD<String> stringJavaRDD) throws Exception {
>> >
>> >         //stringJavaRDD.collect();
>> >
>> >         // or count?
>> >
>> >         //stringJavaRDD.count();
>> >
>> >         return null;
>> >       }
>> >     });
>> >
>> >
>> > I've checked the source code of Spark at
>> >
>> > https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala,
>> > and see that collect() is defined as:
>> >
>> >   def collect(): Array[T] = {
>> >     val results = sc.runJob(this, (iter: Iterator[T]) => iter.toArray)
>> >     Array.concat(results: _*)
>> >   }
>> >
>> > and count() defined as:
>> >
>> >   def count(): Long = sc.runJob(this, Utils.getIteratorSize _).sum
>> >
>> > Therefore I think calling the count() method is faster and/or consumes
>> > less
>> > memory, but I wanted to be sure.
>> >
>> > Anyone care to comment?
>> >
>> >
>> > --
>> > Emre Sevinç
>> > http://www.bigindustries.be/
>> >
>
>
>
>
> --
> Emre Sevinc


