spark-user mailing list archives

From Aureliano Buendia <>
Subject Re: Python API Performance
Date Sun, 02 Feb 2014 00:03:08 GMT
On Thu, Jan 30, 2014 at 7:51 PM, Evan R. Sparks <> wrote:

> If you just need basic matrix operations - Spark is dependent on JBlas
> to have access to quick linear
> algebra routines inside of MLlib and graphx. Jblas does a nice job of
> avoiding boxing/unboxing issues when calling out to blas, so it might be
> what you're looking for. The programming patterns you'll be able to support
> with jblas (matrix ops on local partitions) are very similar to what you'd
> get with numpy, etc.
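
To make the "matrix ops on local partitions" pattern concrete, here is a minimal sketch in plain Python. The function name `multiply_partition` and the sample data are made up for illustration, and the inner loop is pure Python only so the snippet runs without any library installed; with numpy available, the per-row work would presumably collapse into a single dot call on the whole partition.

```python
def multiply_partition(rows, weights):
    """Multiply every row vector in one partition by `weights`.

    `rows` is an iterator of equal-length lists (one partition's records);
    `weights` is a list of lists with len(weights) == len(row).
    Both names are hypothetical, chosen for this sketch.
    """
    # Transpose once per partition so each output cell is a plain dot product.
    columns = list(zip(*weights))
    for row in rows:
        yield [sum(x * w for x, w in zip(row, col)) for col in columns]


# Stand-in for the iterator Spark hands to a per-partition function.
# In PySpark this would be wired up roughly as
#   rdd.mapPartitions(lambda it: multiply_partition(it, weights))
# which is not executed here.
partition = iter([[1.0, 2.0], [3.0, 4.0]])
weights = [[1.0, 0.0, 2.0],
           [0.0, 1.0, 3.0]]
result = list(multiply_partition(partition, weights))
print(result)
```

The point of working per partition rather than per record is that any setup cost (here the transpose, in practice building a matrix object) is paid once per partition instead of once per row.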

jblas is not the top java matrix library when it comes to performance:

> I agree that the python libraries are more complete/feature rich, but if
> you really crave high performance then I'd recommend staying pure scala and
> giving jblas a try.
> On Thu, Jan 30, 2014 at 8:30 AM, nileshc <> wrote:
>> Hi there,
>> *Background:*
>> I need to do some matrix multiplication inside the mappers, and I'm trying
>> to choose between Python and Scala for writing the Spark MR jobs. I'm
>> equally fluent with Python and Java, and find Scala pretty easy too, for
>> what it's worth. Going with Python would let me use numpy + scipy, which is
>> blazing fast when compared to Java libraries like Colt etc. Configuring
>> Java with BLAS seems to be a pain when compared to scipy (direct apt-get
>> installs, or pip).
>> *Question:*
>> I posted a couple of comments on this answer at StackOverflow:
>> .
>> Basically it states that as of Spark 0.7.2, the Python API would be slower
>> than Scala. What's the performance scenario now? The fork issue seems to
>> be fixed. How about serialization? Can it match Java/Scala Writable-like
>> serialization (having knowledge of object type beforehand, reducing I/O)
>> performance? Also, a probably silly question - loops seem to be slow in
>> Python in general, do you think this can turn out to be an issue?
>> Bottom line: should I choose Python for computation-intensive algorithms
>> like PageRank? Scipy gives me an edge, but does the framework kill it?
>> Any help, insights, benchmarks will be much appreciated. :)
>> Cheers,
>> Nilesh
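
On the serialization question above, the difference between a generic serializer and a "type known beforehand" encoding can be seen with nothing but the Python standard library. This is only an illustration of the idea, not a description of what Spark does internally; the three-element sample list and the choice of pickle protocol 2 are assumptions made for the sketch.

```python
import pickle
import struct

values = [1.5, 2.5, 3.5]

# Generic path: pickle writes the container type plus a type tag per element,
# so the reader needs no prior knowledge of what is coming.
generic = pickle.dumps(values, protocol=2)

# Typed path: both sides agree beforehand on "three little-endian doubles",
# so only the raw 8-byte values are written.
typed = struct.pack("<3d", *values)

# Reading the typed form back requires the same format string.
decoded = list(struct.unpack("<3d", typed))
print(len(generic), len(typed))
```

The typed form is exactly 24 bytes (3 x 8), and the pickled form is larger because of the per-object tags. Whether that overhead matters in a real job depends on record size and on how much of the runtime is I/O in the first place.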
