spark-user mailing list archives

From Bibudh Lahiri <bibudhlah...@gmail.com>
Subject Can this performance be improved?
Date Thu, 14 Apr 2016 21:21:24 GMT
Hi,
    As part of a larger program, I am extracting the distinct values of
some columns of an RDD with 100 million records and 4 columns. I am running
Spark in standalone cluster mode on my laptop (2.3 GHz Intel Core i7, 10 GB
1333 MHz DDR3 RAM), with all 8 cores given to a single worker. My
statement is something like this:

age_groups = patients_rdd.map(lambda x: x.split(",")).map(lambda x: x[1]).distinct()

   It is taking about 3.8 minutes. It is spawning 89 tasks for this RDD
because (I guess) the block size is 32 MB and the entire file is 2.8 GB,
so there are 2.8 * 1024 / 32 ≈ 90 blocks. At ~3.8 minutes (~228 seconds),
that is 100M / 228 ≈ 440k records per second overall, or about 55k records
per second per core/task.
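
For reference, a minimal runnable version of the same job that also checks
the partition count behind the 89 tasks ("patients.csv" is a placeholder for
the real input path; getNumPartitions() is standard PySpark):

from pyspark import SparkContext

sc = SparkContext(appName="distinct-age-groups")

# "patients.csv" stands in for the actual 2.8 GB input file.
patients_rdd = sc.textFile("patients.csv")

# Confirm the partition count that determines the number of tasks.
print(patients_rdd.getNumPartitions())

# Same computation folded into a single map; the two chained maps are
# pipelined within one stage anyway, so this mostly saves one lambda
# call per record rather than a pass over the data.
age_groups = patients_rdd.map(lambda line: line.split(",")[1]).distinct()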

Does this performance look typical or is there room for improvement?

Thanks
Bibudh



-- 
Bibudh Lahiri
Data Scientist, Impetus Technologies
5300 Stevens Creek Blvd
San Jose, CA 95129
http://knowthynumbers.blogspot.com/
