spark-user mailing list archives

From: Saba Sehrish <ssehr...@fnal.gov>
Subject: Top, takeOrdered, sortByKey
Date: Mon, 09 Mar 2015 21:21:39 GMT


From: Saba Sehrish <ssehrish@fnal.gov>
Date: March 9, 2015 at 4:11:07 PM CDT
To: <user-faq@spark.apache.org>
Subject: Using top, takeOrdered, sortByKey

I am using Spark for a template matching problem. We have 77 million events in the template
library, and we compare the energy of each input event with that of each template event and
return a score. In the end we return the best 10000 matches, i.e. those with the lowest scores.
A score of 0 is a perfect match.
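
Roughly, the scoring step looks like this (simplified; the (templateId, energy) layout of the
template RDD and the absolute-energy-difference score are just placeholders for illustration,
not the actual physics):

import java.util.ArrayList;
import java.util.List;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

SparkConf conf = new SparkConf().setAppName("template-matching");
JavaSparkContext sc = new JavaSparkContext(conf);

// Toy template library standing in for the real events: (templateId, energy) pairs.
List<Tuple2<Integer, Double>> library = new ArrayList<>();
for (int i = 0; i < 50000; i++) {
    library.add(new Tuple2<>(i, Math.random() * 100.0));
}
JavaPairRDD<Integer, Double> templates = sc.parallelizePairs(library);

// Energy of the single input event being matched against every template.
final double inputEnergy = 42.0;

// One (templateId, score) pair per template event; a score of 0 is a perfect match.
JavaPairRDD<Integer, Double> scores =
        templates.mapValues(templateEnergy -> Math.abs(templateEnergy - inputEnergy));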

I down-sampled the problem to use only 50k events. For a single event matched against all
the events in the template library (50k) I see 150-200 ms for the score calculation on 25 cores
(using a YARN cluster), but after that, when I perform top, takeOrdered, or even sortByKey,
the time jumps to 25-50 s.
So far I have not been able to figure out why there is such a huge gap between computing the
list of scores and getting the top 1000 scores, or why sorting / taking the best X matches
dominates by such a large factor. One thing I have noticed is that it doesn't matter how many
elements I return: the time stays in the same 25-50 s range whether I ask for 10 or 10000 elements.
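
To show what I am timing, here is a simplified sketch of the two phases (the exact measurement
code is assumed; it just brackets each step with System.nanoTime, reusing the scores RDD from
the sketch above and the TupleComparator shown further below):

long t0 = System.nanoTime();

// Phase 1: build the (templateId, score) RDD for one input event.
JavaPairRDD<Integer, Double> scores =
        templates.mapValues(templateEnergy -> Math.abs(templateEnergy - inputEnergy));

long t1 = System.nanoTime();   // score-calculation time

// Phase 2: extract the best matches.
int numbestmatches = 10000;
List<Tuple2<Integer, Double>> bestscores_list =
        scores.takeOrdered(numbestmatches, new TupleComparator());

long t2 = System.nanoTime();   // top/takeOrdered time

System.out.printf("scores: %.1f ms, takeOrdered: %.1f ms%n",
        (t1 - t0) / 1e6, (t2 - t1) / 1e6);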

Any suggestions? Am I not using the API properly?

scores is a JavaPairRDD<Integer, Double>, and I do something like the following, where
numbestmatches is 10, 100, 10000, or any number:

List<Tuple2<Integer, Double>> bestscores_list =
        scores.takeOrdered(numbestmatches, new TupleComparator());

Or

List<Tuple2<Integer, Double>> bestscores_list =
        scores.top(numbestmatches, new TupleComparator());

Or

// sortByKey returns an RDD, so a collect (or take) is needed to get a List
List<Tuple2<Integer, Double>> bestscores_list =
        scores.sortByKey().collect();
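
Here TupleComparator is roughly the following (a sketch: it orders the (id, score) tuples by
ascending score so that takeOrdered returns the lowest-score, i.e. best, matches; it implements
Serializable so Spark can ship it to the executors):

import java.io.Serializable;
import java.util.Comparator;
import scala.Tuple2;

public class TupleComparator implements Comparator<Tuple2<Integer, Double>>, Serializable {
    @Override
    public int compare(Tuple2<Integer, Double> a, Tuple2<Integer, Double> b) {
        // Ascending by score: smallest (best) scores come first.
        return Double.compare(a._2(), b._2());
    }
}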
