spark-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Sean Owen (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-21401) add poll function for BoundedPriorityQueue
Date Mon, 17 Jul 2017 10:00:02 GMT

    [ https://issues.apache.org/jira/browse/SPARK-21401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16089581#comment-16089581
] 

Sean Owen commented on SPARK-21401:
-----------------------------------

Scratch that, I get the issue now. We can't remove the queue because that's the efficient
way to keep the top k. 

While it would be most efficient to get the queue's array and sort it directly, there seems
to be no good way to do that. Any change I try involves giving the type A a ClassTag in order
to allocate Arrays, and that causes other changes in the code, like in RDD.scala. The current
way this works involves inefficiently cloning the queue into an array, and that's why it's
slow. So poll() is actually faster, even with the reheaps.

> add poll function for BoundedPriorityQueue
> ------------------------------------------
>
>                 Key: SPARK-21401
>                 URL: https://issues.apache.org/jira/browse/SPARK-21401
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML, MLlib
>    Affects Versions: 2.3.0
>            Reporter: Peng Meng
>            Priority: Minor
>
> The most of BoundedPriorityQueue usages in ML/MLLIB are:
> Get the value of BoundedPriorityQueue, then sort it.
> For example, in Word2Vec: pq.toSeq.sortBy(-_._2)
> in ALS, pq.toArray.sorted()
> The test results show using pq.poll() is much faster than sort the value.
> For example, in PR https://github.com/apache/spark/pull/18624
> We get the sorted value of pq by the following code:
> {quote}
> var size = pq.size
> while(size > 0) {
> size -= 1
> val factor = pq.poll
> }{quote} 
> If using the generally used methods: pq.toArray.sorted() to get the sorted value of pq.
There is about 10% performance reduction. 
> It is good to add the poll function for BoundedPriorityQueue, since many usages of PQ
need  the sorted value.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org


Mime
View raw message