beam-commits mailing list archives

From "Thomas Groh (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (BEAM-839) The DirectRunner slows down significantly as the number of keys increases
Date Thu, 27 Oct 2016 16:32:58 GMT

    [ https://issues.apache.org/jira/browse/BEAM-839?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15612429#comment-15612429 ]

Thomas Groh commented on BEAM-839:
----------------------------------

The most severe issue is that {{WatermarkManager.SynchronizedProcessingTimeInputWatermark}}
and {{WatermarkManager.PerKeyHolds}} both use a {{PriorityQueue}} to track the minimum
hold. Arbitrary holds (not only the minimum) can be removed as work completes, and
{{PriorityQueue#remove}} is an {{O(n)}} operation, so the total time to remove {{h}} holds
is {{O(h**2)}}. Replacing the {{PriorityQueue}} with a {{NavigableSet}} reduces this cost
to {{O(h log h)}}.

This replacement is made in https://github.com/apache/incubator-beam/pull/1202
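
As a rough illustration of the replacement described above (not the code from the pull request), the sketch below tracks the minimum hold in a {{TreeSet}}, which is one {{NavigableSet}} implementation. The names {{MinHoldTracker}} and {{Hold}}, the use of a millisecond timestamp plus a key as the hold, and the {{Long.MAX_VALUE}} sentinel for "no holds" are all assumptions made for the example. The key acts as a tie-breaker because a sorted set drops elements that compare as equal.

```java
import java.util.Comparator;
import java.util.NavigableSet;
import java.util.TreeSet;

/** Hypothetical sketch: tracks the minimum hold with a sorted set instead of a heap. */
final class MinHoldTracker {

    /** Hypothetical hold: a timestamp attached to a key. Not a Beam class. */
    static final class Hold {
        final String key;
        final long timestampMillis;

        Hold(String key, long timestampMillis) {
            this.key = key;
            this.timestampMillis = timestampMillis;
        }
    }

    // Order by timestamp first, then by key, so that two keys holding the
    // same timestamp remain distinct elements of the set.
    private static final Comparator<Hold> ORDER =
        Comparator.comparingLong((Hold h) -> h.timestampMillis)
            .thenComparing(h -> h.key);

    private final NavigableSet<Hold> holds = new TreeSet<>(ORDER);

    /** Adds a hold in O(log n). */
    void add(String key, long timestampMillis) {
        holds.add(new Hold(key, timestampMillis));
    }

    /**
     * Removes an arbitrary hold in O(log n). PriorityQueue#remove(Object)
     * would scan the backing array linearly for the same operation.
     */
    boolean remove(String key, long timestampMillis) {
        return holds.remove(new Hold(key, timestampMillis));
    }

    /** Returns the smallest held timestamp, or Long.MAX_VALUE if there are no holds. */
    long minHold() {
        return holds.isEmpty() ? Long.MAX_VALUE : holds.first().timestampMillis;
    }
}
```

With this structure, adding {{h}} holds and later removing each of them costs {{O(h log h)}} in total, and reading the minimum is a lookup of the first element.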

A pair of additional issues has also been added to the description of this issue.

> The DirectRunner slows down significantly as the number of keys increases
> -------------------------------------------------------------------------
>
>                 Key: BEAM-839
>                 URL: https://issues.apache.org/jira/browse/BEAM-839
>             Project: Beam
>          Issue Type: Bug
>            Reporter: Thomas Groh
>            Assignee: Thomas Groh
>
> For example, running WordCount on KingLear takes approximately 10 seconds, while running
> WordCount on all of Shakespeare takes approximately 5 minutes. Most of this time is spent
> with the transforms unable to make progress, as the time is spent updating the minimum hold.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
