spark-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Jungtaek Lim (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (SPARK-10816) EventTime based sessionization
Date Fri, 02 Nov 2018 02:50:00 GMT

    [ https://issues.apache.org/jira/browse/SPARK-10816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16672478#comment-16672478
] 

Jungtaek Lim edited comment on SPARK-10816 at 11/2/18 2:49 AM:
---------------------------------------------------------------

UPDATE: I just discovered the performance critical issue on my patch (I guess same issue is
occurring on Baidu's patch) and just fixed yesterday.

1. Plenty Of Sessions In Key / Append Mode (rate: 500,000)

* HWX, Linked List version: max around 250,000
* HWX, Latest: max around 235,000
* flatMapGroupsWithState (with my state func. implementation): max around 250,000

They're showing CPU being maxed out. (Ran with local[3], and 3 cores of CPU are all 100% user.)
When I increase rate to 1,000,000 I observed max rate would go up to around 350,000 for linked
list version.

2. Plenty Of Keys / Append Mode (rate: 5,000,000)

* HWX, Linked List version: max around 3,700,000
* HWX, Latest: max around 4,000,000
* flatMapGroupsWithState (with my state func. implementation): max around 2,100,000

3. Plenty Of Rows In Session / Append Mode (rate: 50000)

* HWX, Linked List version: max around 27,000
* HWX, Latest: max around 36,000
* flatMapGroupsWithState (with my state func. implementation): max around 26,000

Please note that Linked list version doesn't materialize all sessions (even in given key)
into memory, so it gets rid of concern on loading sessions in memory. It also shows close
or even better performance than manual implementation of flatMapGroupsWithState. Linked list
version sometimes faster than latest version (load all sessions in group memory) and sometimes
slower.

I'll update my patch against linked list version. I guess there's no performance issue now,
now waiting for committers' review.


was (Author: kabhwan):
UPDATE: I just discovered the performance critical issue on my patch (I guess same issue is
occurring on Baidu's patch) and just fixed yesterday.

1. Plenty Of Sessions In Key / Append Mode (rate: 500,000)

* HWX, Linked List version: max around 250,000
* HWX, Latest: max around 235,000
* flatMapGroupsWithState (with my state func. implementation): max around 250,000

They're showing CPU being maxed out. (Ran with local[3], and 3 cores of CPU are all 100% user.)
When I increase rate to 1,000,000 I observed max rate would go up to around 350,000 for linked
list version.

2. Plenty Of Keys / Append Mode (rate: 5,000,000)

* HWX, Linked List version: max around 3,700,000
* HWX, Latest: max around 4,000,000
* flatMapGroupsWithState (with my state func. implementation): max around 2,100,000

3. Plenty Of Rows In Session / Append Mode (rate: 50000)

* HWX, Linked List version: max around 27,000
* HWX, Latest: max around 36,000
* flatMapGroupsWithState (with my state func. implementation): max around 26,000

Please note that Linked list version doesn't materialize all sessions (even in given key)
into memory, so it shows close or even better performance than manual implementation of flatMapGroupsWithState,
as well as get rid of concern on loading sessions in memory. Linked list version sometimes
faster than latest version (load all sessions in group memory) and sometimes slower.

I'll update my patch against linked list version. I guess there's no performance issue now,
now waiting for committers' review.

> EventTime based sessionization
> ------------------------------
>
>                 Key: SPARK-10816
>                 URL: https://issues.apache.org/jira/browse/SPARK-10816
>             Project: Spark
>          Issue Type: New Feature
>          Components: Structured Streaming
>            Reporter: Reynold Xin
>            Priority: Major
>         Attachments: SPARK-10816 Support session window natively.pdf, Session Window
Support For Structure Streaming.pdf
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org


Mime
View raw message