spark-issues mailing list archives

From "Jungtaek Lim (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-10816) EventTime based sessionization
Date Tue, 16 Oct 2018 09:45:00 GMT

    [ https://issues.apache.org/jira/browse/SPARK-10816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16651404#comment-16651404 ]

Jungtaek Lim commented on SPARK-10816:
--------------------------------------

Update: I've crafted another performance test that runs the same query against different data patterns.

[https://github.com/HeartSaVioR/iot-trucking-app-spark-structured-streaming/tree/benchmarking-SPARK-10816]

 

I've put the two data patterns in separate packages just for simplicity. The class names are the same.

Data pattern 1: plenty of rows in same session

[https://github.com/HeartSaVioR/iot-trucking-app-spark-structured-streaming/tree/benchmarking-SPARK-10816/src/main/scala/com/hortonworks/spark/benchmark/streaming/sessionwindow/plenty_of_rows_in_session]

Data pattern 2: plenty of sessions

[https://github.com/HeartSaVioR/iot-trucking-app-spark-structured-streaming/tree/benchmarking-SPARK-10816/src/main/scala/com/hortonworks/spark/benchmark/streaming/sessionwindow/plenty_of_sessions]

 

While running the benchmark with data pattern 2, I found some performance hits in my patch,
so I made some fixes as well. Most of the fixes reduce the amount of codegen, but there's
also a major one: pre-merging sessions in the local partition is now optional, because it
seriously harms performance with data pattern 2.
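
To illustrate how the two data patterns differ, here is a minimal sketch of gap-based session merging in plain Python. It is not the code from the patch: the names merge_sessions, to_sessions and GAP, the gap value, and the synthetic event times are all assumptions made up for this example, and a session is assumed to cover [event time, event time + gap) with an exclusive end.

```python
from typing import List, Tuple

# (start, end, row_count); end is exclusive and equals last event time + gap.
Session = Tuple[int, int, int]

GAP = 10  # hypothetical session gap, in arbitrary time units


def to_sessions(event_times: List[int], gap: int) -> List[Session]:
    """Turn each event into its own single-row session [t, t + gap)."""
    return [(t, t + gap, 1) for t in event_times]


def merge_sessions(sessions: List[Session]) -> List[Session]:
    """Merge sessions whose time ranges overlap, summing their row counts."""
    merged: List[Session] = []
    for start, end, count in sorted(sessions):
        if merged and start < merged[-1][1]:
            # Overlaps the previous session: extend it instead of adding a new one.
            prev_start, prev_end, prev_count = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end), prev_count + count)
        else:
            merged.append((start, end, count))
    return merged


# Data pattern 1 (assumed shape): plenty of rows in the same session.
dense_events = list(range(0, 100))          # events 1 unit apart
# Data pattern 2 (assumed shape): plenty of sessions.
sparse_events = list(range(0, 5000, 50))    # events 50 units apart

dense_merged = merge_sessions(to_sessions(dense_events, GAP))
sparse_merged = merge_sessions(to_sessions(sparse_events, GAP))

print(len(dense_events), "rows ->", len(dense_merged), "session(s)")
print(len(sparse_events), "rows ->", len(sparse_merged), "session(s)")
```

In this toy setup the dense input collapses into a single session, so a merge step before the shuffle would shrink the data a lot, while the sparse input stays at one session per row, so the same step would only add work without reducing anything. That is one plausible reading of why pre-merging hurts with data pattern 2; the actual cost in the patch depends on details not shown here.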

The patch's state handling is still sub-optimal. I guess it is now the major bottleneck in my
patch, so I'm wrapping my head around finding good alternatives. Baidu's list state would be
one of them, since I realized [3] might write more deltas as well as require more operations.

> EventTime based sessionization
> ------------------------------
>
>                 Key: SPARK-10816
>                 URL: https://issues.apache.org/jira/browse/SPARK-10816
>             Project: Spark
>          Issue Type: New Feature
>          Components: Structured Streaming
>            Reporter: Reynold Xin
>            Priority: Major
>         Attachments: SPARK-10816 Support session window natively.pdf, Session Window Support For Structure Streaming.pdf
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org

