spark-reviews mailing list archives

From GitBox <...@apache.org>
Subject [GitHub] [spark] rahulsmahadev commented on a change in pull request #33336: [SPARK-36132][SS][SQL] Support initial state for batch mode of flatMapGroupsWithState
Date Tue, 20 Jul 2021 22:49:50 GMT

rahulsmahadev commented on a change in pull request #33336:
URL: https://github.com/apache/spark/pull/33336#discussion_r670821139



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FlatMapGroupsWithStateExec.scala
##########
@@ -404,3 +402,72 @@ case class FlatMapGroupsWithStateExec(
     copy(child = newLeft, initialState = newRight)
 }
 
+object FlatMapGroupsWithStateExec {
+
+  def foundDuplicateInitialKeyException(): Exception = {
+    throw new IllegalArgumentException("The initial state provided contained " +
+      "multiple rows(state) with the same key. Make sure to de-duplicate the " +
+      "initial state before passing it.")
+  }
+
+  /**
+   * Special handling for when the child relation is a batch relation.
+   * If the initial state is provided, we create an instance of the CoGroupExec, if the initial
+   * state is not provided we create an instance of the MapGroupsExec
+   */
+  // scalastyle:off argcount
+  def forBatch(
+      userFunc: (Any, Iterator[Any], LogicalGroupState[Any]) => Iterator[Any],
+      keyDeserializer: Expression,
+      valueDeserializer: Expression,
+      initialStateDeserializer: Expression,
+      groupingAttributes: Seq[Attribute],
+      initialStateGroupAttrs: Seq[Attribute],
+      dataAttributes: Seq[Attribute],
+      initialStateDataAttrs: Seq[Attribute],
+      outputObjAttr: Attribute,
+      timeoutConf: GroupStateTimeout,
+      hasInitialState: Boolean,
+      initialState: SparkPlan,
+      child: SparkPlan): SparkPlan = {
+    if (hasInitialState) {
+      val watermarkPresent = child.output.exists {
+        case a: Attribute if a.metadata.contains(EventTimeWatermark.delayKey) => true
+        case _ => false
+      }
+      val func = (keyRow: Any, values: Iterator[Any], states: Iterator[Any]) => {
+        // Check if there is only one state for every key.
+        var foundInitialStateForKey = false
+        val optionalState = states.map { initialState =>
+          if (foundInitialStateForKey) {
+            foundDuplicateInitialKeyException()
+          }
+          foundInitialStateForKey = true
+          initialState
+        }.toSeq
+
+        // Create group state object
+        val groupState = GroupStateImpl.createForStreaming(

Review comment:
createForStreaming is actually a misnomer. It just creates a GroupStateImpl object from the
parameters it is given: https://livegrep.dev.databricks.com/view/databricks/runtime/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/GroupStateImpl.scala#L202

createForBatch, in fact, creates a GroupStateImpl without the state.
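
To make the distinction concrete, here is a minimal sketch in plain Scala. It is not Spark's code: SketchGroupState, withState, withoutState, and singleInitialState are hypothetical names invented for illustration, and the real GroupStateImpl factories take more parameters (timeout configuration, watermark flag, and so on). The sketch only shows why a batch path that has an initial state would need the state-accepting factory, and it does the duplicate-key check eagerly, which is a choice of this sketch rather than something taken from the PR.

```scala
// Hypothetical stand-in for GroupStateImpl; the real class carries more fields.
final case class SketchGroupState[S](value: Option[S])

object SketchGroupState {
  // Analogue of the state-accepting factory (the role createForStreaming plays in the diff).
  def withState[S](initial: Option[S]): SketchGroupState[S] = SketchGroupState(initial)

  // Analogue of the state-less factory (the role the comment attributes to createForBatch).
  def withoutState[S]: SketchGroupState[S] = SketchGroupState(None)
}

object InitialStateSketch {
  // Returns the single initial state for a key, if there is one.
  // Throws if the iterator yields more than one state for the same key.
  def singleInitialState[S](states: Iterator[S]): Option[S] = {
    if (!states.hasNext) {
      None
    } else {
      val first = states.next()
      if (states.hasNext) {
        throw new IllegalArgumentException(
          "The initial state provided contained multiple rows(state) with the same key.")
      }
      Some(first)
    }
  }

  def main(args: Array[String]): Unit = {
    // Batch run with an initial state: the state has to be handed to the factory.
    val seeded = SketchGroupState.withState(singleInitialState(Iterator(42)))
    // Batch run without an initial state: nothing to hand over.
    val empty = SketchGroupState.withoutState[Int]
    println(seeded.value) // prints Some(42)
    println(empty.value)  // prints None
  }
}
```

Under these assumptions, the only factory that can carry the de-duplicated initial state is the one that accepts a value, regardless of whether its name mentions streaming.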
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org




