From reviews-return-614327-archive-asf-public=cust-asf.ponee.io@spark.apache.org Fri Feb 9 00:08:46 2018
From: tdas
To: reviews@spark.apache.org
Reply-To: reviews@spark.apache.org
Subject: [GitHub] spark pull request #20552: [SPARK-23099][SS] Migrate foreach sink to DataSou...
Message-Id: <20180208230843.8C407DFC32@git1-us-west.apache.org>
Date: Thu, 8 Feb 2018 23:08:43 +0000 (UTC)

Github user tdas commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20552#discussion_r167076621

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/ForeachSink.scala ---
@@ -17,52 +17,119 @@
 package org.apache.spark.sql.execution.streaming
 
+import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}
+
 import org.apache.spark.TaskContext
-import org.apache.spark.sql.{DataFrame, Encoder, ForeachWriter}
-import org.apache.spark.sql.catalyst.encoders.encoderFor
+import org.apache.spark.sql._
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.catalyst.encoders.{encoderFor, ExpressionEncoder}
+import org.apache.spark.sql.execution.streaming.continuous.ContinuousExecution
+import org.apache.spark.sql.sources.v2.{DataSourceOptions, StreamWriteSupport}
+import org.apache.spark.sql.sources.v2.writer._
+import org.apache.spark.sql.sources.v2.writer.streaming.StreamWriter
+import org.apache.spark.sql.streaming.OutputMode
+import org.apache.spark.sql.types.StructType
 
-/**
- * A [[Sink]] that forwards all data into [[ForeachWriter]] according to the contract defined by
- * [[ForeachWriter]].
- *
- * @param writer The [[ForeachWriter]] to process all data.
- * @tparam T The expected type of the sink.
- */
-class ForeachSink[T : Encoder](writer: ForeachWriter[T]) extends Sink with Serializable {
-
-  override def addBatch(batchId: Long, data: DataFrame): Unit = {
-    // This logic should've been as simple as:
-    // ```
-    //   data.as[T].foreachPartition { iter => ... }
-    // ```
-    //
-    // Unfortunately, doing that would just break the incremental planing. The reason is,
-    // `Dataset.foreachPartition()` would further call `Dataset.rdd()`, but `Dataset.rdd()` will
-    // create a new plan. Because StreamExecution uses the existing plan to collect metrics and
-    // update watermark, we should never create a new plan. Otherwise, metrics and watermark are
-    // updated in the new plan, and StreamExecution cannot retrieval them.
-    //
-    // Hence, we need to manually convert internal rows to objects using encoder.
+
+case class ForeachWriterProvider[T: Encoder](writer: ForeachWriter[T]) extends StreamWriteSupport {
+  override def createStreamWriter(
+      queryId: String,
+      schema: StructType,
+      mode: OutputMode,
+      options: DataSourceOptions): StreamWriter = {
     val encoder = encoderFor[T].resolveAndBind(
-      data.logicalPlan.output,
-      data.sparkSession.sessionState.analyzer)
-    data.queryExecution.toRdd.foreachPartition { iter =>
-      if (writer.open(TaskContext.getPartitionId(), batchId)) {
-        try {
-          while (iter.hasNext) {
-            writer.process(encoder.fromRow(iter.next()))
-          }
-        } catch {
-          case e: Throwable =>
-            writer.close(e)
-            throw e
-        }
-        writer.close(null)
-      } else {
-        writer.close(null)
+      schema.toAttributes,
+      SparkSession.getActiveSession.get.sessionState.analyzer)
+    ForeachInternalWriter(writer, encoder)
+  }
+}
+
+case class ForeachInternalWriter[T: Encoder](
+    writer: ForeachWriter[T], encoder: ExpressionEncoder[T])
--- End diff --

nit: params on different lines
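
As a hedged illustration of the formatting the nit asks for (made-up names only; the actual ForeachInternalWriter body is elided in the hunk above), each constructor parameter goes on its own line with the four-space continuation indent already used elsewhere in the hunk:

```scala
// Hypothetical example; ExampleStreamWriter and its fields are invented names,
// not the patch's actual definition.
case class ExampleStreamWriter[T](
    writer: T => Unit,
    encoderName: String,
    batchId: Long)
```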
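
For context on the user-facing contract this sink adapts, a minimal, self-contained sketch (not part of the patch; it assumes a local SparkSession and the built-in rate source): open is called once per partition per batch, process once per row, and close once at the end with the failure or null, mirroring the open/process/close loop visible in the removed addBatch code above.

```scala
import org.apache.spark.sql.{ForeachWriter, SparkSession}

object ForeachSinkSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("foreach-sink-sketch")
      .master("local[2]")
      .getOrCreate()
    import spark.implicits._

    // Toy source: the built-in "rate" source emits (timestamp, value) rows.
    val values = spark.readStream.format("rate").load().select($"value".as[Long])

    val query = values.writeStream
      .foreach(new ForeachWriter[Long] {
        // Returning false skips this partition for the batch; close is still called.
        override def open(partitionId: Long, version: Long): Boolean = true

        // Called once per row in the partition.
        override def process(value: Long): Unit = println(value)

        // Receives the exception if processing failed, or null on success.
        override def close(errorOrNull: Throwable): Unit = ()
      })
      .start()

    query.awaitTermination()
  }
}
```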