spark-reviews mailing list archives

From marmbrus <...@git.apache.org>
Subject [GitHub] spark pull request: [SPARK-5182] [SQL] Partitioning support for th...
Date Sat, 09 May 2015 18:41:17 GMT
Github user marmbrus commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5526#discussion_r29995224
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala ---
    @@ -207,3 +246,218 @@ trait InsertableRelation {
     trait CatalystScan {
       def buildScan(requiredColumns: Seq[Attribute], filters: Seq[Expression]): RDD[Row]
     }
    +
    +/**
    + * ::Experimental::
    + * [[OutputWriter]] is used together with [[FSBasedRelation]] for persisting rows to the
    + * underlying file system.  Subclasses of [[OutputWriter]] must provide a zero-argument constructor.
    + * An [[OutputWriter]] instance is created and initialized when a new output file is opened on the
    + * executor side.  This instance is used to persist rows to this single output file.
    + */
    +@Experimental
    +abstract class OutputWriter {
    +  /**
    +   * Initializes this [[OutputWriter]] before any rows are persisted.
    +   *
    +   * @param path Path of the file to which this [[OutputWriter]] is supposed to write.  Note that
    +   *        this may not point to the final output file.  For example, `FileOutputFormat` writes to
    +   *        temporary directories and then merges written files back to the final destination.  In
    +   *        this case, `path` points to a temporary output file under the temporary directory.
    +   * @param dataSchema Schema of the rows to be written.  Partition columns are not included in the
    +   *        schema if the corresponding relation is partitioned.
    +   * @param context The Hadoop MapReduce task context.
    +   */
    +  def init(
    +      path: String,
    +      dataSchema: StructType,
    +      context: TaskAttemptContext): Unit = ()
    +
    +  /**
    +   * Persists a single row.  Invoked on the executor side.  When writing to dynamically partitioned
    +   * tables, dynamic partition columns are not included in rows to be written.
    +   */
    +  def write(row: Row): Unit
    +
    +  /**
    +   * Closes the [[OutputWriter]].  Invoked on the executor side after all rows are persisted, before
    +   * the task output is committed.
    +   */
    +  def close(): Unit = ()
    --- End diff --
    
    Should we have a default implementation here?  Or do you think all sources will need to close?
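
    For context, a minimal sketch (not part of the patch; the class name is hypothetical) of what a concrete `OutputWriter` subclass could look like under the init/write/close contract quoted above.  It writes each row as one comma-separated text line and omits escaping and error handling:

        import java.io.{OutputStreamWriter, Writer}

        import org.apache.hadoop.fs.Path
        import org.apache.hadoop.mapreduce.TaskAttemptContext

        import org.apache.spark.sql.Row
        import org.apache.spark.sql.types.StructType

        // Hypothetical illustration only, not part of the patch under review.
        class SimpleTextOutputWriter extends OutputWriter {
          private var writer: Writer = _

          override def init(
              path: String,
              dataSchema: StructType,
              context: TaskAttemptContext): Unit = {
            // `path` may point at a temporary file chosen by the output
            // committer rather than the final destination (see the Scaladoc).
            val file = new Path(path)
            val fs = file.getFileSystem(context.getConfiguration)
            writer = new OutputStreamWriter(fs.create(file), "UTF-8")
          }

          override def write(row: Row): Unit = {
            // For dynamically partitioned tables, partition columns have
            // already been stripped from `row` by the framework.
            writer.write(row.toSeq.mkString(","))
            writer.write("\n")
          }

          override def close(): Unit = {
            // Release the stream before the task output is committed.
            // Guard against init() never having been called.
            if (writer != null) writer.close()
          }
        }

    Whether `close()` keeps its no-op default or becomes abstract mostly determines whether subclasses like this one are forced to spell out their cleanup; any source that buffers or holds a file handle will need a real `close()` either way.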

