From: rxin@apache.org
To: commits@spark.apache.org
Message-Id: <4022cd8af79145649b54c726ef5f18ae@git.apache.org>
Subject: spark git commit: [SPARK-13425][SQL] Documentation for CSV datasource options
Date: Mon, 2 May 2016 02:05:36 +0000 (UTC)

Repository: spark
Updated Branches:
  refs/heads/branch-2.0 a6428292f -> 705172202


[SPARK-13425][SQL] Documentation for CSV datasource options

## What changes were proposed in this pull request?

This PR adds the explanation and documentation for CSV options for reading
and writing.

## How was this patch tested?

Style tests with `./dev/run_tests` for documentation style.

Author: hyukjinkwon
Author: Hyukjin Kwon

Closes #12817 from HyukjinKwon/SPARK-13425.

(cherry picked from commit a832cef11233c6357c7ba7ede387b432e6b0ed71)
Signed-off-by: Reynold Xin


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/70517220
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/70517220
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/70517220

Branch: refs/heads/branch-2.0
Commit: 7051722023b98f1720142c7b3b41948d275ea455
Parents: a642829
Author: hyukjinkwon
Authored: Sun May 1 19:05:20 2016 -0700
Committer: Reynold Xin
Committed: Sun May 1 19:05:32 2016 -0700

----------------------------------------------------------------------
 python/pyspark/sql/readwriter.py                | 52 ++++++++++++++++++++
 .../org/apache/spark/sql/DataFrameReader.scala  | 47 ++++++++++++++++--
 .../org/apache/spark/sql/DataFrameWriter.scala  |  8 +++
 3 files changed, 103 insertions(+), 4 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/spark/blob/70517220/python/pyspark/sql/readwriter.py
----------------------------------------------------------------------
diff --git a/python/pyspark/sql/readwriter.py b/python/pyspark/sql/readwriter.py
index ed9e716..cc5e93d 100644
--- a/python/pyspark/sql/readwriter.py
+++ b/python/pyspark/sql/readwriter.py
@@ -282,6 +282,45 @@ class DataFrameReader(object):

         :param paths: string, or list of strings, for input path(s).
+        You can set the following CSV-specific options to deal with CSV files:
+            * ``sep`` (default ``,``): sets the single character as a separator \
+                for each field and value.
+            * ``charset`` (default ``UTF-8``): decodes the CSV files by the given \
+                encoding type.
+            * ``quote`` (default ``"``): sets the single character used for escaping \
+                quoted values where the separator can be part of the value.
+            * ``escape`` (default ``\``): sets the single character used for escaping quotes \
+                inside an already quoted value.
+            * ``comment`` (default empty string): sets the single character used for skipping \
+                lines beginning with this character. By default, it is disabled.
+            * ``header`` (default ``false``): uses the first line as names of columns.
+            * ``ignoreLeadingWhiteSpace`` (default ``false``): defines whether or not leading \
+                whitespaces from values being read should be skipped.
+            * ``ignoreTrailingWhiteSpace`` (default ``false``): defines whether or not trailing \
+                whitespaces from values being read should be skipped.
+            * ``nullValue`` (default empty string): sets the string representation of a null value.
+            * ``nanValue`` (default ``NaN``): sets the string representation of a non-number \
+                value.
+            * ``positiveInf`` (default ``Inf``): sets the string representation of a positive \
+                infinity value.
+            * ``negativeInf`` (default ``-Inf``): sets the string representation of a negative \
+                infinity value.
+            * ``dateFormat`` (default ``None``): sets the string that indicates a date format. \
+                Custom date formats follow the formats at ``java.text.SimpleDateFormat``. This \
+                applies to both date type and timestamp type. By default, it is ``None`` which \
+                means trying to parse times and date by ``java.sql.Timestamp.valueOf()`` and \
+                ``java.sql.Date.valueOf()``.
+            * ``maxColumns`` (default ``20480``): defines a hard limit of how many columns \
+                a record can have.
+            * ``maxCharsPerColumn`` (default ``1000000``): defines the maximum number of \
+                characters allowed for any given value being read.
+            * ``mode`` (default ``PERMISSIVE``): allows a mode for dealing with corrupt records \
+                during parsing.
+                * ``PERMISSIVE`` : sets other fields to ``null`` when it meets a corrupted \
+                    record. When a schema is set by user, it sets ``null`` for extra fields.
+                * ``DROPMALFORMED`` : ignores the whole corrupted records.
+                * ``FAILFAST`` : throws an exception when it meets corrupted records.
+
         >>> df = sqlContext.read.csv('python/test_support/sql/ages.csv')
         >>> df.dtypes
         [('C0', 'string'), ('C1', 'string')]
@@ -663,6 +702,19 @@ class DataFrameWriter(object):
                 known case-insensitive shorten names (none, bzip2, gzip, lz4,
                 snappy and deflate).

+        You can set the following CSV-specific options to deal with CSV files:
+            * ``sep`` (default ``,``): sets the single character as a separator \
+                for each field and value.
+            * ``quote`` (default ``"``): sets the single character used for escaping \
+                quoted values where the separator can be part of the value.
+            * ``escape`` (default ``\``): sets the single character used for escaping quotes \
+                inside an already quoted value.
+            * ``header`` (default ``false``): writes the names of columns as the first line.
+            * ``nullValue`` (default empty string): sets the string representation of a null value.
+            * ``compression``: compression codec to use when saving to file. This can be one of \
+                the known case-insensitive shorten names (none, bzip2, gzip, lz4, snappy and \
+                deflate).
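----------------------------------------------------------------------
To make the reader and writer options above concrete, here is a minimal
PySpark sketch (illustrative only, not part of the patch; the path,
separator and null marker are hypothetical, and an active `sqlContext`
is assumed as in the doctests). The options are passed through
`option()` before `csv()` is called:

    # Read a semicolon-separated file with a header row,
    # treating the literal string "NA" as null.
    df = (sqlContext.read
          .option('sep', ';')
          .option('header', 'true')
          .option('nullValue', 'NA')
          .csv('/tmp/people.csv'))

    # Write it back using the writer-side options, gzip-compressed.
    (df.write
       .option('sep', ';')
       .option('header', 'true')
       .option('nullValue', 'NA')
       .option('compression', 'gzip')
       .csv('/tmp/people-out'))
----------------------------------------------------------------------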
+
         >>> df.write.csv(os.path.join(tempfile.mkdtemp(), 'data'))
         """
         self.mode(mode)


http://git-wip-us.apache.org/repos/asf/spark/blob/70517220/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala
----------------------------------------------------------------------
diff --git a/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala b/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala
index 3d43f20..2d4a68f 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala
@@ -290,7 +290,7 @@ class DataFrameReader private[sql](sparkSession: SparkSession) extends Logging {
    * <li>`allowNumericLeadingZeros` (default `false`): allows leading zeros in numbers
    * (e.g. 00012)</li>
    * <li>`mode` (default `PERMISSIVE`): allows a mode for dealing with corrupt records
-   * during parsing.<li>
+   * during parsing.</li>
    * <ul>
    *  <li>`PERMISSIVE` : sets other fields to `null` when it meets a corrupted record, and puts the
    *   malformed string into a new field configured by `columnNameOfCorruptRecord`. When
@@ -300,7 +300,7 @@ class DataFrameReader private[sql](sparkSession: SparkSession) extends Logging {
    *  </ul>
    * <li>`columnNameOfCorruptRecord` (default `_corrupt_record`): allows renaming the new field
    *   having malformed string created by `PERMISSIVE` mode. This overrides
-   * `spark.sql.columnNameOfCorruptRecord`.<li>
+   * `spark.sql.columnNameOfCorruptRecord`.</li>
    *
    * @since 1.4.0
    */
@@ -326,7 +326,7 @@ class DataFrameReader private[sql](sparkSession: SparkSession) extends Logging {
    * <li>`allowBackslashEscapingAnyCharacter` (default `false`): allows accepting quoting of all
    * character using backslash quoting mechanism</li>
    * <li>`mode` (default `PERMISSIVE`): allows a mode for dealing with corrupt records
-   * during parsing.<li>
+   * during parsing.</li>
    * <ul>
    *  <li>`PERMISSIVE` : sets other fields to `null` when it meets a corrupted record, and puts the
    *   malformed string into a new field configured by `columnNameOfCorruptRecord`. When
@@ -336,7 +336,7 @@ class DataFrameReader private[sql](sparkSession: SparkSession) extends Logging {
    *  </ul>
    * <li>`columnNameOfCorruptRecord` (default `_corrupt_record`): allows renaming the new field
    *   having malformed string created by `PERMISSIVE` mode. This overrides
-   * `spark.sql.columnNameOfCorruptRecord`.<li>
+   * `spark.sql.columnNameOfCorruptRecord`.</li>
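----------------------------------------------------------------------
As a sketch of how the `mode` and `columnNameOfCorruptRecord` options
documented above interact (illustrative only, shown in PySpark; the path
and the renamed column are hypothetical):

    # PERMISSIVE (the default) sets fields it cannot parse to null and
    # keeps the raw text of a corrupted record in the column named by
    # columnNameOfCorruptRecord.
    df = (sqlContext.read
          .option('mode', 'PERMISSIVE')
          .option('columnNameOfCorruptRecord', '_bad_record')
          .json('/tmp/records.json'))

    # FAILFAST raises an error on the first corrupted record instead.
    strict = sqlContext.read.option('mode', 'FAILFAST').json('/tmp/records.json')
----------------------------------------------------------------------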
    *
    * @since 1.6.0
    */
@@ -393,6 +393,45 @@ class DataFrameReader private[sql](sparkSession: SparkSession) extends Logging {
    * This function goes through the input once to determine the input schema. To avoid going
    * through the entire data once, specify the schema explicitly using [[schema]].
    *
+   * You can set the following CSV-specific options to deal with CSV files:
+   * <li>`sep` (default `,`): sets the single character as a separator for each
+   * field and value.</li>
+   * <li>`encoding` (default `UTF-8`): decodes the CSV files by the given encoding
+   * type.</li>
+   * <li>`quote` (default `"`): sets the single character used for escaping quoted values where
+   * the separator can be part of the value.</li>
+   * <li>`escape` (default `\`): sets the single character used for escaping quotes inside
+   * an already quoted value.</li>
+   * <li>`comment` (default empty string): sets the single character used for skipping lines
+   * beginning with this character. By default, it is disabled.</li>
+   * <li>`header` (default `false`): uses the first line as names of columns.</li>
+   * <li>`ignoreLeadingWhiteSpace` (default `false`): defines whether or not leading whitespaces
+   * from values being read should be skipped.</li>
+   * <li>`ignoreTrailingWhiteSpace` (default `false`): defines whether or not trailing
+   * whitespaces from values being read should be skipped.</li>
+   * <li>`nullValue` (default empty string): sets the string representation of a null value.</li>
+   * <li>`nanValue` (default `NaN`): sets the string representation of a non-number value.</li>
+   * <li>`positiveInf` (default `Inf`): sets the string representation of a positive infinity
+   * value.</li>
+   * <li>`negativeInf` (default `-Inf`): sets the string representation of a negative infinity
+   * value.</li>
+   * <li>`dateFormat` (default `null`): sets the string that indicates a date format. Custom date
+   * formats follow the formats at `java.text.SimpleDateFormat`. This applies to both date type
+   * and timestamp type. By default, it is `null` which means trying to parse times and date by
+   * `java.sql.Timestamp.valueOf()` and `java.sql.Date.valueOf()`.</li>
+   * <li>`maxColumns` (default `20480`): defines a hard limit of how many columns
+   * a record can have.</li>
+   * <li>`maxCharsPerColumn` (default `1000000`): defines the maximum number of characters allowed
+   * for any given value being read.</li>
+   * <li>`mode` (default `PERMISSIVE`): allows a mode for dealing with corrupt records
+   * during parsing.</li>
+   * <ul>
+   *  <li>`PERMISSIVE` : sets other fields to `null` when it meets a corrupted record. When
+   *   a schema is set by user, it sets `null` for extra fields.</li>
+   *  <li>`DROPMALFORMED` : ignores the whole corrupted records.</li>
+   *  <li>`FAILFAST` : throws an exception when it meets corrupted records.</li>
+   * </ul>
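----------------------------------------------------------------------
A short PySpark sketch of the parse modes and date handling documented
above (illustrative only; the schema, path and date pattern are
hypothetical, and the same option() calls work on the Scala API):

    from pyspark.sql.types import StructType, StructField, StringType, DateType

    schema = StructType([
        StructField('name', StringType(), True),
        StructField('born', DateType(), True)])

    # With an explicit schema, PERMISSIVE nulls out fields it cannot parse,
    # DROPMALFORMED drops the whole malformed row, and FAILFAST raises an
    # error as soon as one is met.
    df = (sqlContext.read
          .schema(schema)
          .option('mode', 'DROPMALFORMED')
          .option('dateFormat', 'yyyy-MM-dd')
          .csv('/tmp/people.csv'))
----------------------------------------------------------------------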
+   *
    * @since 2.0.0
    */
  @scala.annotation.varargs


http://git-wip-us.apache.org/repos/asf/spark/blob/70517220/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala
----------------------------------------------------------------------
diff --git a/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala b/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala
index 28f5ccd..a57d47d 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala
@@ -606,6 +606,14 @@ final class DataFrameWriter private[sql](df: DataFrame) {
    * }}}
    *
    * You can set the following CSV-specific option(s) for writing CSV files:
+   * <li>`sep` (default `,`): sets the single character as a separator for each
+   * field and value.</li>
+   * <li>`quote` (default `"`): sets the single character used for escaping quoted values where
+   * the separator can be part of the value.</li>
+   * <li>`escape` (default `\`): sets the single character used for escaping quotes inside
+   * an already quoted value.</li>
+   * <li>`header` (default `false`): writes the names of columns as the first line.</li>
+   * <li>`nullValue` (default empty string): sets the string representation of a null value.</li>
    * <li>`compression` (default `null`): compression codec to use when saving to file. This can be
    * one of the known case-insensitive shorten names (`none`, `bzip2`, `gzip`, `lz4`,
    * `snappy` and `deflate`).</li>
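----------------------------------------------------------------------
One way to see the writer options above at work is a null round-trip,
sketched here in PySpark (hypothetical paths; `df` is any DataFrame with
nullable columns): write nulls as a marker string, then declare the same
marker when reading the data back.

    # Nulls become the literal string "NULL" on disk ...
    (df.write
       .option('nullValue', 'NULL')
       .option('header', 'true')
       .csv('/tmp/people-out'))

    # ... and turn back into real nulls when read with the same option.
    df2 = (sqlContext.read
           .option('nullValue', 'NULL')
           .option('header', 'true')
           .csv('/tmp/people-out'))
----------------------------------------------------------------------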

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@spark.apache.org
For additional commands, e-mail: commits-help@spark.apache.org