spark-commits mailing list archives

From r...@apache.org
Subject spark git commit: [SPARK-13425][SQL] Documentation for CSV datasource options
Date Mon, 02 May 2016 02:05:36 GMT
Repository: spark
Updated Branches:
  refs/heads/branch-2.0 a6428292f -> 705172202


[SPARK-13425][SQL] Documentation for CSV datasource options

## What changes were proposed in this pull request?

This PR adds the explanation and documentation for CSV options for reading and writing.
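Among the options documented below, `mode` has three settings. As a rough pure-Python sketch of their documented behaviour on a single pre-split record (illustrative only, not Spark's actual parser; `parse_row` and its arguments are made up for this sketch):

```python
def parse_row(fields, schema_len, mode="PERMISSIVE"):
    """Mimic the documented CSV parse modes for one pre-split record.

    A record counts as corrupt here when its field count differs
    from the schema's column count.
    """
    if len(fields) == schema_len:
        return list(fields)
    if mode == "PERMISSIVE":
        # Keep the record: pad missing fields with None, drop extras.
        return (list(fields) + [None] * schema_len)[:schema_len]
    if mode == "DROPMALFORMED":
        return None  # silently discard the whole record
    if mode == "FAILFAST":
        raise ValueError("malformed record: %r" % (fields,))
    raise ValueError("unknown mode: %r" % (mode,))
```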

## How was this patch tested?

Style tests with `./dev/run_tests` for documentation style.

Author: hyukjinkwon <gurwls223@gmail.com>
Author: Hyukjin Kwon <gurwls223@gmail.com>

Closes #12817 from HyukjinKwon/SPARK-13425.

(cherry picked from commit a832cef11233c6357c7ba7ede387b432e6b0ed71)
Signed-off-by: Reynold Xin <rxin@databricks.com>
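The `sep` and `quote` defaults documented in this patch (`,` and `"`) govern how a field that contains the separator survives as a single value. A rough stdlib analogy (Spark's CSV parser is uniVocity, not Python's `csv` module, so edge-case semantics differ):

```python
import csv
import io

# One CSV line whose second field contains the separator; with the
# default quote character '"' it must be kept as a single field.
line = 'name,"value, with, commas",3\r\n'
row = next(csv.reader(io.StringIO(line), delimiter=',', quotechar='"'))
print(row)  # the quoted field is kept whole
```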


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/70517220
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/70517220
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/70517220

Branch: refs/heads/branch-2.0
Commit: 7051722023b98f1720142c7b3b41948d275ea455
Parents: a642829
Author: hyukjinkwon <gurwls223@gmail.com>
Authored: Sun May 1 19:05:20 2016 -0700
Committer: Reynold Xin <rxin@databricks.com>
Committed: Sun May 1 19:05:32 2016 -0700

----------------------------------------------------------------------
 python/pyspark/sql/readwriter.py                | 52 ++++++++++++++++++++
 .../org/apache/spark/sql/DataFrameReader.scala  | 47 ++++++++++++++++--
 .../org/apache/spark/sql/DataFrameWriter.scala  |  8 +++
 3 files changed, 103 insertions(+), 4 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/spark/blob/70517220/python/pyspark/sql/readwriter.py
----------------------------------------------------------------------
diff --git a/python/pyspark/sql/readwriter.py b/python/pyspark/sql/readwriter.py
index ed9e716..cc5e93d 100644
--- a/python/pyspark/sql/readwriter.py
+++ b/python/pyspark/sql/readwriter.py
@@ -282,6 +282,45 @@ class DataFrameReader(object):
 
         :param paths: string, or list of strings, for input path(s).
 
+        You can set the following CSV-specific options to deal with CSV files:
+            * ``sep`` (default ``,``): sets the single character as a separator \
+                for each field and value.
+            * ``charset`` (default ``UTF-8``): decodes the CSV files by the given \
+                encoding type.
+            * ``quote`` (default ``"``): sets the single character used for escaping \
+                quoted values where the separator can be part of the value.
+            * ``escape`` (default ``\``): sets the single character used for escaping quotes \
+                inside an already quoted value.
+            * ``comment`` (default empty string): sets the single character used for skipping \
+                lines beginning with this character. By default, it is disabled.
+            * ``header`` (default ``false``): uses the first line as names of columns.
+            * ``ignoreLeadingWhiteSpace`` (default ``false``): defines whether or not leading \
+                whitespaces from values being read should be skipped.
+            * ``ignoreTrailingWhiteSpace`` (default ``false``): defines whether or not trailing \
+                whitespaces from values being read should be skipped.
+            * ``nullValue`` (default empty string): sets the string representation of a null value.
+            * ``nanValue`` (default ``NaN``): sets the string representation of a non-number \
+                value.
+            * ``positiveInf`` (default ``Inf``): sets the string representation of a positive \
+                infinity value.
+            * ``negativeInf`` (default ``-Inf``): sets the string representation of a negative \
+                infinity value.
+            * ``dateFormat`` (default ``None``): sets the string that indicates a date format. \
+                Custom date formats follow the formats at ``java.text.SimpleDateFormat``. This \
+                applies to both date type and timestamp type. By default, it is None which means \
+                trying to parse times and date by ``java.sql.Timestamp.valueOf()`` and \
+                ``java.sql.Date.valueOf()``.
+            * ``maxColumns`` (default ``20480``): defines a hard limit of how many columns \
+                a record can have.
+            * ``maxCharsPerColumn`` (default ``1000000``): defines the maximum number of \
+                characters allowed for any given value being read.
+            * ``mode`` (default ``PERMISSIVE``): allows a mode for dealing with corrupt records \
+                during parsing.
+                * ``PERMISSIVE`` : sets other fields to ``null`` when it meets a corrupted record. \
+                    When a schema is set by user, it sets ``null`` for extra fields.
+                * ``DROPMALFORMED`` : ignores the whole corrupted records.
+                * ``FAILFAST`` : throws an exception when it meets corrupted records.
+
         >>> df = sqlContext.read.csv('python/test_support/sql/ages.csv')
         >>> df.dtypes
         [('C0', 'string'), ('C1', 'string')]
@@ -663,6 +702,19 @@ class DataFrameWriter(object):
                             known case-insensitive shorten names (none, bzip2, gzip, lz4,
                             snappy and deflate).
 
+        You can set the following CSV-specific options to deal with CSV files:
+            * ``sep`` (default ``,``): sets the single character as a separator \
+                for each field and value.
+            * ``quote`` (default ``"``): sets the single character used for escaping \
+                quoted values where the separator can be part of the value.
+            * ``escape`` (default ``\``): sets the single character used for escaping quotes \
+                inside an already quoted value.
+            * ``header`` (default ``false``): writes the names of columns as the first line.
+            * ``nullValue`` (default empty string): sets the string representation of a null value.
+            * ``compression``: compression codec to use when saving to file. This can be one of \
+                the known case-insensitive shorten names (none, bzip2, gzip, lz4, snappy and \
+                deflate).
+
         >>> df.write.csv(os.path.join(tempfile.mkdtemp(), 'data'))
         """
         self.mode(mode)

http://git-wip-us.apache.org/repos/asf/spark/blob/70517220/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala
----------------------------------------------------------------------
diff --git a/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala b/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala
index 3d43f20..2d4a68f 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala
@@ -290,7 +290,7 @@ class DataFrameReader private[sql](sparkSession: SparkSession) extends Logging {
    * <li>`allowNumericLeadingZeros` (default `false`): allows leading zeros in numbers
    * (e.g. 00012)</li>
    * <li>`mode` (default `PERMISSIVE`): allows a mode for dealing with corrupt records
-   * during parsing.<li>
+   * during parsing.</li>
    * <ul>
   *  <li>`PERMISSIVE` : sets other fields to `null` when it meets a corrupted record, and puts the
   *  malformed string into a new field configured by `columnNameOfCorruptRecord`. When
@@ -300,7 +300,7 @@ class DataFrameReader private[sql](sparkSession: SparkSession) extends Logging {
   * </ul>
   * <li>`columnNameOfCorruptRecord` (default `_corrupt_record`): allows renaming the new field
    * having malformed string created by `PERMISSIVE` mode. This overrides
-   * `spark.sql.columnNameOfCorruptRecord`.<li>
+   * `spark.sql.columnNameOfCorruptRecord`.</li>
    *
    * @since 1.4.0
    */
@@ -326,7 +326,7 @@ class DataFrameReader private[sql](sparkSession: SparkSession) extends Logging {
   * <li>`allowBackslashEscapingAnyCharacter` (default `false`): allows accepting quoting of all
   * character using backslash quoting mechanism</li>
   * <li>`mode` (default `PERMISSIVE`): allows a mode for dealing with corrupt records
-   * during parsing.<li>
+   * during parsing.</li>
   * <ul>
   *  <li>`PERMISSIVE` : sets other fields to `null` when it meets a corrupted record, and puts the
   *  malformed string into a new field configured by `columnNameOfCorruptRecord`. When
@@ -336,7 +336,7 @@ class DataFrameReader private[sql](sparkSession: SparkSession) extends Logging {
   * </ul>
   * <li>`columnNameOfCorruptRecord` (default `_corrupt_record`): allows renaming the new field
    * having malformed string created by `PERMISSIVE` mode. This overrides
-   * `spark.sql.columnNameOfCorruptRecord`.<li>
+   * `spark.sql.columnNameOfCorruptRecord`.</li>
    *
    * @since 1.6.0
    */
@@ -393,6 +393,45 @@ class DataFrameReader private[sql](sparkSession: SparkSession) extends Logging {
   * This function goes through the input once to determine the input schema. To avoid going
   * through the entire data once, specify the schema explicitly using [[schema]].
   *
+   * You can set the following CSV-specific options to deal with CSV files:
+   * <li>`sep` (default `,`): sets the single character as a separator for each
+   * field and value.</li>
+   * <li>`encoding` (default `UTF-8`): decodes the CSV files by the given encoding
+   * type.</li>
+   * <li>`quote` (default `"`): sets the single character used for escaping quoted values where
+   * the separator can be part of the value.</li>
+   * <li>`escape` (default `\`): sets the single character used for escaping quotes inside
+   * an already quoted value.</li>
+   * <li>`comment` (default empty string): sets the single character used for skipping lines
+   * beginning with this character. By default, it is disabled.</li>
+   * <li>`header` (default `false`): uses the first line as names of columns.</li>
+   * <li>`ignoreLeadingWhiteSpace` (default `false`): defines whether or not leading whitespaces
+   * from values being read should be skipped.</li>
+   * <li>`ignoreTrailingWhiteSpace` (default `false`): defines whether or not trailing
+   * whitespaces from values being read should be skipped.</li>
+   * <li>`nullValue` (default empty string): sets the string representation of a null value.</li>
+   * <li>`nanValue` (default `NaN`): sets the string representation of a non-number value.</li>
+   * <li>`positiveInf` (default `Inf`): sets the string representation of a positive infinity
+   * value.</li>
+   * <li>`negativeInf` (default `-Inf`): sets the string representation of a negative infinity
+   * value.</li>
+   * <li>`dateFormat` (default `null`): sets the string that indicates a date format. Custom date
+   * formats follow the formats at `java.text.SimpleDateFormat`. This applies to both date type
+   * and timestamp type. By default, it is `null` which means trying to parse times and date by
+   * `java.sql.Timestamp.valueOf()` and `java.sql.Date.valueOf()`.</li>
+   * <li>`maxColumns` (default `20480`): defines a hard limit of how many columns
+   * a record can have.</li>
+   * <li>`maxCharsPerColumn` (default `1000000`): defines the maximum number of characters allowed
+   * for any given value being read.</li>
+   * <li>`mode` (default `PERMISSIVE`): allows a mode for dealing with corrupt records
+   *    during parsing.</li>
+   * <ul>
+   *   <li>`PERMISSIVE` : sets other fields to `null` when it meets a corrupted record. When
+   *     a schema is set by user, it sets `null` for extra fields.</li>
+   *   <li>`DROPMALFORMED` : ignores the whole corrupted records.</li>
+   *   <li>`FAILFAST` : throws an exception when it meets corrupted records.</li>
+   * </ul>
+   *
    * @since 2.0.0
    */
   @scala.annotation.varargs

http://git-wip-us.apache.org/repos/asf/spark/blob/70517220/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala
----------------------------------------------------------------------
diff --git a/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala b/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala
index 28f5ccd..a57d47d 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala
@@ -606,6 +606,14 @@ final class DataFrameWriter private[sql](df: DataFrame) {
    * }}}
    *
    * You can set the following CSV-specific option(s) for writing CSV files:
+   * <li>`sep` (default `,`): sets the single character as a separator for each
+   * field and value.</li>
+   * <li>`quote` (default `"`): sets the single character used for escaping quoted values where
+   * the separator can be part of the value.</li>
+   * <li>`escape` (default `\`): sets the single character used for escaping quotes inside
+   * an already quoted value.</li>
+   * <li>`header` (default `false`): writes the names of columns as the first line.</li>
+   * <li>`nullValue` (default empty string): sets the string representation of a null value.</li>
   * <li>`compression` (default `null`): compression codec to use when saving to file. This can be
    * one of the known case-insensitive shorten names (`none`, `bzip2`, `gzip`, `lz4`,
    * `snappy` and `deflate`). </li>
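On the `dateFormat` option documented above: when it is unset, values go through `java.sql.Timestamp.valueOf()` / `java.sql.Date.valueOf()` (which expect `yyyy-mm-dd hh:mm:ss` and `yyyy-mm-dd` shapes); when set, the pattern follows `java.text.SimpleDateFormat`. A Python sketch of the same two-path behaviour using `strptime` (pattern syntax differs from SimpleDateFormat; the fallback formats here approximate the `valueOf` shapes):

```python
from datetime import datetime

def parse_date(value, date_format=None):
    """Sketch of the documented two-path behaviour: use the custom
    format when dateFormat is set, else try the valueOf()-style shapes."""
    if date_format is not None:
        return datetime.strptime(value, date_format)
    for fmt in ("%Y-%m-%d %H:%M:%S", "%Y-%m-%d"):  # Timestamp/Date.valueOf()
        try:
            return datetime.strptime(value, fmt)
        except ValueError:
            pass
    raise ValueError("unparseable date: %r" % (value,))
```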


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@spark.apache.org
For additional commands, e-mail: commits-help@spark.apache.org

