spark-issues mailing list archives

From "PoojaMurarka (JIRA)" <j...@apache.org>
Subject [jira] [Created] (SPARK-26259) RecordSeparator other than newline discovers incorrect schema
Date Mon, 03 Dec 2018 22:00:00 GMT
PoojaMurarka created SPARK-26259:
------------------------------------

             Summary: RecordSeparator other than newline discovers incorrect schema
                 Key: SPARK-26259
                 URL: https://issues.apache.org/jira/browse/SPARK-26259
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 2.4.0
            Reporter: PoojaMurarka
             Fix For: 2.4.1


Although SPARK-21289 (https://issues.apache.org/jira/browse/SPARK-21289) was fixed in Spark 2.3
and allows record separators other than newline, it does not work when the schema is not
specified, i.e. when the schema is inferred.

Let me try to explain this using the data and scenarios below:

Input data (input_data.csv), shown below, *+where recordSeparator is "\t"+*:
{noformat}
"dteday","hr","holiday","weekday","workingday","weathersit","temp","atemp","hum","windspeed"
   "2012-01-01","0","0","0","0","1","9","9.1","66","0"    "2012-01-01","1","0","0","0","1","9","7.2","66","9"{noformat}
*Case 1: Schema defined:* The Spark code below, with a defined *schema*, reads the data correctly:
{code:java}
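import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

// Explicit schema matching the ten columns of input_data.csv
StructType customSchema = new StructType(new StructField[] {
        DataTypes.createStructField("dteday", DataTypes.DateType, true),
        DataTypes.createStructField("hr", DataTypes.IntegerType, true),
        DataTypes.createStructField("holiday", DataTypes.IntegerType, true),
        DataTypes.createStructField("weekday", DataTypes.IntegerType, true),
        DataTypes.createStructField("workingday", DataTypes.IntegerType, true),
        DataTypes.createStructField("weathersit", DataTypes.IntegerType, true),
        DataTypes.createStructField("temp", DataTypes.IntegerType, true),
        DataTypes.createStructField("atemp", DataTypes.DoubleType, true),
        DataTypes.createStructField("hum", DataTypes.IntegerType, true),
        DataTypes.createStructField("windspeed", DataTypes.IntegerType, true)
});

// Pass the schema through schema(); it is not a reader option
Dataset<Row> ds = executionContext.getSparkSession().read().format( "csv" )
          .option( "header", true )
          .schema( customSchema )
          .option( "sep", "," )
          .load( "input_data.csv" );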
{code}
*Case 2: Schema not defined (inferSchema is used):* The data is parsed incorrectly, i.e. the
entire data is read as column names.
{code:java}
Dataset<Row> ds = executionContext.getSparkSession().read().format( "csv" )
          .option( "header", true )
          .option( "inferSchema", true)
          .option( "sep", "," )
          .load( "input_data.csv" );
{code}
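
The symptom in Case 2 can be illustrated without Spark. The sketch below is plain Java and only an assumption about the mechanism, not Spark's actual implementation: if the header is taken to be the first newline-terminated line, and the file uses "\t" between records with no newline at all, then that first "line" is the whole file. The class {{RecordSeparatorDemo}} and its helpers {{splitRecords}}, {{countFields}} and {{sampleData}} are hypothetical names introduced here, and the sample assumes the header is also followed by "\t".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

final class RecordSeparatorDemo {

    // Hypothetical helper: split raw text into records on a literal separator.
    static List<String> splitRecords(String data, String recordSeparator) {
        List<String> records = new ArrayList<>();
        for (String record : data.split(Pattern.quote(recordSeparator))) {
            if (!record.isEmpty()) {
                records.add(record);
            }
        }
        return records;
    }

    // Hypothetical helper: count comma-separated fields in one record.
    // Deliberately naive: the sample values contain no embedded commas.
    static int countFields(String record) {
        return record.split(",").length;
    }

    // Sample shaped like input_data.csv. Assumption: "\t" separates every
    // record, including the header, and the file contains no newline.
    static String sampleData() {
        return String.join("\t",
            "\"dteday\",\"hr\",\"holiday\",\"weekday\",\"workingday\","
                + "\"weathersit\",\"temp\",\"atemp\",\"hum\",\"windspeed\"",
            "\"2012-01-01\",\"0\",\"0\",\"0\",\"0\",\"1\",\"9\",\"9.1\",\"66\",\"0\"",
            "\"2012-01-01\",\"1\",\"0\",\"0\",\"0\",\"1\",\"9\",\"7.2\",\"66\",\"9\"");
    }

    public static void main(String[] args) {
        String data = sampleData();

        // Newline-based splitting sees a single record, so the "header"
        // swallows every data value.
        List<String> byNewline = splitRecords(data, "\n");
        System.out.println("records when split on \\n: " + byNewline.size());
        System.out.println("fields in first record:   " + countFields(byNewline.get(0)));

        // Splitting on the real record separator recovers header plus rows.
        List<String> byTab = splitRecords(data, "\t");
        System.out.println("records when split on \\t: " + byTab.size());
        System.out.println("fields in first record:   " + countFields(byTab.get(0)));
    }
}
```

With this sample, splitting on newline yields one record of 28 comma-separated fields, while splitting on "\t" yields three records of 10 fields each. This matches the report that the entire data ends up as column names.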




