spark-issues mailing list archives

From Nacho García Fernández (JIRA) <j...@apache.org>
Subject [jira] [Updated] (SPARK-23190) Error when inferring date columns
Date Tue, 23 Jan 2018 13:32:00 GMT

     [ https://issues.apache.org/jira/browse/SPARK-23190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nacho García Fernández updated SPARK-23190:
-------------------------------------------
    Description: 
Hi.

I'm trying to read the following file using the spark.sql read utility:

 

 
{code:java}
c1;c2;c3;c4;c5 
"+0000000.";"2";"x";"20001122";2000
"-0000010.21";"2";"x";"19991222";2000 
"+0000113.34";"00";"v";"20001022";2000 
"+0000000.";"0";"a";"20120322";2000
{code}
 

I'm doing this in the spark-shell using the following command: 

 
{code:java}
spark.sqlContext.read
  .option("inferSchema", "true")
  .option("header", "true")
  .option("delimiter", ";")
  .option("timestampFormat", "yyyyMMdd")
  .csv("myfile.csv")
  .printSchema
{code}
and I'm getting the following schema:

 
{code:java}
root
 |-- c1: double (nullable = true)
 |-- c2: integer (nullable = true)
 |-- c3: string (nullable = true)
 |-- c4: integer (nullable = true)
 |-- c5: integer (nullable = true)
{code}
 

As you can see, column c4 is inferred as Integer instead of Timestamp. I think this is due to the order used in the following match clause:

[https://github.com/apache/spark/blob/1c9f95cb771ac78775a77edd1abfeb2d8ae2a124/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVInferSchema.scala#L87]
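
For context, the inference at that line works roughly like the sketch below. This is only a paraphrase to illustrate the ordering, not the actual Spark code: the numeric parses are attempted before the timestamp parse, so a digit-only date never reaches the timestamp branch.

{code:java}
// Simplified sketch of the fall-through order (paraphrase, not the real CSVInferSchema code).
import java.text.SimpleDateFormat
import scala.util.Try

def inferFieldSketch(field: String, timestampFormat: SimpleDateFormat): String = {
  if (Try(field.toInt).isSuccess) "integer"                        // "20001122" stops here
  else if (Try(field.toLong).isSuccess) "long"
  else if (Try(field.toDouble).isSuccess) "double"
  else if (Try(timestampFormat.parse(field)).isSuccess) "timestamp"
  else "string"
}

inferFieldSketch("20001122", new SimpleDateFormat("yyyyMMdd"))     // "integer", not "timestamp"
{code}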

Since my date column consists only of digits, it is inferred as Integer. Would it be correct to change the order in the match clause and give preference to Timestamps? I think that would not be good in terms of performance, since every integer value would first be tried as a timestamp, but I also think the current implementation cannot handle dates that are written purely as digits.
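
A possible workaround on the reading side (just a sketch; to_timestamp needs Spark 2.2+) is to disable inference, declare the schema explicitly, and then parse c4 with the expected pattern:

{code:java}
import org.apache.spark.sql.functions.{col, to_timestamp}
import org.apache.spark.sql.types._

// Workaround sketch: skip inference, read c4 as a string, then parse it explicitly.
val schema = StructType(Seq(
  StructField("c1", DoubleType),
  StructField("c2", IntegerType),
  StructField("c3", StringType),
  StructField("c4", StringType),   // parsed below with the yyyyMMdd pattern
  StructField("c5", IntegerType)))

val df = spark.read
  .schema(schema)
  .option("header", "true")
  .option("delimiter", ";")
  .csv("myfile.csv")
  .withColumn("c4", to_timestamp(col("c4"), "yyyyMMdd"))

df.printSchema
{code}
With the explicit schema c4 does come back as timestamp, but I still think inference should be able to use the timestampFormat option for digit-only dates.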

Thanks in advance.

> Error when inferring date columns
> ---------------------------------
>
>                 Key: SPARK-23190
>                 URL: https://issues.apache.org/jira/browse/SPARK-23190
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.1.1, 2.1.2, 2.2.1
>            Reporter: Nacho García Fernández
>            Priority: Major



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org

