spark-issues mailing list archives

From "Apache Spark (JIRA)" <j...@apache.org>
Subject [jira] [Assigned] (SPARK-25517) Spark DataFrame option inferSchema="true", dataFormat=MM/dd/yyyy, fails to detect date type from the csv file while reading
Date Tue, 25 Sep 2018 02:38:00 GMT

     [ https://issues.apache.org/jira/browse/SPARK-25517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-25517:
------------------------------------

    Assignee: Apache Spark

> Spark DataFrame option inferSchema="true", dataFormat=MM/dd/yyyy, fails to detect date type from the csv file while reading
> ---------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-25517
>                 URL: https://issues.apache.org/jira/browse/SPARK-25517
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.0, 2.3.1
>         Environment: Spark 2.3.0
>            Reporter: Manoranjan Kumar
>            Assignee: Apache Spark
>            Priority: Major
>              Labels: easyfix
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> spark.read.format("csv").option("inferSchema", true).option("dateFormat", "MM/dd/yyyy")
> fails to infer the date type when reading a CSV file that has a date column in the
> specified format (MM/dd/yyyy).
> For example:
> An employee CSV file (employee.csv) has the following two sample records (with header):
> emp_id,emp_name,joining_date,emp_age,emp_in_time,emp_salary
> 100,Bradd Pitt,09/25/2018,26,09/25/2018 10:12:36,10000.00
> 101,Angel Joli,08/20/2018,28,08/20/2018 11:32:58,12000.00
> When I read the above CSV file as a DataFrame like below:
> val empDF = spark.read.format("csv").option("header", true).option("inferSchema", true).option("dateFormat", "MM/dd/yyyy").option("timestampFormat", "MM/dd/yyyy HH:mm:ss").load("employee.csv")
> empDF.printSchema()
> the output is:
> root
>  |-- emp_id: integer (nullable = true)
>  |-- emp_name: string (nullable = true)
>  |-- joining_date: string (nullable = true)
>  |-- emp_age: integer (nullable = true)
>  |-- emp_in_time: timestamp (nullable = true)
>  |-- emp_salary: double (nullable = true)
> Note the data types Spark inferred for joining_date and emp_in_time: joining_date is not
> detected as a date and stays a string, whereas emp_in_time is correctly detected as a
> timestamp.
> I struggled with this issue for a full day. When I dug into the Spark source code, I found
> that the implementation for the date type is missing, whereas the implementation for
> timestamp is present.
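
To illustrate what per-field inference with a date step could look like, here is a minimal, self-contained sketch. It is not Spark's actual inference code: the FieldType hierarchy, the InferenceSketch object, and inferField are hypothetical names, the use of java.time is an assumption made for the example, and the two patterns are simply the ones from this report.

```scala
import java.time.{LocalDate, LocalDateTime}
import java.time.format.DateTimeFormatter
import scala.util.Try

// Hypothetical, simplified field types; NOT Spark's DataType hierarchy.
sealed trait FieldType
case object IntField extends FieldType
case object DoubleField extends FieldType
case object DateField extends FieldType
case object TimestampField extends FieldType
case object StringField extends FieldType

object InferenceSketch {
  // Patterns taken from the report; assumed to be fixed for this sketch.
  private val dateFmt = DateTimeFormatter.ofPattern("MM/dd/yyyy")
  private val tsFmt = DateTimeFormatter.ofPattern("MM/dd/yyyy HH:mm:ss")

  // Try the narrowest type first and fall back to string.
  // The date step is the one the report says is missing from inference.
  def inferField(raw: String): FieldType = {
    val s = raw.trim
    if (Try(s.toInt).isSuccess) IntField
    else if (Try(s.toDouble).isSuccess) DoubleField
    else if (Try(LocalDate.parse(s, dateFmt)).isSuccess) DateField
    else if (Try(LocalDateTime.parse(s, tsFmt)).isSuccess) TimestampField
    else StringField
  }
}
```

Because LocalDate.parse must consume the whole input, a value such as "09/25/2018 10:12:36" fails the date step and falls through to the timestamp step.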
> I am new here (this is my first report); please get back to me if you need further
> information or a live example with running code.
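
A possible workaround, sketched under the assumption that employee.csv is laid out as described above: let inference run as before, then convert joining_date explicitly with to_date. The object name DateWorkaround, the application name, and the local master setting are illustrative choices, not part of the report.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, to_date}

object DateWorkaround {
  def main(args: Array[String]): Unit = {
    // Local session for illustration only; adjust master/appName as needed.
    val spark = SparkSession.builder()
      .appName("SPARK-25517-workaround")
      .master("local[*]")
      .getOrCreate()

    // Read with inference; per the report, joining_date comes back as string.
    val raw = spark.read
      .format("csv")
      .option("header", true)
      .option("inferSchema", true)
      .option("timestampFormat", "MM/dd/yyyy HH:mm:ss")
      .load("employee.csv") // assumed path, as in the report

    // Convert the string column to a date using the known pattern.
    val empDF = raw.withColumn("joining_date", to_date(col("joining_date"), "MM/dd/yyyy"))

    empDF.printSchema()
    spark.stop()
  }
}
```

An alternative with the same intent is to skip inference and pass an explicit schema that declares joining_date as DateType together with the dateFormat option.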



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

