spark-issues mailing list archives

From "Hyukjin Kwon (JIRA)" <j...@apache.org>
Subject [jira] [Resolved] (SPARK-27593) CSV Parser returns 2 DataFrame - Valid and Malformed DFs
Date Tue, 30 Apr 2019 16:53:00 GMT

     [ https://issues.apache.org/jira/browse/SPARK-27593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-27593.
----------------------------------
    Resolution: Won't Fix

The malformed column is just an informative field. I don't think we need a special API that returns
two DataFrames.

> CSV Parser returns 2 DataFrame - Valid and Malformed DFs
> --------------------------------------------------------
>
>                 Key: SPARK-27593
>                 URL: https://issues.apache.org/jira/browse/SPARK-27593
>             Project: Spark
>          Issue Type: New Feature
>          Components: Spark Core
>    Affects Versions: 2.4.2
>            Reporter: Ladislav Jech
>            Priority: Major
>
> When we process CSV in any kind of data warehouse, it is common procedure to report corrupted
> records for audit purposes and as feedback to the vendor, so they can improve their process.
> CSV is no different from XSD in that it defines a schema, although in a very limited
> way (in some cases only as a number of columns, without even headers, and we don't have types).
> But when I check an XML document against an XSD file, I get an exact report of whether the file
> is completely valid and, if not, exactly which records do not follow the schema.
> Such a feature would have big value in Spark for CSV: get malformed records into a separate
> dataframe, with the line number (a pointer within the data object), so I can log both the pointer
> and the actual data (line/row) and trigger an action on this unfortunate event.
> The load() method could return an Array of DFs (Valid, Invalid).
> PERMISSIVE mode isn't enough because it fills missing fields with nulls, so it is even
> harder to detect what is really wrong. Another approach at the moment is to read the file in both
> PERMISSIVE and DROPMALFORMED modes into 2 dataframes and compare them against each other.
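
The behaviour requested above (one collection of valid rows, one collection of malformed rows
with their line numbers) can be sketched in plain Python. This is only an illustration of the
idea, not Spark's API: the function name `split_csv_records`, the `expected_columns` parameter,
and the sample data are all hypothetical, and the sketch assumes one record per line (no quoted
fields containing newlines) and validates only the number of columns.

```python
import csv
from typing import List, Tuple


def split_csv_records(
    text: str, expected_columns: int
) -> Tuple[List[List[str]], List[Tuple[int, str]]]:
    """Split CSV text into well-formed rows and malformed (line_number, raw_line) pairs.

    Hypothetical helper: a row counts as valid only if it has exactly
    `expected_columns` fields. Assumes one record per physical line.
    """
    valid: List[List[str]] = []
    malformed: List[Tuple[int, str]] = []
    for line_number, raw_line in enumerate(text.splitlines(), start=1):
        if not raw_line.strip():
            continue  # ignore blank lines rather than reporting them
        # Parse the single line with the csv module so quoted commas are handled.
        fields = next(csv.reader([raw_line]))
        if len(fields) == expected_columns:
            valid.append(fields)
        else:
            # Keep the pointer (line number) and the raw data for auditing.
            malformed.append((line_number, raw_line))
    return valid, malformed


# Made-up sample: lines 2 and 4 do not have three columns.
sample = "1,alice,30\n2,bob\n3,carol,41\n4,dave,25,extra\n"
valid_rows, malformed_rows = split_csv_records(sample, expected_columns=3)
```

Here `malformed_rows` keeps the raw line next to its 1-based line number, which is the
"pointer plus real data" pair the reporter wants to log and send back to the vendor.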



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


