spark-issues mailing list archives

From "Frank Kemmer (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (SPARK-25343) Extend CSV parsing to Dataset[List[String]]
Date Wed, 05 Sep 2018 18:11:00 GMT

     [ https://issues.apache.org/jira/browse/SPARK-25343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Frank Kemmer updated SPARK-25343:
---------------------------------
    Description: 
With the csv() method it is currently possible to create a DataFrame from a Dataset[String],
where each string contains comma-separated values. This is really great.
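
For illustration, a minimal sketch of that existing API (the session setup and the sample
values are made up):

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// Each element of the Dataset[String] is one CSV line.
val lines = Seq("1,alice", "2,bob").toDS()

// csv(Dataset[String]) parses the lines against the given schema.
val df = spark.read
  .schema("id INT, name STRING")
  .csv(lines)
df.show()
{code}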

But very often we have to parse files where the values of a line must be split by custom
value separators and regular expressions. The result is a Dataset[List[String]]. Each list
corresponds to what you would get after splitting the values of a CSV string.
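
For example, continuing with the session above (the "||" separator is made up, just to
show the shape of the data):

{code:scala}
import org.apache.spark.sql.Dataset

// Lines use "||" as separator, so the plain CSV reader does not apply directly.
val raw = Seq("1||alice", "2||bob").toDS()

// Split each line ourselves; the result is a Dataset[List[String]].
val parts: Dataset[List[String]] = raw.map(_.split("\\|\\|").toList)
{code}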

It would be great if the csv() method also accepted such a Dataset as input, especially
together with a target schema. The CSV parser already casts the separated values against
the schema and can sort out lines whose column values do not fit the schema.
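
A hypothetical sketch of what such an overload could look like (this does not exist in
Spark today; the signature is made up by analogy to the existing csv(Dataset[String])):

{code:scala}
// Proposed, not existing: parse pre-split rows against a schema,
// analogous to the existing csv(Dataset[String]) overload.
// def csv(splitDataset: Dataset[List[String]]): DataFrame

// Desired usage, with parts: Dataset[List[String]] from the snippet above:
// val df = spark.read.schema("id INT, name STRING").csv(parts)
{code}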

This is especially interesting in PERMISSIVE mode with a column for corrupt records, which
should then contain the input list of strings as a dumped JSON string.
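
With the existing csv(Dataset[String]) overload this behaviour looks roughly as follows
(the corrupt-record column name is Spark's default, the sample rows are made up); the
request is to get the same for Dataset[List[String]]:

{code:scala}
// Rows whose values do not fit the schema keep the raw input in _corrupt_record.
val withCorrupt = spark.read
  .schema("id INT, name STRING, _corrupt_record STRING")
  .option("mode", "PERMISSIVE")
  .option("columnNameOfCorruptRecord", "_corrupt_record")
  .csv(Seq("1,alice", "not-a-number,bob").toDS())
withCorrupt.show(truncate = false)
{code}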

This is the functionality I am looking for, and I think it is already implemented in the
CSV parser.

  was:
With the csv() method it is currently possible to create a DataFrame from a Dataset[String],
where each string contains comma-separated values. This is really great.

But very often we have to parse files where the values of a line must be split by custom
value separators and regular expressions. The result is a Dataset[List[String]]. Each list
corresponds to what you would get after splitting the values of a CSV string.

It would be great if the csv() method also accepted such a Dataset as input, especially
together with a target schema. The CSV parser already casts the separated values against
the schema and can sort out lines whose column values do not fit the schema.

This is the functionality I am looking for, and I think it is already implemented in the
CSV parser.


> Extend CSV parsing to Dataset[List[String]]
> -------------------------------------------
>
>                 Key: SPARK-25343
>                 URL: https://issues.apache.org/jira/browse/SPARK-25343
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 2.3.1
>            Reporter: Frank Kemmer
>            Priority: Minor
>



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


