flink-issues mailing list archives

From "Timo Walther (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (FLINK-2988) Cannot load DataSet[Row] from CSV file
Date Sun, 29 Nov 2015 15:39:11 GMT

    [ https://issues.apache.org/jira/browse/FLINK-2988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15031016#comment-15031016 ]

Timo Walther commented on FLINK-2988:
-------------------------------------

Yes, that is a good example of a {{TableSource}} in a {{TableEnvironment}}. But maybe it would
also make sense to move {{Row}} to flink-core and provide an easy way to read nullable, variable-length
CSV files in the DataSet API as well. Tuples and POJOs are sometimes simply too static. I also
had a use case with more than 25 columns; defining a POJO for that many columns is quite cumbersome.
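
For illustration, such a wide schema can then be described at runtime instead of in a dedicated class. A minimal sketch (the column count and names are made up; the {{RowTypeInfo(fieldTypes, fieldNames)}} constructor is the one from the Table API):

{code}
import org.apache.flink.api.common.typeinfo.BasicTypeInfo
import org.apache.flink.api.table.typeinfo.RowTypeInfo

// 30 string columns described at runtime - no 30-field POJO or Tuple needed.
val wideSchema = new RowTypeInfo(
  Seq.fill(30)(BasicTypeInfo.STRING_TYPE_INFO),
  (0 until 30).map(i => s"col$i"))
{code}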

> Cannot load DataSet[Row] from CSV file
> --------------------------------------
>
>                 Key: FLINK-2988
>                 URL: https://issues.apache.org/jira/browse/FLINK-2988
>             Project: Flink
>          Issue Type: Improvement
>          Components: DataSet API, Table API
>    Affects Versions: 0.10.0
>            Reporter: Johann Kovacs
>            Priority: Minor
>
> Tuple classes (Java/Scala both) only have arity up to 25, meaning I cannot load a CSV file with more than 25 columns directly as a DataSet\[TupleX\[...\]\].
> An alternative to Tuples is the Table API's Row class, which allows for arbitrary-length, arbitrary-type, runtime-supplied schemata (using RowTypeInfo) and index-based access.
> However, trying to load a CSV file as a DataSet\[Row\] yields an exception:
> {code}
> import org.apache.flink.api.common.typeinfo.BasicTypeInfo
> import org.apache.flink.api.scala._
> import org.apache.flink.api.table.Row
> import org.apache.flink.api.table.typeinfo.RowTypeInfo
> import scala.reflect.ClassTag
> val env = ExecutionEnvironment.createLocalEnvironment()
> val filePath = "../someCsv.csv"
> val typeInfo = new RowTypeInfo(
>   Seq(BasicTypeInfo.STRING_TYPE_INFO, BasicTypeInfo.INT_TYPE_INFO),
>   Seq("word", "number"))
> val source = env.readCsvFile(filePath)(ClassTag(classOf[Row]), typeInfo)
> println(source.collect())
> {code}
> with someCsv.csv containing:
> {code}
> one,1
> two,2
> {code}
> yields
> {code}
> Exception in thread "main" java.lang.ClassCastException: org.apache.flink.api.table.typeinfo.RowSerializer cannot be cast to org.apache.flink.api.java.typeutils.runtime.TupleSerializerBase
> 	at org.apache.flink.api.scala.operators.ScalaCsvInputFormat.<init>(ScalaCsvInputFormat.java:46)
> 	at org.apache.flink.api.scala.ExecutionEnvironment.readCsvFile(ExecutionEnvironment.scala:282)
> {code}
> As a user I would like to be able to load a CSV file into a DataSet\[Row\], preferably with a convenience method to specify the schema (RowTypeInfo), without having to use the "explicit implicit parameters" syntax or specify the ClassTag.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
