flink-user mailing list archives

From Fabian Hueske <fhue...@gmail.com>
Subject Re: CSV input with unknown # of fields and Custom output format
Date Tue, 03 Feb 2015 23:07:38 GMT

Flink provides a CsvInputFormat which returns Tuples of the parsed fields.
The format can be configured in several ways (which fields to read,
line/field delimiters, comment prefixes, ...). The CsvInputFormat expects
each file to have a consistent format. You could configure one format for
each file layout that you need to read and convert the output to your
common type using a Map function.
If the layout of the files is not known ahead of time, you should implement
your own format (probably based on the DelimitedInputFormat).
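
To illustrate the "convert to a common type" step, here is a minimal sketch in
plain Scala with no Flink dependency. It assumes that the first line of each
file is the header and that fields contain no quoted delimiters. The names
CsvUnion, parse, unionHeader and align are hypothetical and not part of Flink;
in a job, the per-line logic would live in your Map function or in a custom
input format.

```scala
import java.util.regex.Pattern

// Hypothetical helper, not a Flink API: turns header-prefixed CSV lines into
// header -> value maps and aligns them to a common header.
object CsvUnion {
  type Record = Map[String, String]

  // Split one line into trimmed cells; limit -1 keeps trailing empty cells.
  private def cells(line: String, delimiter: String): Seq[String] =
    line.split(Pattern.quote(delimiter), -1).map(_.trim).toSeq

  // Parse one file. Assumption: the first line is the header.
  // Returns the header (in file order) and one Record per non-empty data line.
  def parse(lines: Seq[String], delimiter: String = ","): (Seq[String], Seq[Record]) =
    lines match {
      case header +: data =>
        val names = cells(header, delimiter)
        val records = data.filter(_.trim.nonEmpty).map { line =>
          // zip silently drops surplus cells or surplus header names
          names.zip(cells(line, delimiter)).toMap
        }
        (names, records)
      case _ => (Seq.empty, Seq.empty)
    }

  // Union of all column names, in first-seen order.
  def unionHeader(headers: Seq[Seq[String]]): Seq[String] =
    headers.flatten.distinct

  // Lay a record out against the common header; missing columns become "".
  def align(record: Record, header: Seq[String]): Seq[String] =
    header.map(name => record.getOrElse(name, ""))
}
```

For the two example files from the question, unionHeader applied to both
parsed headers gives id, name, age, [unknown1], [unknown2], and align fills
the columns a file does not have with empty strings.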

Flink provides a SerializedOutputFormat that writes data in the binary
representation that is also used during processing, e.g., for network
transfer and disk spilling. The data can be later read using the
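
For a record like the (id: Long, array: Array[String]) class from the
question, a binary layout could look roughly like the sketch below. It uses
only java.io and does not call any Flink API. MyRecord and MyRecordCodec are
hypothetical names. As far as I know, the SerializedOutputFormat works on types
that implement Flink's IOReadableWritable interface, whose write and read
methods receive views that extend java.io.DataOutput and java.io.DataInput, so
the same method bodies should carry over, but please treat that as an
assumption to check against the Flink version you use.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream}
import java.io.{DataInput, DataInputStream, DataOutput, DataOutputStream}

// Hypothetical record type mirroring MyClass[id: Long, array: Array[String]].
case class MyRecord(id: Long, values: Array[String])

// Hypothetical codec, not a Flink class.
object MyRecordCodec {
  // Layout: id (8 bytes), element count (4 bytes), then each string.
  // Note: writeUTF limits each encoded string to 65535 bytes.
  def write(record: MyRecord, out: DataOutput): Unit = {
    out.writeLong(record.id)
    out.writeInt(record.values.length)
    record.values.foreach(s => out.writeUTF(s))
  }

  // Reads the fields back in exactly the order they were written.
  def read(in: DataInput): MyRecord = {
    val id = in.readLong()
    val count = in.readInt()
    MyRecord(id, Array.fill(count)(in.readUTF()))
  }

  // Convenience wrappers for an in-memory round trip.
  def toBytes(record: MyRecord): Array[Byte] = {
    val buffer = new ByteArrayOutputStream()
    val out = new DataOutputStream(buffer)
    write(record, out)
    out.flush()
    buffer.toByteArray
  }

  def fromBytes(bytes: Array[Byte]): MyRecord =
    read(new DataInputStream(new ByteArrayInputStream(bytes)))
}
```

A quick round trip such as
MyRecordCodec.fromBytes(MyRecordCodec.toBytes(MyRecord(1L, Array("a", "b"))))
is a cheap way to check that read and write agree on the layout.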

Best regards, Fabian

2015-02-03 22:31 GMT+01:00 Vinh June <hoangthevinh.htv@gmail.com>:

> Hi Flinkers,
> I am totally new to Flink and Scala. I am trying to study Flink in Scala
> for a project in university and ran into 2 problems. It would be great if
> you guys can give me any advice.
> #1 problem is that I want to read CSV files with varied fields (different
> names and number of fields), for example:
> file 1: id, name, age
> file 2: id, name, [unknown1], [unknown2]
> expected result set: id, name, age, [unknown1], [unknown2]
> Currently I read each file as Array, then map the array to a common class
> with Map[header, value] (since I will need to know which value belongs to
> which header)
> With this method I ran into #2 problem with output format
> #2 I would like to store binary info, for example, for class
> DataSet[MyClass[id: Long, array: Array[String]]] to read them later. I
> found FileOutputFormat might be the solution, but I can't find any example
> of how to define one in Scala
> --
> View this message in context:
> http://apache-flink-incubator-user-mailing-list-archive.2336050.n4.nabble.com/CSV-input-with-unknown-of-fields-and-Custom-output-format-tp670.html
> Sent from the Apache Flink (Incubator) User Mailing List archive at
> Nabble.com.
