flink-dev mailing list archives

From Fabian Hueske <fhue...@gmail.com>
Subject Re: Flink CSV parsing
Date Fri, 10 Mar 2017 21:06:21 GMT
Hi Flavio,

Flink's CsvInputFormat was originally meant to be an efficient way to parse
structured text files and dates back to the very early days of the project
(probably 2011 or so).
It was never meant to be compliant with the CSV RFC (RFC 4180) and initially
didn't support many features such as quoting and quote escaping. Some of
these were added later, but others were not.

I agree that the requirements for the CsvInputFormat have changed as more
people are using the project and that a standards-compliant parser would be
desirable.
We could definitely look into using an existing library for the parsing,
but it would still need to be integrated with the way that Flink's
InputFormats work. For instance, your approach isn't standards-compliant
either, because TextInputFormat is not aware of quotes and would break
records that contain quoted record delimiters (FLINK-6016 [1]).
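
To illustrate the quoted-delimiter problem, here is a minimal sketch in plain
Java. It is not Flink code: the class QuotedDelimiterDemo and both methods are
made up for this example. It only contrasts splitting on every newline with
splitting that tracks double quotes, and it assumes the whole input is read
sequentially from the start.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class, not part of Flink or any library.
public class QuotedDelimiterDemo {

    // Naive splitting: every '\n' ends a record, regardless of quoting.
    public static List<String> splitNaive(String input) {
        List<String> records = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c == '\n') {
                records.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        if (current.length() > 0) {
            records.add(current.toString());
        }
        return records;
    }

    // Quote-aware splitting: a '\n' inside a double-quoted field is data.
    public static List<String> splitQuoteAware(String input) {
        List<String> records = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c == '"') {
                // An escaped quote ("") toggles twice, so the state stays correct.
                inQuotes = !inQuotes;
                current.append(c);
            } else if (c == '\n' && !inQuotes) {
                records.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        if (current.length() > 0) {
            records.add(current.toString());
        }
        return records;
    }

    public static void main(String[] args) {
        // Two logical records; the first has a newline inside a quoted field.
        String csv = "1,\"line one\nline two\"\n2,plain\n";
        System.out.println("naive:       " + splitNaive(csv).size() + " records");
        System.out.println("quote-aware: " + splitQuoteAware(csv).size() + " records");
    }
}
```

The naive version cuts the first record in two. Note that the quote-aware
version only works because it scans from the beginning of the input; a reader
that starts somewhere in the middle of a file cannot tell from its position
alone whether it is inside a quoted field.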

I would be OK with having a less efficient format which is not based on the
current implementation but which is standards-compliant.
IMO that would be a very useful contribution.

Best, Fabian

[1] https://issues.apache.org/jira/browse/FLINK-6016

2017-03-10 11:28 GMT+01:00 Flavio Pompermaier <pompermaier@okkam.it>:

> Hi to all,
> I would like to discuss CSV parsing with the dev group.
> Since I started using Flink with CSVs, I have always run into small problems
> here and there, and the new tickets about CSV parsing seem to confirm
> that this part is still problematic.
> In my production jobs I gave up using Flink's CSV parsing in favour of Apache
> commons-csv, and it works great. It's perfectly configurable and robust.
> A working example is available at [1].
>
> Thus, why not use that library directly and contribute back (if needed)
> to another Apache library if improvements are required to speed up the
> parsing? Have you ever compared the performance of the two parsers?
>
> Best,
> Flavio
>
> [1] https://github.com/okkam-it/flink-examples/blob/master/src/main/java/it/okkam/datalinks/batch/flink/datasourcemanager/importers/Csv2RowExample.java
>
