flink-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Flavio Pompermaier <pomperma...@okkam.it>
Subject Re: Flink CSV parsing
Date Fri, 10 Mar 2017 22:17:37 GMT
If you already have an idea on how to proceed maybe I can try to take care
of issue a PR using commons-csv or whatever library you prefer

On 10 Mar 2017 22:07, "Fabian Hueske" <fhueske@gmail.com> wrote:

Hi Flavio,

Flink's CsvInputFormat was originally meant to be an efficient way to parse
structured text files and dates back to the very early days of the project
(probably 2011 or so).
It was never meant to be compliant with the RFC specification and initially
didn't support many features like quoting, quote escaping, etc. Some of
these were later added but others not.

I agree that the requirements for the CsvInputFormat have changed as more
people are using the project and that a standard compliant parser would be
desirable.
We could definitely look into using an existing library for the parsing,
but it would still need to be integrated with the way that Flink's
InputFormats work. For instance, you're approach isn't standard compliant
either, because TextInputFormat is not aware of quotes and would break
records with quoted record delimiters (FLINK-6016 [1]).

I would be OK with having a less efficient format which is not based on the
current implementation but which is standard compliant.
IMO that would be a very useful contribution.

Best, Fabian

[1] https://issues.apache.org/jira/browse/FLINK-6016





2017-03-10 11:28 GMT+01:00 Flavio Pompermaier <pompermaier@okkam.it>:

> Hi to all,
> I want to discuss with the dev group something about CSV parsing.
> Since I started using Flink with CSVs I always faced some little problem
> here and there and the new tickets about the CSV parsing seems to confirm
> that this part is still problematic.
> In my production jobs I gave up using Flink CSV parsing in favour of
apace
> commons-csv and it works great. It's perfectly configurable ans robust.
> A working example is available at [1].
>
> Thus, why not to use that library directly and contribute back (if needed)
> to another apache library if improvements are required to speed up the
> parsing? Have you ever tried to compare the performances of the 2 parsers?
>
> Best,
> Flavio
>
> [1]
> https://github.com/okkam-it/flink-examples/blob/master/
> src/main/java/it/okkam/datalinks/batch/flink/datasourcemanager/importers/
> Csv2RowExample.java
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message