flink-dev mailing list archives

From Fabian Hueske <fhue...@apache.org>
Subject Re: CsvInputFormat delimiter fields
Date Thu, 16 Oct 2014 09:28:59 GMT
I don't think that multi-char field delimiters would cause a performance
problem. The data needs to be parsed anyway.
Only in cases where a prefix of the delimiter occurs often in the
regular data could it have a major impact.
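
To illustrate the prefix case, here is a minimal sketch of a naive
multi-char delimiter scan over a byte array. It is not the CsvInputFormat
parser; the class name MultiCharDelimiterSketch, the split method, and the
comparisons counter are hypothetical and exist only for this example. The
counter shows that a delimiter whose first byte is common in the data
triggers extra comparisons at almost every position.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, NOT Flink's CsvInputFormat implementation.
final class MultiCharDelimiterSketch {

    // Number of byte comparisons performed by the most recent split() call.
    static long comparisons;

    // Splits one line into fields using a delimiter of one or more bytes.
    static List<String> split(byte[] line, byte[] delim) {
        if (delim.length == 0) {
            throw new IllegalArgumentException("delimiter must not be empty");
        }
        comparisons = 0;
        List<String> fields = new ArrayList<>();
        int fieldStart = 0;
        int i = 0;
        // Only positions where a full delimiter still fits need to be checked.
        while (i <= line.length - delim.length) {
            int matched = 0;
            while (matched < delim.length) {
                comparisons++;
                if (line[i + matched] != delim[matched]) {
                    break; // partial match of a delimiter prefix: wasted work
                }
                matched++;
            }
            if (matched == delim.length) {
                // Full delimiter found: emit the field and skip past it.
                fields.add(new String(line, fieldStart, i - fieldStart,
                        StandardCharsets.UTF_8));
                i += delim.length;
                fieldStart = i;
            } else {
                i++;
            }
        }
        // The remainder of the line is the last field.
        fields.add(new String(line, fieldStart, line.length - fieldStart,
                StandardCharsets.UTF_8));
        return fields;
    }
}
```

With this sketch, a delimiter such as "ab" costs roughly two comparisons per
position on data consisting mostly of 'a', while a delimiter such as "xb"
costs one, which matches a single-char scan.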

Fabian

2014-10-15 16:07 GMT+02:00 Martin Neumann <mneumann@spotify.com>:

> Would changing it cost performance?
> If not, I think it would be a good change to make, since it allows one to
> (ab)use the csv reader to load structured text files (for example by using
> keywords as delimiters).
>
> Being able to put a regular expression there would be even nicer but maybe
> it should end up in its own InputFormat then.
>
> cheers Martin
>
> On Wed, Oct 15, 2014 at 3:47 PM, Stephan Ewen <sewen@apache.org> wrote:
>
> > Hi!
> >
> > The reason is the current way the csv parsers work. They are pushed into
> > the byte stream parsing and are restricted to recognizing single-char
> > delimiters. It is possible to change that, but it would be a bit of work.
> >
> > Stephan
> >
> > On Wed, Oct 15, 2014 at 3:36 PM, Martin Neumann <mneumann@spotify.com>
> > wrote:
> >
> > > Hej,
> > >
> > > A lot of my inputs are csv files so I use the CsvInputFormat a lot. What I
> > > find kind of odd is that the line delimiter is a String but the field
> > > delimiter is a Character.
> > >
> > > *see:* new CsvInputFormat<Tuple2<String,String>>(new
> > > Path(pVecPath),"\n",'\t',String.class,String.class)
> > >
> > > Is there a reason for this? I'm currently working with a file that has a
> > > more complex field delimiter so I had to write a mapper to read from
> > > StringInputFormat.
> > >
> > > cheers Martin
> > >
> >
>
