Subject: Re: Quotes in fields of CsvInputFormat
From: Max Michels
To: user@flink.incubator.apache.org
Date: Tue, 9 Dec 2014 18:51:53 +0100

That sounds like a good idea. Just like setDelimeter("|"), one should be
able to call setParseDoubleQuotes(false) to disable the special handling
of double quotes.

You're right, Fabian, the current implementation treats all String fields
alike. Maybe we can expect the user to provide a consistently formatted
input file (i.e. with or without the use of double quotes as identifiers)?

On Tue, Dec 9, 2014 at 2:32 PM, Fabian Hueske wrote:

> With the current implementation, quoted string parsing kicks in if the
> first non-whitespace character of a field is a double quote (just as in
> Malte's case). I think this behaviour can be quite unexpected for users.
> Wouldn't it be better to make the behaviour of the String parsing more
> explicit, i.e., add a switch to enable or disable quoted string parsing?
> With the current implementation, the configuration would affect all
> String fields in a file, though...
>
> Cheers, Fabian
>
> 2014-12-09 12:17 GMT+01:00 Max Michels:
>
>> Hi Malte,
>>
>> Typically, double quotes are used to identify strings and thus are not
>> interpreted literally. Any data in a field after a double-quoted string
>> is regarded as invalid trailing data.
>>
>> You could replace double quotes with single quotes:
>>
>> A|ggg
>> B|'hhh' xx
>> C|xxx
>>
>> This results in the expected >'hhh' xx< for the second line.
>>
>> Best regards,
>> Max
>>
>> On Fri, Dec 5, 2014 at 4:44 PM, Malte Schwarzer wrote:
>>
>>> Hi Stephan,
>>>
>>> The result should be >"hhh" xx< as the field value. Enclosures should
>>> be disabled, but there seems to be no method to do that.
>>>
>>> Malte
>>>
>>> From: Stephan Ewen
>>> Reply-To:
>>> Date: Friday, 5 December 2014 16:28
>>> To:
>>> Subject: Re: Quotes in fields of CsvInputFormat
>>>
>>> Hi!
>>>
>>> The parser interprets the quotes as quotes for the field. That means
>>> the second field (the string) stops after the "hhh" and the xx is
>>> considered invalid trailing data.
>>>
>>> What do you expect as the result of parsing that line?
>>>
>>> Stephan
>>>
>>> On Fri, Dec 5, 2014 at 4:16 PM, Malte Schwarzer wrote:
>>>
>>>> Hi,
>>>>
>>>> I'm trying to import a CSV file, but the parser seems to have
>>>> problems with quotes at the beginning of a field. Is there a way to
>>>> set or disable enclosures for the CSV input?
>>>>
>>>> This is my code:
>>>>
>>>> DataSet<Tuple2<String, String>> res = env.readCsvFile(inputCsvFilename)
>>>>         .fieldDelimiter('|')
>>>>         .types(String.class, String.class);
>>>>
>>>> CSV:
>>>>
>>>> A|ggg
>>>> B|"hhh" xx
>>>> C|xxx
>>>>
>>>> As a result I'm receiving a ParseException for line B:
>>>>
>>>> org.apache.flink.api.common.io.ParseException: Line could not be
>>>> parsed: 'B|"hhh" xx'
>>>>
>>>> Thanks,
>>>> Malte
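
For illustration, the behaviour discussed in this thread can be sketched in
plain Java. This is not Flink's actual parser code: the class
QuotedFieldSketch, its parseLine method, its nested ParseException and the
parseDoubleQuotes flag (standing in for the proposed
setParseDoubleQuotes(false)) are all hypothetical names chosen for this
sketch. It only mirrors the rules described above: quoted parsing starts
when the first non-whitespace character of a field is a double quote, and
anything other than whitespace after the closing quote counts as invalid
trailing data.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch only; not Flink's CsvInputFormat implementation. */
public class QuotedFieldSketch {

    /** Hypothetical stand-in for the ParseException mentioned in the thread. */
    public static class ParseException extends RuntimeException {
        public ParseException(String message) {
            super(message);
        }
    }

    /**
     * Splits one line into String fields.
     *
     * @param parseDoubleQuotes hypothetical switch; when false, double
     *                          quotes are kept as literal characters
     */
    public static String[] parseLine(String line, char delimiter, boolean parseDoubleQuotes) {
        List<String> fields = new ArrayList<>();
        int pos = 0;
        while (true) {
            // Find the first non-whitespace character of the current field.
            int first = pos;
            while (first < line.length()
                    && line.charAt(first) != delimiter
                    && Character.isWhitespace(line.charAt(first))) {
                first++;
            }

            int end;
            if (parseDoubleQuotes && first < line.length() && line.charAt(first) == '"') {
                // Quoted field: the value ends at the closing quote.
                int close = line.indexOf('"', first + 1);
                if (close < 0) {
                    throw new ParseException("Line could not be parsed: '" + line + "'");
                }
                end = line.indexOf(delimiter, close + 1);
                if (end < 0) {
                    end = line.length();
                }
                // Anything but whitespace after the closing quote is invalid.
                if (!line.substring(close + 1, end).trim().isEmpty()) {
                    throw new ParseException("Line could not be parsed: '" + line + "'");
                }
                fields.add(line.substring(first + 1, close));
            } else {
                // Unquoted field: take everything up to the next delimiter.
                end = line.indexOf(delimiter, pos);
                if (end < 0) {
                    end = line.length();
                }
                fields.add(line.substring(pos, end));
            }

            if (end == line.length()) {
                break;
            }
            pos = end + 1;
        }
        return fields.toArray(new String[0]);
    }

    public static void main(String[] args) {
        String line = "B|\"hhh\" xx";

        // Quoted parsing disabled: the quotes are ordinary characters.
        String[] literal = parseLine(line, '|', false);
        System.out.println(literal[1]);

        // Quoted parsing enabled: " xx" is trailing data, so parsing fails.
        try {
            parseLine(line, '|', true);
        } catch (ParseException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

With the flag set to false, line B yields the field value >"hhh" xx< that
Malte expected. With the flag set to true, the sketch rejects the line just
as in the original report. As noted in the thread, a single switch like
this affects every String field in the file.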