Date: Tue, 9 Dec 2014 14:32:48 +0100
Subject: Re: Quotes in fields of CsvInputFormat
From: Fabian Hueske
To: user@flink.incubator.apache.org

With the current implementation, quoted string parsing kicks in if the
first non-whitespace character of a field is a double quote (just as in
Malte's case). I think this behaviour can be quite unexpected for users.

Wouldn't it be better to make the behaviour of the String parsing more
explicit, i.e., add a switch to enable or disable quoted string parsing?
With the current implementation, the configuration would affect all String
fields in a file, though...

Cheers, Fabian

2014-12-09 12:17 GMT+01:00 Max Michels:

> Hi Malte,
>
> Typically, double quotes are used to identify strings and thus are not
> interpreted literally. Any data in a field after a double-quoted string is
> regarded as invalid trailing data.
>
> You could replace double quotes with single quotes:
>
> A|ggg
> B|'hhh' xx
> C|xxx
>
> This results in the expected >'hhh' xx< for the second line.
>
> Best regards,
> Max
>
> On Fri, Dec 5, 2014 at 4:44 PM, Malte Schwarzer wrote:
>
>> Hi Stephan,
>>
>> The result should be >"hhh" xx< as the field value. Enclosures should be
>> disabled, but there seems to be no method to do that.
>>
>> Malte
>>
>> From: Stephan Ewen
>> Reply-To: user@flink.incubator.apache.org
>> Date: Friday, 5 December 2014 16:28
>> To: user@flink.incubator.apache.org
>> Subject: Re: Quotes in fields of CsvInputFormat
>>
>> Hi!
>>
>> The parser interprets the quotes as quotes for the field. That means the
>> second field (the string) stops after the "hhh" and the xx is considered
>> invalid trailing data.
>>
>> What do you expect as the result of parsing that line?
>>
>> Stephan
>>
>> On Fri, Dec 5, 2014 at 4:16 PM, Malte Schwarzer wrote:
>>
>>> Hi,
>>>
>>> I'm trying to import a CSV file, but the parser seems to have problems
>>> with quotes at the beginning of a field. Is there a way to set or
>>> disable enclosures for the CSV input?
>>>
>>> This is my code:
>>>
>>> DataSet<Tuple2<String, String>> res = env.readCsvFile(inputCsvFilename)
>>>         .fieldDelimiter('|')
>>>         .types(String.class, String.class);
>>>
>>> CSV:
>>>
>>> A|ggg
>>> B|"hhh" xx
>>> C|xxx
>>>
>>> As a result, I receive a ParseException for line B:
>>>
>>> org.apache.flink.api.common.io.ParseException: Line could not be
>>> parsed: 'B|"hhh" xx'
>>>
>>> Thanks,
>>> Malte
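
The behaviour discussed in this thread can be illustrated with a small, self-contained sketch. It is not Flink's CsvInputFormat or its string parser; the class QuotedFieldSketch, the methods parseField and parseLine, and the quotedParsing flag are hypothetical names chosen only for this example. The sketch assumes the rule described above: quoted parsing starts when the first non-whitespace character of a field is a double quote, and anything after the closing quote counts as invalid trailing data. The flag stands in for the proposed switch. Delimiters inside quotes and escaped quotes are deliberately not handled.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch only; this is NOT Flink's CsvInputFormat or string parser. */
final class QuotedFieldSketch {

    /**
     * Parses a single field. With quotedParsing enabled, a field whose first
     * non-whitespace character is a double quote must end at the closing quote;
     * anything after it is rejected as trailing data. With quotedParsing
     * disabled (the hypothetical switch), the field is returned literally.
     */
    static String parseField(String field, boolean quotedParsing) {
        int start = 0;
        while (start < field.length() && Character.isWhitespace(field.charAt(start))) {
            start++;
        }
        boolean quoted = quotedParsing && start < field.length() && field.charAt(start) == '"';
        if (!quoted) {
            return field; // taken literally, quotes included
        }
        int close = field.indexOf('"', start + 1);
        if (close < 0) {
            throw new IllegalArgumentException("Unterminated quoted string: " + field);
        }
        if (close != field.length() - 1) {
            throw new IllegalArgumentException("Invalid trailing data after quoted string: " + field);
        }
        return field.substring(start + 1, close);
    }

    /** Splits a line on the delimiter (no support for delimiters inside quotes). */
    static List<String> parseLine(String line, char delimiter, boolean quotedParsing) {
        List<String> fields = new ArrayList<>();
        int from = 0;
        while (true) {
            int at = line.indexOf(delimiter, from);
            String raw = at < 0 ? line.substring(from) : line.substring(from, at);
            fields.add(parseField(raw, quotedParsing));
            if (at < 0) {
                return fields;
            }
            from = at + 1;
        }
    }

    public static void main(String[] args) {
        String[] lines = {"A|ggg", "B|\"hhh\" xx", "C|xxx"};
        for (boolean quotedParsing : new boolean[] {true, false}) {
            for (String line : lines) {
                try {
                    System.out.println(quotedParsing + " " + parseLine(line, '|', quotedParsing));
                } catch (IllegalArgumentException e) {
                    System.out.println(quotedParsing + " rejected: " + e.getMessage());
                }
            }
        }
    }
}
```

With the flag enabled, line B is rejected in the same way Malte's input was. With it disabled, the second field keeps its quotes, which is the result Malte asked for. Max's single-quote workaround passes in either mode because a single quote never triggers quoted parsing in this sketch.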