flink-user mailing list archives

From Fabian Hueske <fhue...@apache.org>
Subject Re: Quotes in fields of CsvInputFormat
Date Tue, 09 Dec 2014 13:32:48 GMT
With the current implementation, quoted string parsing kicks in if the
first non-whitespace character of a field is a double quote (as in
Malte's case). I think this behaviour can be quite unexpected for users.
Wouldn't it be better to make String parsing more explicit, i.e., add a
switch to enable or disable quoted string parsing? With the current
implementation, however, that setting would affect all String fields in
a file.
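
To make the rule concrete, here is a minimal sketch of the behaviour described above, together with the kind of switch being proposed. It is an illustration only and not Flink's parser code: the class name QuotedFieldSketch, the method parseField and the quotedParsing flag are all hypothetical, and escaping of quotes inside a quoted field is ignored.

```java
// Illustrative sketch only; NOT Flink's actual parser implementation.
// QuotedFieldSketch, parseField and quotedParsing are hypothetical names.
final class QuotedFieldSketch {

    /**
     * Parses one raw field (the text between two field delimiters).
     *
     * @param raw           the raw field text
     * @param quotedParsing hypothetical switch: when false, quotes are taken literally
     */
    static String parseField(String raw, boolean quotedParsing) {
        String trimmed = raw.trim();

        // Quoted parsing only kicks in if the switch is on and the first
        // non-whitespace character of the field is a double quote.
        if (!quotedParsing || !trimmed.startsWith("\"")) {
            return raw;
        }

        // Look for the closing quote (no escape handling in this sketch).
        int close = trimmed.indexOf('"', 1);
        if (close < 0) {
            throw new IllegalArgumentException("Unterminated quoted field: " + raw);
        }

        // Anything after the closing quote counts as invalid trailing data.
        if (close != trimmed.length() - 1) {
            throw new IllegalArgumentException("Invalid trailing data after quoted field: " + raw);
        }

        // Return the content between the quotes.
        return trimmed.substring(1, close);
    }

    public static void main(String[] args) {
        // Switch off: the field from line B is kept literally.
        System.out.println(parseField("\"hhh\" xx", false));

        // Switch on: a cleanly quoted field loses its quotes.
        System.out.println(parseField("\"hhh\"", true));

        // Switch on: the field from line B is rejected.
        try {
            parseField("\"hhh\" xx", true);
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

With the switch off, the second field of line B comes back as the literal text Malte expects. With it on, the same field is rejected because of the trailing xx, which mirrors the failure reported in this thread.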

Cheers, Fabian

2014-12-09 12:17 GMT+01:00 Max Michels <max@data-artisans.com>:

> Hi Malte,
>
> Typically, double quotes are used to identify strings and thus are not
> interpreted literally. Any data in a field after a double quoted string is
> regarded as invalid trailing data.
>
> You could replace double quotes with single quotes:
>
> A|ggg
> B|'hhh' xx
> C|xxx
>
> This results in the expected >'hhh' xx< for the second line.
>
> Best regards,
> Max
>
> On Fri, Dec 5, 2014 at 4:44 PM, Malte Schwarzer <ms@mieo.de> wrote:
>
>> Hi Stephan,
>>
>> The result should be >"hhh" xx< as the field value. Enclosures should be
>> disabled, but there seems to be no method to do that.
>>
>>
>> Malte
>>
>> From: Stephan Ewen <sewen@apache.org>
>> Reply-To: <user@flink.incubator.apache.org>
>> Date: Friday, 5 December 2014 16:28
>> To: <user@flink.incubator.apache.org>
>> Subject: Re: Quotes in fields of CsvInputFormat
>>
>> Hi!
>>
>> The parser interprets the quotes as quotes for the field. That means the
>> second field (the string) stops after the "hhh" and the xx is considered
>> invalid trailing data.
>>
>> What do you expect as the result of parsing that line?
>>
>> Stephan
>>
>>
>> On Fri, Dec 5, 2014 at 4:16 PM, Malte Schwarzer <ms@mieo.de> wrote:
>>
>>> Hi,
>>>
>>> I’m trying to import a CSV file, but the parser seems to have problems
>>> with quotes at the beginning of a field. Is there a way to set or disable
>>> enclosures for the CSV input?
>>>
>>> This is my code:
>>>
>>> DataSet<Tuple2<String, String>> res = env.readCsvFile(inputCsvFilename)
>>>                 .fieldDelimiter('|')
>>>                 .types(String.class, String.class);
>>>
>>> CSV:
>>>
>>> A|ggg
>>> B|"hhh" xx
>>> C|xxx
>>>
>>> As a result, I’m receiving a ParseException for line B:
>>>
>>> org.apache.flink.api.common.io.ParseException: Line could not be
>>> parsed: 'B|"hhh" xx'
>>>
>>>
>>> Thanks,
>>> Malte
>>>
>>
>>
>
