flink-user mailing list archives

From Max Michels <...@data-artisans.com>
Subject Re: Quotes in fields of CsvInputFormat
Date Tue, 09 Dec 2014 17:51:53 GMT
That sounds like a good idea. Just like setDelimeter("|"), one should be
able to do a setParseDoubleQuotes(false) to disable the special handling of
double quotes.

You're right, Fabian, the current implementation treats all String fields
alike. Maybe we can expect the user to provide a consistently formatted
input file (i.e. with or without the use of double quotes as identifiers)?
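
To make the proposal concrete, here is a minimal sketch in plain Java of the
behaviour discussed in this thread. It is not Flink's parser: the class
QuotedFieldSketch, the method parseField, the exception type and the
parseDoubleQuotes flag are all hypothetical names used for illustration, and
tolerating whitespace after the closing quote is an assumption.

```java
final class QuotedFieldSketch {

    /** Hypothetical stand-in for a parser's own exception type. */
    static class FieldParseException extends RuntimeException {
        FieldParseException(String message) {
            super(message);
        }
    }

    /**
     * Parses one field. With parseDoubleQuotes enabled, a field whose first
     * non-whitespace character is a double quote is treated as a quoted
     * string, and anything but whitespace after the closing quote is
     * rejected. With it disabled, quotes are ordinary characters.
     */
    static String parseField(String field, boolean parseDoubleQuotes) {
        if (!parseDoubleQuotes) {
            return field;
        }
        // Skip leading whitespace to find the first significant character.
        int start = 0;
        while (start < field.length() && Character.isWhitespace(field.charAt(start))) {
            start++;
        }
        if (start == field.length() || field.charAt(start) != '"') {
            return field; // not a quoted field, take it literally
        }
        int end = field.indexOf('"', start + 1);
        if (end < 0) {
            throw new FieldParseException("Unterminated quoted string: " + field);
        }
        // Only whitespace may follow the closing quote (assumption).
        for (int i = end + 1; i < field.length(); i++) {
            if (!Character.isWhitespace(field.charAt(i))) {
                throw new FieldParseException("Invalid trailing data: " + field);
            }
        }
        return field.substring(start + 1, end);
    }

    public static void main(String[] args) {
        String[] lines = {"A|ggg", "B|\"hhh\" xx", "C|xxx"};
        for (boolean quoted : new boolean[] {true, false}) {
            for (String line : lines) {
                String second = line.split("\\|", -1)[1];
                try {
                    System.out.println(quoted + ": " + parseField(second, quoted));
                } catch (FieldParseException e) {
                    System.out.println(quoted + ": error, " + e.getMessage());
                }
            }
        }
    }
}
```

With the flag enabled, line B of Malte's file is rejected because of the
trailing xx; with it disabled, the field comes back unchanged, quotes
included. Like the current implementation, the flag applies to every String
field of the file alike.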

On Tue, Dec 9, 2014 at 2:32 PM, Fabian Hueske <fhueske@apache.org> wrote:

> With the current implementation, quoted string parsing kicks in if the
> first non-whitespace character of a field is a double quote (just as in
> Malte's case). I think this behaviour can be quite unexpected for users.
> Wouldn't it be better to make the behaviour of the String parsing more
> explicit, i.e., add a switch to enable or disable quoted string parsing?
> With the current implementation, the configuration would affect all String
> fields in a file, though...
>
> Cheers, Fabian
>
> 2014-12-09 12:17 GMT+01:00 Max Michels <max@data-artisans.com>:
>
>> Hi Malte,
>>
>> Typically, double quotes are used to identify strings and thus are not
>> interpreted literally. Any data in a field after a double quoted string is
>> regarded as invalid trailing data.
>>
>> You could replace double quotes with single quotes:
>>
>> A|ggg
>> B|'hhh' xx
>> C|xxx
>>
>> This results in the expected >'hhh' xx< for the second line.
>>
>> Best regards,
>> Max
>>
>> On Fri, Dec 5, 2014 at 4:44 PM, Malte Schwarzer <ms@mieo.de> wrote:
>>
>>> Hi Stephan,
>>>
>>> The result should be >"hhh" xx< as field value. Enclosures should be
>>> disabled but there seems to be no method to do that.
>>>
>>>
>>> Malte
>>>
>>> From: Stephan Ewen <sewen@apache.org>
>>> Reply-To: <user@flink.incubator.apache.org>
>>> Date: Friday, 5 December 2014 16:28
>>> To: <user@flink.incubator.apache.org>
>>> Subject: Re: Quotes in fields of CsvInputFormat
>>>
>>> Hi!
>>>
>>> The parser interprets the quotes as quotes for the field. That means the
>>> second field (the string) stops after the "hhh" and the xx is considered
>>> invalid trailing data.
>>>
>>> What do you expect as the result of parsing that line?
>>>
>>> Stephan
>>>
>>>
>>> On Fri, Dec 5, 2014 at 4:16 PM, Malte Schwarzer <ms@mieo.de> wrote:
>>>
>>>> Hi,
>>>>
>>>> I’m trying to import a CSV file, but the parser seems to have problems
>>>> with quotes at the beginning of a field. Is there a way to set or
>>>> disable enclosures for the CSV input?
>>>>
>>>> This is my code:
>>>>
>>>> DataSet<Tuple2<String, String>> res = env.readCsvFile(inputCsvFilename)
>>>>         .fieldDelimiter('|')
>>>>         .types(String.class, String.class);
>>>>
>>>> CSV:
>>>>
>>>> A|ggg
>>>> B|"hhh" xx
>>>> C|xxx
>>>>
>>>> As a result, I’m receiving a ParseException for line B:
>>>>
>>>> org.apache.flink.api.common.io.ParseException: Line could not be
>>>> parsed: 'B|"hhh" xx'
>>>>
>>>>
>>>> Thanks,
>>>> Malte
>>>>
>>>
>>>
>>
>
