commons-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From sebb <seb...@gmail.com>
Subject Re: [CSV] CSV-58 is killing me
Date Thu, 02 May 2013 16:22:03 GMT
On 2 May 2013 14:56, sebb <sebbaz@gmail.com> wrote:

> On 2 May 2013 09:48, Benedikt Ritter <britter@apache.org> wrote:
>
>> 2013/5/1 sebb <sebbaz@gmail.com>
>>
>> > On 1 May 2013 08:53, Benedikt Ritter <britter@apache.org> wrote:
>> >
>> > > Hi,
>> > >
>> > > I have tried to solve CSV-58 - Escape handling needs rethinking [1] a
>> few
>> > > times over the past weeks now. But I can not see a way to get this
>> > working.
>> > > There are two problems with our current implementation:
>> > >
>> > > 1. a sequence of " test \a test" will be parsed to "test a test",
>> > although
>> > > there is no reason to escape 'a'. The expected behavior is to let the
>> > > sequence \a unchanged.
>> > >
>> >
>> > An alternative would be to throw an Exception.
>> >
>>
>> Haven't thought about that alternative so far. It is was the java compiler
>> would do if it finds "\a" in a String.
>> If we decide to through an exception, the message should state very
>> clearly
>> what has gone wrong, including the position of the misplaced escape char.
>> I'd like to hear more opinions on this.
>>
>>
> Throwing an exception stops the parsing, and obviously only catches the
> first error.
> Leaving the sequences as is means potentially having to scan a large
> output file to find them all.
>
> There's another option: invoke a callback when any invalid sequences are
> detected.
>
> That would allow for various possibilities, depending on what callback was
> provided.
> - throw Exception with details
> - ignore (with optional log)
> - replace with some other sequence (with optional log)
>
> I don't think CSV should do any logging of its own, but the caller can do
> so in the callback
>
> However we would need to decide what the default callback should do.
> If we throw an Exception, then it is obvious that something is wrong.
> If we ignore the sequence (leave as is) then there is no easy way for the
> caller to know if there have been any errors.
>
>
>> >
>> >
>> > > 2. If the escape char is different from backslash (for example '!'), a
>> > > sequence of "test !r test" will be parsed to "test CR test"
>> > >
>> > >
>> > Is that how other CSV parsers behave? i.e. do they always use \ for
>> > escaping EOL?
>> >
>>
>> Haven't tested this. I'll try to have a look this weekend.
>>
>> The problem with the current impl is a backslash as part of an escape
>> sequence.
>> This is what readEscape() currently does: It checks if the lexer has
>> reached a character combination of (for example) \r and replaces this with
>> CR. Otherwise it swallows the escape character and returns the next
>> character.
>>
>> There are two problems with this:
>> 1. it does so if it recognizes the escape character (which doesn't have to
>> be a backslash). This is why !r gets replaced by CR.
>> 2. it swallows the escape and returns the next character (default branch
>> in
>> the switch statement). This should only happen if the next character is
>> one
>> of the special characters from the format (e.g. the delimiter). This is
>> why
>> '\a' gets replaced by 'a'
>>
>
> I think that's wrong, because one cannot distinguish "\a" from a".
>
> The readEscape() method currently returns an int; this cannot be used to
> distinguish between a valid escape and some other character.
> If we only want to support throwing an Exception for invalid escape
> sequence, then it's trivial to fix this - just throw the Exception.
>
> However, to support "as is" passthrough of invalid sequences, the caller
> needs to know when to restore the escape character to the output.
>
> One way to do this would be to change the return type to a char[] array.
> This would either be a single char or the escape followed by a single char.
> The return array could just be appended to the buffer. This should work
> quite well with a call-back, as the callback could also return a char array.
>
> The only downside I can see is that append(char[]) may be slightly slower
> than append(char).
>
>
Found another way to do it: return -1 (EOF) which cannot otherwise be
returned by the method.
The last character is then accessible via in.getLastChar().
i.e. if -1 is returned, output escape followed by lastChar.


Needs some tweaks to readEscape, because \LF etc. stand for themselves,
i.e. both \r and \CR => CR.

What needs to be decided is whether both !r and !CR => CR if the escape
character is ! rather than \, and if so, whether \CR still stands for CR.

But I think the parsing can be tweaked without too much rewriting.


> An alternative is to throw an exception.
>>
>> I think it may be necessary to separate backslash handling from escape
>> handling.
>>
>>
>>
> Possibly. However that may make parsing harder.
>
>
>> >
>> >
>> > > I have tried different approaches to handle this issue but there are
>> > always
>> > > corner cases that I'm unable to cover. I'm getting the feeling that
>> I'm
>> > > just to dump to solve this, so any comments or suggestions would be
>> > really
>> > > appreciated!
>> > >
>> > >
>> > Perhaps it would help to commit the unit tests you are working to solve.
>> > The tests can be disabled.
>> >
>> > Or perhaps better use a class whose name is not one of the ones picked
>> up
>> > by Surefire.
>> > This allows the test to be run independently by Eclipse (and Maven) but
>> > won't mess with CI builds.
>> >
>>
>> I have already committed three tests to CSVLexerTest and set them to be
>> ignored. Maybe you can review those and tell me whether my assumptions are
>> correct.
>>
>>
>> >
>> > Benedikt
>> > > [1] https://issues.apache.org/jira/browse/CSV-58
>> > >
>> > > --
>> > > http://people.apache.org/~britter/
>> > > http://www.systemoutprintln.de/
>> > > http://twitter.com/BenediktRitter
>> > > http://github.com/britter
>> > >
>> >
>>
>>
>>
>> --
>> http://people.apache.org/~britter/
>> http://www.systemoutprintln.de/
>> http://twitter.com/BenediktRitter
>> http://github.com/britter
>>
>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message