commons-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Benedikt Ritter <brit...@apache.org>
Subject Re: [CSV] CSV-58 is killing me
Date Thu, 02 May 2013 08:48:13 GMT
2013/5/1 sebb <sebbaz@gmail.com>

> On 1 May 2013 08:53, Benedikt Ritter <britter@apache.org> wrote:
>
> > Hi,
> >
> > I have tried to solve CSV-58 - Escape handling needs rethinking [1] a few
> > times over the past weeks now. But I can not see a way to get this
> working.
> > There are two problems with our current implementation:
> >
> > 1. a sequence of " test \a test" will be parsed to "test a test",
> although
> > there is no reason to escape 'a'. The expected behavior is to let the
> > sequence \a unchanged.
> >
>
> An alternative would be to throw an Exception.
>

Haven't thought about that alternative so far. It is was the java compiler
would do if it finds "\a" in a String.
If we decide to through an exception, the message should state very clearly
what has gone wrong, including the position of the misplaced escape char.
I'd like to hear more opinions on this.


>
>
> > 2. If the escape char is different from backslash (for example '!'), a
> > sequence of "test !r test" will be parsed to "test CR test"
> >
> >
> Is that how other CSV parsers behave? i.e. do they always use \ for
> escaping EOL?
>

Haven't tested this. I'll try to have a look this weekend.

The problem with the current impl is a backslash as part of an escape
sequence.
This is what readEscape() currently does: It checks if the lexer has
reached a character combination of (for example) \r and replaces this with
CR. Otherwise it swallows the escape character and returns the next
character.

There are two problems with this:
1. it does so if it recognizes the escape character (which doesn't have to
be a backslash). This is why !r gets replaced by CR.
2. it swallows the escape and returns the next character (default branch in
the switch statement). This should only happen if the next character is one
of the special characters from the format (e.g. the delimiter). This is why
'\a' gets replaced by 'a'
An alternative is to throw an exception.

I think it may be necessary to separate backslash handling from escape
handling.


>
>
> > I have tried different approaches to handle this issue but there are
> always
> > corner cases that I'm unable to cover. I'm getting the feeling that I'm
> > just to dump to solve this, so any comments or suggestions would be
> really
> > appreciated!
> >
> >
> Perhaps it would help to commit the unit tests you are working to solve.
> The tests can be disabled.
>
> Or perhaps better use a class whose name is not one of the ones picked up
> by Surefire.
> This allows the test to be run independently by Eclipse (and Maven) but
> won't mess with CI builds.
>

I have already committed three tests to CSVLexerTest and set them to be
ignored. Maybe you can review those and tell me whether my assumptions are
correct.


>
> Benedikt
> > [1] https://issues.apache.org/jira/browse/CSV-58
> >
> > --
> > http://people.apache.org/~britter/
> > http://www.systemoutprintln.de/
> > http://twitter.com/BenediktRitter
> > http://github.com/britter
> >
>



-- 
http://people.apache.org/~britter/
http://www.systemoutprintln.de/
http://twitter.com/BenediktRitter
http://github.com/britter

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message