commons-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From sebb <seb...@gmail.com>
Subject Re: [CSV] CSV-58 is killing me
Date Thu, 02 May 2013 13:56:01 GMT
On 2 May 2013 09:48, Benedikt Ritter <britter@apache.org> wrote:

> 2013/5/1 sebb <sebbaz@gmail.com>
>
> > On 1 May 2013 08:53, Benedikt Ritter <britter@apache.org> wrote:
> >
> > > Hi,
> > >
> > > I have tried to solve CSV-58 - Escape handling needs rethinking [1] a
> few
> > > times over the past weeks now. But I can not see a way to get this
> > working.
> > > There are two problems with our current implementation:
> > >
> > > 1. a sequence of " test \a test" will be parsed to "test a test",
> > although
> > > there is no reason to escape 'a'. The expected behavior is to let the
> > > sequence \a unchanged.
> > >
> >
> > An alternative would be to throw an Exception.
> >
>
> Haven't thought about that alternative so far. It is was the java compiler
> would do if it finds "\a" in a String.
> If we decide to through an exception, the message should state very clearly
> what has gone wrong, including the position of the misplaced escape char.
> I'd like to hear more opinions on this.
>
>
Throwing an exception stops the parsing, and obviously only catches the
first error.
Leaving the sequences as is means potentially having to scan a large output
file to find them all.

There's another option: invoke a callback when any invalid sequences are
detected.

That would allow for various possibilities, depending on what callback was
provided.
- throw Exception with details
- ignore (with optional log)
- replace with some other sequence (with optional log)

I don't think CSV should do any logging of its own, but the caller can do
so in the callback

However we would need to decide what the default callback should do.
If we throw an Exception, then it is obvious that something is wrong.
If we ignore the sequence (leave as is) then there is no easy way for the
caller to know if there have been any errors.


> >
> >
> > > 2. If the escape char is different from backslash (for example '!'), a
> > > sequence of "test !r test" will be parsed to "test CR test"
> > >
> > >
> > Is that how other CSV parsers behave? i.e. do they always use \ for
> > escaping EOL?
> >
>
> Haven't tested this. I'll try to have a look this weekend.
>
> The problem with the current impl is a backslash as part of an escape
> sequence.
> This is what readEscape() currently does: It checks if the lexer has
> reached a character combination of (for example) \r and replaces this with
> CR. Otherwise it swallows the escape character and returns the next
> character.
>
> There are two problems with this:
> 1. it does so if it recognizes the escape character (which doesn't have to
> be a backslash). This is why !r gets replaced by CR.
> 2. it swallows the escape and returns the next character (default branch in
> the switch statement). This should only happen if the next character is one
> of the special characters from the format (e.g. the delimiter). This is why
> '\a' gets replaced by 'a'
>

I think that's wrong, because one cannot distinguish "\a" from a".

The readEscape() method currently returns an int; this cannot be used to
distinguish between a valid escape and some other character.
If we only want to support throwing an Exception for invalid escape
sequence, then it's trivial to fix this - just throw the Exception.

However, to support "as is" passthrough of invalid sequences, the caller
needs to know when to restore the escape character to the output.

One way to do this would be to change the return type to a char[] array.
This would either be a single char or the escape followed by a single char.
The return array could just be appended to the buffer. This should work
quite well with a call-back, as the callback could also return a char array.

The only downside I can see is that append(char[]) may be slightly slower
than append(char).

An alternative is to throw an exception.
>
> I think it may be necessary to separate backslash handling from escape
> handling.
>
>
>
Possibly. However that may make parsing harder.


> >
> >
> > > I have tried different approaches to handle this issue but there are
> > always
> > > corner cases that I'm unable to cover. I'm getting the feeling that I'm
> > > just to dump to solve this, so any comments or suggestions would be
> > really
> > > appreciated!
> > >
> > >
> > Perhaps it would help to commit the unit tests you are working to solve.
> > The tests can be disabled.
> >
> > Or perhaps better use a class whose name is not one of the ones picked up
> > by Surefire.
> > This allows the test to be run independently by Eclipse (and Maven) but
> > won't mess with CI builds.
> >
>
> I have already committed three tests to CSVLexerTest and set them to be
> ignored. Maybe you can review those and tell me whether my assumptions are
> correct.
>
>
> >
> > Benedikt
> > > [1] https://issues.apache.org/jira/browse/CSV-58
> > >
> > > --
> > > http://people.apache.org/~britter/
> > > http://www.systemoutprintln.de/
> > > http://twitter.com/BenediktRitter
> > > http://github.com/britter
> > >
> >
>
>
>
> --
> http://people.apache.org/~britter/
> http://www.systemoutprintln.de/
> http://twitter.com/BenediktRitter
> http://github.com/britter
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message