commons-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From sebb <seb...@gmail.com>
Subject Re: [CSV] CSV-58 is killing me
Date Thu, 02 May 2013 20:13:47 GMT
On 2 May 2013 18:26, Benedikt Ritter <britter@apache.org> wrote:

> 2013/5/2 sebb <sebbaz@gmail.com>
>
> > On 2 May 2013 14:56, sebb <sebbaz@gmail.com> wrote:
> >
> > > On 2 May 2013 09:48, Benedikt Ritter <britter@apache.org> wrote:
> > >
> > >> 2013/5/1 sebb <sebbaz@gmail.com>
> > >>
> > >> > On 1 May 2013 08:53, Benedikt Ritter <britter@apache.org> wrote:
> > >> >
> > >> > > Hi,
> > >> > >
> > >> > > I have tried to solve CSV-58 - Escape handling needs rethinking
> [1]
> > a
> > >> few
> > >> > > times over the past weeks now. But I can not see a way to get
this
> > >> > working.
> > >> > > There are two problems with our current implementation:
> > >> > >
> > >> > > 1. a sequence of " test \a test" will be parsed to "test a test",
> > >> > although
> > >> > > there is no reason to escape 'a'. The expected behavior is to
let
> > the
> > >> > > sequence \a unchanged.
> > >> > >
> > >> >
> > >> > An alternative would be to throw an Exception.
> > >> >
> > >>
> > >> Haven't thought about that alternative so far. It is was the java
> > compiler
> > >> would do if it finds "\a" in a String.
> > >> If we decide to through an exception, the message should state very
> > >> clearly
> > >> what has gone wrong, including the position of the misplaced escape
> > char.
> > >> I'd like to hear more opinions on this.
> > >>
> > >>
> > > Throwing an exception stops the parsing, and obviously only catches the
> > > first error.
> > > Leaving the sequences as is means potentially having to scan a large
> > > output file to find them all.
> > >
> > > There's another option: invoke a callback when any invalid sequences
> are
> > > detected.
> > >
> > > That would allow for various possibilities, depending on what callback
> > was
> > > provided.
> > > - throw Exception with details
> > > - ignore (with optional log)
> > > - replace with some other sequence (with optional log)
> > >
> > > I don't think CSV should do any logging of its own, but the caller can
> do
> > > so in the callback
> > >
> > > However we would need to decide what the default callback should do.
> > > If we throw an Exception, then it is obvious that something is wrong.
> > > If we ignore the sequence (leave as is) then there is no easy way for
> the
> > > caller to know if there have been any errors.
> > >
> > >
> > >> >
> > >> >
> > >> > > 2. If the escape char is different from backslash (for example
> > '!'), a
> > >> > > sequence of "test !r test" will be parsed to "test CR test"
> > >> > >
> > >> > >
> > >> > Is that how other CSV parsers behave? i.e. do they always use \ for
> > >> > escaping EOL?
> > >> >
> > >>
> > >> Haven't tested this. I'll try to have a look this weekend.
> > >>
> > >> The problem with the current impl is a backslash as part of an escape
> > >> sequence.
> > >> This is what readEscape() currently does: It checks if the lexer has
> > >> reached a character combination of (for example) \r and replaces this
> > with
> > >> CR. Otherwise it swallows the escape character and returns the next
> > >> character.
> > >>
> > >> There are two problems with this:
> > >> 1. it does so if it recognizes the escape character (which doesn't
> have
> > to
> > >> be a backslash). This is why !r gets replaced by CR.
> > >> 2. it swallows the escape and returns the next character (default
> branch
> > >> in
> > >> the switch statement). This should only happen if the next character
> is
> > >> one
> > >> of the special characters from the format (e.g. the delimiter). This
> is
> > >> why
> > >> '\a' gets replaced by 'a'
> > >>
> > >
> > > I think that's wrong, because one cannot distinguish "\a" from a".
> > >
> > > The readEscape() method currently returns an int; this cannot be used
> to
> > > distinguish between a valid escape and some other character.
> > > If we only want to support throwing an Exception for invalid escape
> > > sequence, then it's trivial to fix this - just throw the Exception.
> > >
> > > However, to support "as is" passthrough of invalid sequences, the
> caller
> > > needs to know when to restore the escape character to the output.
> > >
> > > One way to do this would be to change the return type to a char[]
> array.
> > > This would either be a single char or the escape followed by a single
> > char.
> > > The return array could just be appended to the buffer. This should work
> > > quite well with a call-back, as the callback could also return a char
> > array.
> > >
> > > The only downside I can see is that append(char[]) may be slightly
> slower
> > > than append(char).
> > >
> > >
> > Found another way to do it: return -1 (EOF) which cannot otherwise be
> > returned by the method.
> > The last character is then accessible via in.getLastChar().
> > i.e. if -1 is returned, output escape followed by lastChar.
> >
> >
> > Needs some tweaks to readEscape, because \LF etc. stand for themselves,
> > i.e. both \r and \CR => CR.
> >
> > What needs to be decided is whether both !r and !CR => CR if the escape
> > character is ! rather than \, and if so, whether \CR still stands for CR.
> >
> > But I think the parsing can be tweaked without too much rewriting.
> >
>
> I think there are four cases to cover (I'm using ! as escape in this
> example):
> 1. !r => nothing to escape, append '!r'
>

Not sure I agree with that; currently \r => CR if \ is the escape.


> 2. !\r => the CR character in its character representation. Replace '\r' by
> CR and escape it. continue parsing of the token
>

Not sure I agree here either. I think that's an invalid sequence unless \
is one of the meta-characters.

Should CSV unconditionally recognise \r \n etc as CR LF?
If so, then they don't need a ! escape beforehand.


> 3. !CR => esacpe CR and contine parsing of the token
>

!CR should be replaced by CR, i.e. *un*escaped

4. !\CR => escape '\' if it is one of the characters used in the format, in
> other words: remove the '!' and append the '\'. Then EOL => next record,
> but only if CR is the record separator (?).
>


> I've tried several times and failed. Do you have the time to implement your
> proposal? I'll probably have some time this weekend to try once again :-)
>
>
Yes, I think I can make a start on amending readEscape.

However that won't get us anywhere until the rules are clear; I don't think
they are currently.

It should be possible to round-trip escaping then unescaping.

It may be easier to consider all possible escaped output to decide how to
unescape, and what is not a valid sequence.


>
> >
> >
> > > An alternative is to throw an exception.
> > >>
> > >> I think it may be necessary to separate backslash handling from escape
> > >> handling.
> > >>
> > >>
> > >>
> > > Possibly. However that may make parsing harder.
> > >
> > >
> > >> >
> > >> >
> > >> > > I have tried different approaches to handle this issue but there
> are
> > >> > always
> > >> > > corner cases that I'm unable to cover. I'm getting the feeling
> that
> > >> I'm
> > >> > > just to dump to solve this, so any comments or suggestions would
> be
> > >> > really
> > >> > > appreciated!
> > >> > >
> > >> > >
> > >> > Perhaps it would help to commit the unit tests you are working to
> > solve.
> > >> > The tests can be disabled.
> > >> >
> > >> > Or perhaps better use a class whose name is not one of the ones
> picked
> > >> up
> > >> > by Surefire.
> > >> > This allows the test to be run independently by Eclipse (and Maven)
> > but
> > >> > won't mess with CI builds.
> > >> >
> > >>
> > >> I have already committed three tests to CSVLexerTest and set them to
> be
> > >> ignored. Maybe you can review those and tell me whether my assumptions
> > are
> > >> correct.
> > >>
> > >>
> > >> >
> > >> > Benedikt
> > >> > > [1] https://issues.apache.org/jira/browse/CSV-58
> > >> > >
> > >> > > --
> > >> > > http://people.apache.org/~britter/
> > >> > > http://www.systemoutprintln.de/
> > >> > > http://twitter.com/BenediktRitter
> > >> > > http://github.com/britter
> > >> > >
> > >> >
> > >>
> > >>
> > >>
> > >> --
> > >> http://people.apache.org/~britter/
> > >> http://www.systemoutprintln.de/
> > >> http://twitter.com/BenediktRitter
> > >> http://github.com/britter
> > >>
> > >
> > >
> >
>
>
>
> --
> http://people.apache.org/~britter/
> http://www.systemoutprintln.de/
> http://twitter.com/BenediktRitter
> http://github.com/britter
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message