commons-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Gary Gregory <garydgreg...@gmail.com>
Subject Re: [CSV] CSV-58 is killing me
Date Thu, 02 May 2013 20:25:06 GMT
(top-posting cause it's getting hard to find stuff)

>Should CSV unconditionally recognise \r \n etc as CR LF?
>If so, then they don't need a ! escape beforehand.

There is ONE escape character, so if it is '!', then "\r " is just "\r".

If we want >1 escape character, then that's a different story. Right now,
we have one.

Gary


On Thu, May 2, 2013 at 4:13 PM, sebb <sebbaz@gmail.com> wrote:

> On 2 May 2013 18:26, Benedikt Ritter <britter@apache.org> wrote:
>
> > 2013/5/2 sebb <sebbaz@gmail.com>
> >
> > > On 2 May 2013 14:56, sebb <sebbaz@gmail.com> wrote:
> > >
> > > > On 2 May 2013 09:48, Benedikt Ritter <britter@apache.org> wrote:
> > > >
> > > >> 2013/5/1 sebb <sebbaz@gmail.com>
> > > >>
> > > >> > On 1 May 2013 08:53, Benedikt Ritter <britter@apache.org>
wrote:
> > > >> >
> > > >> > > Hi,
> > > >> > >
> > > >> > > I have tried to solve CSV-58 - Escape handling needs rethinking
> > [1]
> > > a
> > > >> few
> > > >> > > times over the past weeks now. But I can not see a way to
get
> this
> > > >> > working.
> > > >> > > There are two problems with our current implementation:
> > > >> > >
> > > >> > > 1. a sequence of " test \a test" will be parsed to "test
a
> test",
> > > >> > although
> > > >> > > there is no reason to escape 'a'. The expected behavior
is to
> let
> > > the
> > > >> > > sequence \a unchanged.
> > > >> > >
> > > >> >
> > > >> > An alternative would be to throw an Exception.
> > > >> >
> > > >>
> > > >> Haven't thought about that alternative so far. It is was the java
> > > compiler
> > > >> would do if it finds "\a" in a String.
> > > >> If we decide to through an exception, the message should state very
> > > >> clearly
> > > >> what has gone wrong, including the position of the misplaced escape
> > > char.
> > > >> I'd like to hear more opinions on this.
> > > >>
> > > >>
> > > > Throwing an exception stops the parsing, and obviously only catches
> the
> > > > first error.
> > > > Leaving the sequences as is means potentially having to scan a large
> > > > output file to find them all.
> > > >
> > > > There's another option: invoke a callback when any invalid sequences
> > are
> > > > detected.
> > > >
> > > > That would allow for various possibilities, depending on what
> callback
> > > was
> > > > provided.
> > > > - throw Exception with details
> > > > - ignore (with optional log)
> > > > - replace with some other sequence (with optional log)
> > > >
> > > > I don't think CSV should do any logging of its own, but the caller
> can
> > do
> > > > so in the callback
> > > >
> > > > However we would need to decide what the default callback should do.
> > > > If we throw an Exception, then it is obvious that something is wrong.
> > > > If we ignore the sequence (leave as is) then there is no easy way for
> > the
> > > > caller to know if there have been any errors.
> > > >
> > > >
> > > >> >
> > > >> >
> > > >> > > 2. If the escape char is different from backslash (for example
> > > '!'), a
> > > >> > > sequence of "test !r test" will be parsed to "test CR test"
> > > >> > >
> > > >> > >
> > > >> > Is that how other CSV parsers behave? i.e. do they always use
\
> for
> > > >> > escaping EOL?
> > > >> >
> > > >>
> > > >> Haven't tested this. I'll try to have a look this weekend.
> > > >>
> > > >> The problem with the current impl is a backslash as part of an
> escape
> > > >> sequence.
> > > >> This is what readEscape() currently does: It checks if the lexer has
> > > >> reached a character combination of (for example) \r and replaces
> this
> > > with
> > > >> CR. Otherwise it swallows the escape character and returns the next
> > > >> character.
> > > >>
> > > >> There are two problems with this:
> > > >> 1. it does so if it recognizes the escape character (which doesn't
> > have
> > > to
> > > >> be a backslash). This is why !r gets replaced by CR.
> > > >> 2. it swallows the escape and returns the next character (default
> > branch
> > > >> in
> > > >> the switch statement). This should only happen if the next character
> > is
> > > >> one
> > > >> of the special characters from the format (e.g. the delimiter). This
> > is
> > > >> why
> > > >> '\a' gets replaced by 'a'
> > > >>
> > > >
> > > > I think that's wrong, because one cannot distinguish "\a" from a".
> > > >
> > > > The readEscape() method currently returns an int; this cannot be used
> > to
> > > > distinguish between a valid escape and some other character.
> > > > If we only want to support throwing an Exception for invalid escape
> > > > sequence, then it's trivial to fix this - just throw the Exception.
> > > >
> > > > However, to support "as is" passthrough of invalid sequences, the
> > caller
> > > > needs to know when to restore the escape character to the output.
> > > >
> > > > One way to do this would be to change the return type to a char[]
> > array.
> > > > This would either be a single char or the escape followed by a single
> > > char.
> > > > The return array could just be appended to the buffer. This should
> work
> > > > quite well with a call-back, as the callback could also return a char
> > > array.
> > > >
> > > > The only downside I can see is that append(char[]) may be slightly
> > slower
> > > > than append(char).
> > > >
> > > >
> > > Found another way to do it: return -1 (EOF) which cannot otherwise be
> > > returned by the method.
> > > The last character is then accessible via in.getLastChar().
> > > i.e. if -1 is returned, output escape followed by lastChar.
> > >
> > >
> > > Needs some tweaks to readEscape, because \LF etc. stand for themselves,
> > > i.e. both \r and \CR => CR.
> > >
> > > What needs to be decided is whether both !r and !CR => CR if the escape
> > > character is ! rather than \, and if so, whether \CR still stands for
> CR.
> > >
> > > But I think the parsing can be tweaked without too much rewriting.
> > >
> >
> > I think there are four cases to cover (I'm using ! as escape in this
> > example):
> > 1. !r => nothing to escape, append '!r'
> >
>
> Not sure I agree with that; currently \r => CR if \ is the escape.
>
>
> > 2. !\r => the CR character in its character representation. Replace '\r'
> by
> > CR and escape it. continue parsing of the token
> >
>
> Not sure I agree here either. I think that's an invalid sequence unless \
> is one of the meta-characters.
>
> Should CSV unconditionally recognise \r \n etc as CR LF?
> If so, then they don't need a ! escape beforehand.
>
>
> > 3. !CR => esacpe CR and contine parsing of the token
> >
>
> !CR should be replaced by CR, i.e. *un*escaped
>
> 4. !\CR => escape '\' if it is one of the characters used in the format, in
> > other words: remove the '!' and append the '\'. Then EOL => next record,
> > but only if CR is the record separator (?).
> >
>
>
> > I've tried several times and failed. Do you have the time to implement
> your
> > proposal? I'll probably have some time this weekend to try once again :-)
> >
> >
> Yes, I think I can make a start on amending readEscape.
>
> However that won't get us anywhere until the rules are clear; I don't think
> they are currently.
>
> It should be possible to round-trip escaping then unescaping.
>
> It may be easier to consider all possible escaped output to decide how to
> unescape, and what is not a valid sequence.
>
>
> >
> > >
> > >
> > > > An alternative is to throw an exception.
> > > >>
> > > >> I think it may be necessary to separate backslash handling from
> escape
> > > >> handling.
> > > >>
> > > >>
> > > >>
> > > > Possibly. However that may make parsing harder.
> > > >
> > > >
> > > >> >
> > > >> >
> > > >> > > I have tried different approaches to handle this issue but
there
> > are
> > > >> > always
> > > >> > > corner cases that I'm unable to cover. I'm getting the feeling
> > that
> > > >> I'm
> > > >> > > just to dump to solve this, so any comments or suggestions
would
> > be
> > > >> > really
> > > >> > > appreciated!
> > > >> > >
> > > >> > >
> > > >> > Perhaps it would help to commit the unit tests you are working
to
> > > solve.
> > > >> > The tests can be disabled.
> > > >> >
> > > >> > Or perhaps better use a class whose name is not one of the ones
> > picked
> > > >> up
> > > >> > by Surefire.
> > > >> > This allows the test to be run independently by Eclipse (and
> Maven)
> > > but
> > > >> > won't mess with CI builds.
> > > >> >
> > > >>
> > > >> I have already committed three tests to CSVLexerTest and set them
to
> > be
> > > >> ignored. Maybe you can review those and tell me whether my
> assumptions
> > > are
> > > >> correct.
> > > >>
> > > >>
> > > >> >
> > > >> > Benedikt
> > > >> > > [1] https://issues.apache.org/jira/browse/CSV-58
> > > >> > >
> > > >> > > --
> > > >> > > http://people.apache.org/~britter/
> > > >> > > http://www.systemoutprintln.de/
> > > >> > > http://twitter.com/BenediktRitter
> > > >> > > http://github.com/britter
> > > >> > >
> > > >> >
> > > >>
> > > >>
> > > >>
> > > >> --
> > > >> http://people.apache.org/~britter/
> > > >> http://www.systemoutprintln.de/
> > > >> http://twitter.com/BenediktRitter
> > > >> http://github.com/britter
> > > >>
> > > >
> > > >
> > >
> >
> >
> >
> > --
> > http://people.apache.org/~britter/
> > http://www.systemoutprintln.de/
> > http://twitter.com/BenediktRitter
> > http://github.com/britter
> >
>



-- 
E-Mail: garydgregory@gmail.com | ggregory@apache.org
Java Persistence with Hibernate, Second Edition<http://www.manning.com/bauer3/>
JUnit in Action, Second Edition <http://www.manning.com/tahchiev/>
Spring Batch in Action <http://www.manning.com/templier/>
Blog: http://garygregory.wordpress.com
Home: http://garygregory.com/
Tweet! http://twitter.com/GaryGregory

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message