commons-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Benedikt Ritter <brit...@apache.org>
Subject Re: [CSV] CSV-58 is killing me
Date Thu, 02 May 2013 20:32:43 GMT
2013/5/2 Gary Gregory <garydgregory@gmail.com>

> (top-posting cause it's getting hard to find stuff)
>
> >Should CSV unconditionally recognise \r \n etc as CR LF?
> >If so, then they don't need a ! escape beforehand.
>
> There is ONE escape character, so if it is '!', then "\r " is just "\r".


> If we want >1 escape character, then that's a different story. Right now,
> we have one.
>

Sorry guys, I got the whole story wrong, then. I made the assumption that
we always represent CR as \r no matter what escape character we use.

thanks for clarifying.


>
> Gary
>
>
> On Thu, May 2, 2013 at 4:13 PM, sebb <sebbaz@gmail.com> wrote:
>
> > On 2 May 2013 18:26, Benedikt Ritter <britter@apache.org> wrote:
> >
> > > 2013/5/2 sebb <sebbaz@gmail.com>
> > >
> > > > On 2 May 2013 14:56, sebb <sebbaz@gmail.com> wrote:
> > > >
> > > > > On 2 May 2013 09:48, Benedikt Ritter <britter@apache.org> wrote:
> > > > >
> > > > >> 2013/5/1 sebb <sebbaz@gmail.com>
> > > > >>
> > > > >> > On 1 May 2013 08:53, Benedikt Ritter <britter@apache.org>
> wrote:
> > > > >> >
> > > > >> > > Hi,
> > > > >> > >
> > > > >> > > I have tried to solve CSV-58 - Escape handling needs
> rethinking
> > > [1]
> > > > a
> > > > >> few
> > > > >> > > times over the past weeks now. But I can not see a
way to get
> > this
> > > > >> > working.
> > > > >> > > There are two problems with our current implementation:
> > > > >> > >
> > > > >> > > 1. a sequence of " test \a test" will be parsed to
"test a
> > test",
> > > > >> > although
> > > > >> > > there is no reason to escape 'a'. The expected behavior
is to
> > let
> > > > the
> > > > >> > > sequence \a unchanged.
> > > > >> > >
> > > > >> >
> > > > >> > An alternative would be to throw an Exception.
> > > > >> >
> > > > >>
> > > > >> Haven't thought about that alternative so far. It is was the
java
> > > > compiler
> > > > >> would do if it finds "\a" in a String.
> > > > >> If we decide to through an exception, the message should state
> very
> > > > >> clearly
> > > > >> what has gone wrong, including the position of the misplaced
> escape
> > > > char.
> > > > >> I'd like to hear more opinions on this.
> > > > >>
> > > > >>
> > > > > Throwing an exception stops the parsing, and obviously only catches
> > the
> > > > > first error.
> > > > > Leaving the sequences as is means potentially having to scan a
> large
> > > > > output file to find them all.
> > > > >
> > > > > There's another option: invoke a callback when any invalid
> sequences
> > > are
> > > > > detected.
> > > > >
> > > > > That would allow for various possibilities, depending on what
> > callback
> > > > was
> > > > > provided.
> > > > > - throw Exception with details
> > > > > - ignore (with optional log)
> > > > > - replace with some other sequence (with optional log)
> > > > >
> > > > > I don't think CSV should do any logging of its own, but the caller
> > can
> > > do
> > > > > so in the callback
> > > > >
> > > > > However we would need to decide what the default callback should
> do.
> > > > > If we throw an Exception, then it is obvious that something is
> wrong.
> > > > > If we ignore the sequence (leave as is) then there is no easy way
> for
> > > the
> > > > > caller to know if there have been any errors.
> > > > >
> > > > >
> > > > >> >
> > > > >> >
> > > > >> > > 2. If the escape char is different from backslash (for
example
> > > > '!'), a
> > > > >> > > sequence of "test !r test" will be parsed to "test
CR test"
> > > > >> > >
> > > > >> > >
> > > > >> > Is that how other CSV parsers behave? i.e. do they always
use \
> > for
> > > > >> > escaping EOL?
> > > > >> >
> > > > >>
> > > > >> Haven't tested this. I'll try to have a look this weekend.
> > > > >>
> > > > >> The problem with the current impl is a backslash as part of an
> > escape
> > > > >> sequence.
> > > > >> This is what readEscape() currently does: It checks if the lexer
> has
> > > > >> reached a character combination of (for example) \r and replaces
> > this
> > > > with
> > > > >> CR. Otherwise it swallows the escape character and returns the
> next
> > > > >> character.
> > > > >>
> > > > >> There are two problems with this:
> > > > >> 1. it does so if it recognizes the escape character (which doesn't
> > > have
> > > > to
> > > > >> be a backslash). This is why !r gets replaced by CR.
> > > > >> 2. it swallows the escape and returns the next character (default
> > > branch
> > > > >> in
> > > > >> the switch statement). This should only happen if the next
> character
> > > is
> > > > >> one
> > > > >> of the special characters from the format (e.g. the delimiter).
> This
> > > is
> > > > >> why
> > > > >> '\a' gets replaced by 'a'
> > > > >>
> > > > >
> > > > > I think that's wrong, because one cannot distinguish "\a" from a".
> > > > >
> > > > > The readEscape() method currently returns an int; this cannot be
> used
> > > to
> > > > > distinguish between a valid escape and some other character.
> > > > > If we only want to support throwing an Exception for invalid escape
> > > > > sequence, then it's trivial to fix this - just throw the Exception.
> > > > >
> > > > > However, to support "as is" passthrough of invalid sequences, the
> > > caller
> > > > > needs to know when to restore the escape character to the output.
> > > > >
> > > > > One way to do this would be to change the return type to a char[]
> > > array.
> > > > > This would either be a single char or the escape followed by a
> single
> > > > char.
> > > > > The return array could just be appended to the buffer. This should
> > work
> > > > > quite well with a call-back, as the callback could also return a
> char
> > > > array.
> > > > >
> > > > > The only downside I can see is that append(char[]) may be slightly
> > > slower
> > > > > than append(char).
> > > > >
> > > > >
> > > > Found another way to do it: return -1 (EOF) which cannot otherwise be
> > > > returned by the method.
> > > > The last character is then accessible via in.getLastChar().
> > > > i.e. if -1 is returned, output escape followed by lastChar.
> > > >
> > > >
> > > > Needs some tweaks to readEscape, because \LF etc. stand for
> themselves,
> > > > i.e. both \r and \CR => CR.
> > > >
> > > > What needs to be decided is whether both !r and !CR => CR if the
> escape
> > > > character is ! rather than \, and if so, whether \CR still stands for
> > CR.
> > > >
> > > > But I think the parsing can be tweaked without too much rewriting.
> > > >
> > >
> > > I think there are four cases to cover (I'm using ! as escape in this
> > > example):
> > > 1. !r => nothing to escape, append '!r'
> > >
> >
> > Not sure I agree with that; currently \r => CR if \ is the escape.
> >
> >
> > > 2. !\r => the CR character in its character representation. Replace
> '\r'
> > by
> > > CR and escape it. continue parsing of the token
> > >
> >
> > Not sure I agree here either. I think that's an invalid sequence unless \
> > is one of the meta-characters.
> >
> > Should CSV unconditionally recognise \r \n etc as CR LF?
> > If so, then they don't need a ! escape beforehand.
> >
> >
> > > 3. !CR => esacpe CR and contine parsing of the token
> > >
> >
> > !CR should be replaced by CR, i.e. *un*escaped
> >
> > 4. !\CR => escape '\' if it is one of the characters used in the format,
> in
> > > other words: remove the '!' and append the '\'. Then EOL => next
> record,
> > > but only if CR is the record separator (?).
> > >
> >
> >
> > > I've tried several times and failed. Do you have the time to implement
> > your
> > > proposal? I'll probably have some time this weekend to try once again
> :-)
> > >
> > >
> > Yes, I think I can make a start on amending readEscape.
> >
> > However that won't get us anywhere until the rules are clear; I don't
> think
> > they are currently.
> >
> > It should be possible to round-trip escaping then unescaping.
> >
> > It may be easier to consider all possible escaped output to decide how to
> > unescape, and what is not a valid sequence.
> >
> >
> > >
> > > >
> > > >
> > > > > An alternative is to throw an exception.
> > > > >>
> > > > >> I think it may be necessary to separate backslash handling from
> > escape
> > > > >> handling.
> > > > >>
> > > > >>
> > > > >>
> > > > > Possibly. However that may make parsing harder.
> > > > >
> > > > >
> > > > >> >
> > > > >> >
> > > > >> > > I have tried different approaches to handle this issue
but
> there
> > > are
> > > > >> > always
> > > > >> > > corner cases that I'm unable to cover. I'm getting
the feeling
> > > that
> > > > >> I'm
> > > > >> > > just to dump to solve this, so any comments or suggestions
> would
> > > be
> > > > >> > really
> > > > >> > > appreciated!
> > > > >> > >
> > > > >> > >
> > > > >> > Perhaps it would help to commit the unit tests you are working
> to
> > > > solve.
> > > > >> > The tests can be disabled.
> > > > >> >
> > > > >> > Or perhaps better use a class whose name is not one of the
ones
> > > picked
> > > > >> up
> > > > >> > by Surefire.
> > > > >> > This allows the test to be run independently by Eclipse
(and
> > Maven)
> > > > but
> > > > >> > won't mess with CI builds.
> > > > >> >
> > > > >>
> > > > >> I have already committed three tests to CSVLexerTest and set
them
> to
> > > be
> > > > >> ignored. Maybe you can review those and tell me whether my
> > assumptions
> > > > are
> > > > >> correct.
> > > > >>
> > > > >>
> > > > >> >
> > > > >> > Benedikt
> > > > >> > > [1] https://issues.apache.org/jira/browse/CSV-58
> > > > >> > >
> > > > >> > > --
> > > > >> > > http://people.apache.org/~britter/
> > > > >> > > http://www.systemoutprintln.de/
> > > > >> > > http://twitter.com/BenediktRitter
> > > > >> > > http://github.com/britter
> > > > >> > >
> > > > >> >
> > > > >>
> > > > >>
> > > > >>
> > > > >> --
> > > > >> http://people.apache.org/~britter/
> > > > >> http://www.systemoutprintln.de/
> > > > >> http://twitter.com/BenediktRitter
> > > > >> http://github.com/britter
> > > > >>
> > > > >
> > > > >
> > > >
> > >
> > >
> > >
> > > --
> > > http://people.apache.org/~britter/
> > > http://www.systemoutprintln.de/
> > > http://twitter.com/BenediktRitter
> > > http://github.com/britter
> > >
> >
>
>
>
> --
> E-Mail: garydgregory@gmail.com | ggregory@apache.org
> Java Persistence with Hibernate, Second Edition<
> http://www.manning.com/bauer3/>
> JUnit in Action, Second Edition <http://www.manning.com/tahchiev/>
> Spring Batch in Action <http://www.manning.com/templier/>
> Blog: http://garygregory.wordpress.com
> Home: http://garygregory.com/
> Tweet! http://twitter.com/GaryGregory
>



-- 
http://people.apache.org/~britter/
http://www.systemoutprintln.de/
http://twitter.com/BenediktRitter
http://github.com/britter

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message