Mailing-List: contact dev-help@commons.apache.org; run by ezmlm
Precedence: bulk
Reply-To: "Commons Developers List" <dev@commons.apache.org>
Received-SPF: pass (nike.apache.org: domain of sebbaz@gmail.com designates
 74.125.82.179 as permitted sender)
MIME-Version: 1.0
In-Reply-To: 
 <CAB917RLvJy2LyxT_bpFBjXVW0xYd5nwLCKpCfp9m9EZWYfUnDA@mail.gmail.com>
References: 
 <CAB917RLiRG5yo+66yoNumqoERaxBgJ7hGkw7xwvbAwDJZvKFMQ@mail.gmail.com>
	<CAOGo0VbJW+b9xp9ZVs1e=Ydn8zwWoe85ugyKk3kW099OGATp0g@mail.gmail.com>
	<CAB917RK2mSUR9HtJW=KmxEgTRTDp-G_1fMsWdwa3StO=_haX1Q@mail.gmail.com>
	<CAOGo0VaSpJp3eoYNM3v55P4aDKnOQgo61qw=wwFHSPyFxQRtKw@mail.gmail.com>
	<CAOGo0VYjznfEFuo4TbjsgsihzBZfa-=x6FrsLZ1doeXgG7HE8Q@mail.gmail.com>
	<CAB917R+sb2rdxhZ=JwQ_Dc46WPGDh_MbM4UAmMkOd6=8-kV4cQ@mail.gmail.com>
	<CAOGo0Va78PDLUhwA=J8qQz3JQKQn+O6M+tKTL7MGavW38zRAEA@mail.gmail.com>
	<CACZkXPzv2i2wXrpXTaNNz9pF=yhiZFHn1eXWohQPLLjr68i8Zg@mail.gmail.com>
	<CAB917RLvJy2LyxT_bpFBjXVW0xYd5nwLCKpCfp9m9EZWYfUnDA@mail.gmail.com>
Date: Mon, 6 May 2013 16:16:49 +0100
Message-ID: 
 <CAOGo0VbmC04_wKsHdNG0gvXkrmJUtrtpEAK1ZWVTSzWFpx2RWQ@mail.gmail.com>
Subject: Re: [CSV] CSV-58 is killing me
From: sebb <sebbaz@gmail.com>
To: Commons Developers List <dev@commons.apache.org>
Content-Type: multipart/alternative; boundary=047d7b5d2a78994ea404dc0e30f3

--047d7b5d2a78994ea404dc0e30f3
Content-Type: text/plain; charset=ISO-8859-1

I think I've fixed most of the issues.

I'm not entirely sure that FF, TAB and BACKSPACE should be handled the same
way as CR and LF.
The latter are special because they are EOL characters; as far as I can
tell the former don't need to be escaped, so they should not be treated
specially when unescaping.

In which case some tests are wrong (indicated by TODO).

Need more test cases to cover quotes and commentStart.


On 2 May 2013 21:32, Benedikt Ritter <britter@apache.org> wrote:

> 2013/5/2 Gary Gregory <garydgregory@gmail.com>
>
> > (top-posting cause it's getting hard to find stuff)
> >
> > >Should CSV unconditionally recognise \r \n etc as CR LF?
> > >If so, then they don't need a ! escape beforehand.
> >
> > There is ONE escape character, so if it is '!', then "\r " is just "\r".
>
>
> > If we want >1 escape character, then that's a different story. Right now,
> > we have one.
> >
>
> Sorry guys, I got the whole story wrong, then. I made the assumption that
> we always represent CR as \r no matter what escape character we use.
>
> thanks for clarifying.
>
>
> >
> > Gary
> >
> >
> > On Thu, May 2, 2013 at 4:13 PM, sebb <sebbaz@gmail.com> wrote:
> >
> > > On 2 May 2013 18:26, Benedikt Ritter <britter@apache.org> wrote:
> > >
> > > > 2013/5/2 sebb <sebbaz@gmail.com>
> > > >
> > > > > On 2 May 2013 14:56, sebb <sebbaz@gmail.com> wrote:
> > > > >
> > > > > > On 2 May 2013 09:48, Benedikt Ritter <britter@apache.org> wrote:
> > > > > >
> > > > > >> 2013/5/1 sebb <sebbaz@gmail.com>
> > > > > >>
> > > > > >> > On 1 May 2013 08:53, Benedikt Ritter <britter@apache.org>
> > wrote:
> > > > > >> >
> > > > > >> > > Hi,
> > > > > >> > >
> > > > > >> > > I have tried to solve CSV-58 - Escape handling needs
> > rethinking
> > > > [1]
> > > > > a
> > > > > >> few
> > > > > >> > > times over the past weeks now. But I can not see a way to
> get
> > > this
> > > > > >> > working.
> > > > > >> > > There are two problems with our current implementation:
> > > > > >> > >
> > > > > >> > > 1. a sequence of " test \a test" will be parsed to "test a
> > > test",
> > > > > >> > although
> > > > > >> > > there is no reason to escape 'a'. The expected behavior is
> to
> > > let
> > > > > the
> > > > > >> > > sequence \a unchanged.
> > > > > >> > >
> > > > > >> >
> > > > > >> > An alternative would be to throw an Exception.
> > > > > >> >
> > > > > >>
> > > > > >> Haven't thought about that alternative so far. It is was the
> java
> > > > > compiler
> > > > > >> would do if it finds "\a" in a String.
> > > > > >> If we decide to through an exception, the message should state
> > very
> > > > > >> clearly
> > > > > >> what has gone wrong, including the position of the misplaced
> > escape
> > > > > char.
> > > > > >> I'd like to hear more opinions on this.
> > > > > >>
> > > > > >>
> > > > > > Throwing an exception stops the parsing, and obviously only
> catches
> > > the
> > > > > > first error.
> > > > > > Leaving the sequences as is means potentially having to scan a
> > large
> > > > > > output file to find them all.
> > > > > >
> > > > > > There's another option: invoke a callback when any invalid
> > sequences
> > > > are
> > > > > > detected.
> > > > > >
> > > > > > That would allow for various possibilities, depending on what
> > > callback
> > > > > was
> > > > > > provided.
> > > > > > - throw Exception with details
> > > > > > - ignore (with optional log)
> > > > > > - replace with some other sequence (with optional log)
> > > > > >
> > > > > > I don't think CSV should do any logging of its own, but the
> caller
> > > can
> > > > do
> > > > > > so in the callback
> > > > > >
> > > > > > However we would need to decide what the default callback should
> > do.
> > > > > > If we throw an Exception, then it is obvious that something is
> > wrong.
> > > > > > If we ignore the sequence (leave as is) then there is no easy way
> > for
> > > > the
> > > > > > caller to know if there have been any errors.
> > > > > >
> > > > > >
> > > > > >> >
> > > > > >> >
> > > > > >> > > 2. If the escape char is different from backslash (for
> example
> > > > > '!'), a
> > > > > >> > > sequence of "test !r test" will be parsed to "test CR test"
> > > > > >> > >
> > > > > >> > >
> > > > > >> > Is that how other CSV parsers behave? i.e. do they always use
> \
> > > for
> > > > > >> > escaping EOL?
> > > > > >> >
> > > > > >>
> > > > > >> Haven't tested this. I'll try to have a look this weekend.
> > > > > >>
> > > > > >> The problem with the current impl is a backslash as part of an
> > > escape
> > > > > >> sequence.
> > > > > >> This is what readEscape() currently does: It checks if the lexer
> > has
> > > > > >> reached a character combination of (for example) \r and replaces
> > > this
> > > > > with
> > > > > >> CR. Otherwise it swallows the escape character and returns the
> > next
> > > > > >> character.
> > > > > >>
> > > > > >> There are two problems with this:
> > > > > >> 1. it does so if it recognizes the escape character (which
> doesn't
> > > > have
> > > > > to
> > > > > >> be a backslash). This is why !r gets replaced by CR.
> > > > > >> 2. it swallows the escape and returns the next character
> (default
> > > > branch
> > > > > >> in
> > > > > >> the switch statement). This should only happen if the next
> > character
> > > > is
> > > > > >> one
> > > > > >> of the special characters from the format (e.g. the delimiter).
> > This
> > > > is
> > > > > >> why
> > > > > >> '\a' gets replaced by 'a'
> > > > > >>
> > > > > >
> > > > > > I think that's wrong, because one cannot distinguish "\a" from
> a".
> > > > > >
> > > > > > The readEscape() method currently returns an int; this cannot be
> > used
> > > > to
> > > > > > distinguish between a valid escape and some other character.
> > > > > > If we only want to support throwing an Exception for invalid
> escape
> > > > > > sequence, then it's trivial to fix this - just throw the
> Exception.
> > > > > >
> > > > > > However, to support "as is" passthrough of invalid sequences, the
> > > > caller
> > > > > > needs to know when to restore the escape character to the output.
> > > > > >
> > > > > > One way to do this would be to change the return type to a char[]
> > > > array.
> > > > > > This would either be a single char or the escape followed by a
> > single
> > > > > char.
> > > > > > The return array could just be appended to the buffer. This
> should
> > > work
> > > > > > quite well with a call-back, as the callback could also return a
> > char
> > > > > array.
> > > > > >
> > > > > > The only downside I can see is that append(char[]) may be
> slightly
> > > > slower
> > > > > > than append(char).
> > > > > >
> > > > > >
> > > > > Found another way to do it: return -1 (EOF) which cannot otherwise
> be
> > > > > returned by the method.
> > > > > The last character is then accessible via in.getLastChar().
> > > > > i.e. if -1 is returned, output escape followed by lastChar.
> > > > >
> > > > >
> > > > > Needs some tweaks to readEscape, because \LF etc. stand for
> > themselves,
> > > > > i.e. both \r and \CR => CR.
> > > > >
> > > > > What needs to be decided is whether both !r and !CR => CR if the
> > escape
> > > > > character is ! rather than \, and if so, whether \CR still stands
> for
> > > CR.
> > > > >
> > > > > But I think the parsing can be tweaked without too much rewriting.
> > > > >
> > > >
> > > > I think there are four cases to cover (I'm using ! as escape in this
> > > > example):
> > > > 1. !r => nothing to escape, append '!r'
> > > >
> > >
> > > Not sure I agree with that; currently \r => CR if \ is the escape.
> > >
> > >
> > > > 2. !\r => the CR character in its character representation. Replace
> > '\r'
> > > by
> > > > CR and escape it. continue parsing of the token
> > > >
> > >
> > > Not sure I agree here either. I think that's an invalid sequence
> unless \
> > > is one of the meta-characters.
> > >
> > > Should CSV unconditionally recognise \r \n etc as CR LF?
> > > If so, then they don't need a ! escape beforehand.
> > >
> > >
> > > > 3. !CR => esacpe CR and contine parsing of the token
> > > >
> > >
> > > !CR should be replaced by CR, i.e. *un*escaped
> > >
> > > 4. !\CR => escape '\' if it is one of the characters used in the
> format,
> > in
> > > > other words: remove the '!' and append the '\'. Then EOL => next
> > record,
> > > > but only if CR is the record separator (?).
> > > >
> > >
> > >
> > > > I've tried several times and failed. Do you have the time to
> implement
> > > your
> > > > proposal? I'll probably have some time this weekend to try once again
> > :-)
> > > >
> > > >
> > > Yes, I think I can make a start on amending readEscape.
> > >
> > > However that won't get us anywhere until the rules are clear; I don't
> > think
> > > they are currently.
> > >
> > > It should be possible to round-trip escaping then unescaping.
> > >
> > > It may be easier to consider all possible escaped output to decide how
> to
> > > unescape, and what is not a valid sequence.
> > >
> > >
> > > >
> > > > >
> > > > >
> > > > > > An alternative is to throw an exception.
> > > > > >>
> > > > > >> I think it may be necessary to separate backslash handling from
> > > escape
> > > > > >> handling.
> > > > > >>
> > > > > >>
> > > > > >>
> > > > > > Possibly. However that may make parsing harder.
> > > > > >
> > > > > >
> > > > > >> >
> > > > > >> >
> > > > > >> > > I have tried different approaches to handle this issue but
> > there
> > > > are
> > > > > >> > always
> > > > > >> > > corner cases that I'm unable to cover. I'm getting the
> feeling
> > > > that
> > > > > >> I'm
> > > > > >> > > just to dump to solve this, so any comments or suggestions
> > would
> > > > be
> > > > > >> > really
> > > > > >> > > appreciated!
> > > > > >> > >
> > > > > >> > >
> > > > > >> > Perhaps it would help to commit the unit tests you are working
> > to
> > > > > solve.
> > > > > >> > The tests can be disabled.
> > > > > >> >
> > > > > >> > Or perhaps better use a class whose name is not one of the
> ones
> > > > picked
> > > > > >> up
> > > > > >> > by Surefire.
> > > > > >> > This allows the test to be run independently by Eclipse (and
> > > Maven)
> > > > > but
> > > > > >> > won't mess with CI builds.
> > > > > >> >
> > > > > >>
> > > > > >> I have already committed three tests to CSVLexerTest and set
> them
> > to
> > > > be
> > > > > >> ignored. Maybe you can review those and tell me whether my
> > > assumptions
> > > > > are
> > > > > >> correct.
> > > > > >>
> > > > > >>
> > > > > >> >
> > > > > >> > Benedikt
> > > > > >> > > [1] https://issues.apache.org/jira/browse/CSV-58
> > > > > >> > >
> > > > > >> > > --
> > > > > >> > > http://people.apache.org/~britter/
> > > > > >> > > http://www.systemoutprintln.de/
> > > > > >> > > http://twitter.com/BenediktRitter
> > > > > >> > > http://github.com/britter
> > > > > >> > >
> > > > > >> >
> > > > > >>
> > > > > >>
> > > > > >>
> > > > > >> --
> > > > > >> http://people.apache.org/~britter/
> > > > > >> http://www.systemoutprintln.de/
> > > > > >> http://twitter.com/BenediktRitter
> > > > > >> http://github.com/britter
> > > > > >>
> > > > > >
> > > > > >
> > > > >
> > > >
> > > >
> > > >
> > > > --
> > > > http://people.apache.org/~britter/
> > > > http://www.systemoutprintln.de/
> > > > http://twitter.com/BenediktRitter
> > > > http://github.com/britter
> > > >
> > >
> >
> >
> >
> > --
> > E-Mail: garydgregory@gmail.com | ggregory@apache.org
> > Java Persistence with Hibernate, Second Edition<
> > http://www.manning.com/bauer3/>
> > JUnit in Action, Second Edition <http://www.manning.com/tahchiev/>
> > Spring Batch in Action <http://www.manning.com/templier/>
> > Blog: http://garygregory.wordpress.com
> > Home: http://garygregory.com/
> > Tweet! http://twitter.com/GaryGregory
> >
>
>
>
> --
> http://people.apache.org/~britter/
> http://www.systemoutprintln.de/
> http://twitter.com/BenediktRitter
> http://github.com/britter
>

--047d7b5d2a78994ea404dc0e30f3--