commons-dev mailing list archives

From sebb
Subject Re: [csv] Improving readability in CSVLexer
Date Fri, 16 Mar 2012 15:44:09 GMT
On 16 March 2012 12:33, Benedikt Ritter wrote:
> Hey,
> I'm thinking of ways to improve the readability of CSVLexer. I think
> that it might be easier to improve performance if the code is easier
> to understand. Here is what I think can be improved:
> 1. eliminate Token input parameter on nextToken()
> To me it looks like the token input parameter on nextToken() has the
> purpose of sparing object creation. How about a private field
> 'currentToken' that can be reused.

As Ted Dunning said, it's probably not worth worrying about the
additional object creation.

As it stands, the "reusableToken" class field means that the CSVParser
class is not thread-safe.

If the "reusableToken" and "record" class fields were moved into the
getRecord() method, the class would then be thread-safe.
(Assuming that the Lexer is thread-safe).
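A minimal sketch of what moving that state into the method could look like. The class name LocalStateSketch and its simplified getRecord(String) signature are hypothetical and are not the real CSVParser API; the real method reads from the shared lexer, which is why the caveat about the Lexer still applies.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch, not the real CSVParser: all per-call state is local. */
final class LocalStateSketch {
    // Immutable configuration is safe to share between threads.
    private final char delimiter;

    LocalStateSketch(char delimiter) {
        this.delimiter = delimiter;
    }

    /** Splits one line; 'record' and 'token' exist only for this call. */
    String[] getRecord(String line) {
        List<String> record = new ArrayList<>();   // stands in for the "record" field
        StringBuilder token = new StringBuilder(); // stands in for "reusableToken"
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c == delimiter) {
                record.add(token.toString());
                token.setLength(0); // reuse the buffer within this call only
            } else {
                token.append(c);
            }
        }
        record.add(token.toString()); // last field has no trailing delimiter
        return record.toArray(new String[0]);
    }
}
```

With this shape, new LocalStateSketch(',').getRecord("a,b,,c") returns four fields, and two threads calling getRecord() on the same instance do not share any mutable state.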

These fields are only used by the getRecord() method, so (if kept) it
would be sensible to try and localise them so they are only visible to
that method.

One way to do this would be to make the CSVParser class abstract, and
move the getRecord() method into an implementation class.

> No method parameters are better than one method parameter :)

Not always ...

> 2. add additional convenience methods
> Right now we have some methods for char handling like isEndOfFile(c).
> There are some methods missing like isDelimiter(c) or
> isEncapsulator(c). There is not much to say about this. I just think
> that isDelimiter(c) is slightly easier to understand than c ==
> format.getDelimiter().

Agreed; also such methods can check if the item is disabled.
And the server JVM will inline them as necessary.
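As a rough illustration of such helpers, here is a sketch in which the encapsulator check also handles the disabled case. The class name CharTestsSketch, the DISABLED sentinel value and the field names are assumptions made for this example, not the actual lexer code.

```java
/** Hypothetical sketch; DISABLED and the field names are assumptions. */
final class CharTestsSketch {
    static final char DISABLED = '\ufffe'; // assumed "feature switched off" marker
    static final int END_OF_STREAM = -1;   // value Reader.read() returns at EOF

    private final char delimiter;
    private final char encapsulator;

    CharTestsSketch(char delimiter, char encapsulator) {
        this.delimiter = delimiter;
        this.encapsulator = encapsulator;
    }

    boolean isDelimiter(int c) {
        return c == delimiter;
    }

    /** Never matches when the encapsulator is disabled. */
    boolean isEncapsulator(int c) {
        return encapsulator != DISABLED && c == encapsulator;
    }

    boolean isEndOfFile(int c) {
        return c == END_OF_STREAM;
    }
}
```

The call sites then read as lexer.isEncapsulator(c) rather than repeating the comparison and the disabled check each time.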

> 3. eliminate input parameter c on readEscape (and rename it ?)
> Right now we have to pass an int to readEscape, but the method does
> not use that parameter. So why do we keep it? Also the method does not
> really "read" an escape. It assumes that it is called after a "/" and
> then returns the delimiter for a letter.

In theory I agree, but I think there's a problem with escape
processing - see CSV-58.
To fix this, we would sometimes need to retain the escape character.
It might be necessary for the method to return a different value
depending on whether the escape is backslash or not.

But if the parameter turns out not to be needed, let's drop it.

> 4. Get rid of those nasty while(true) loops!
> There are several while true loops. It is really hard to see what is
> going on, because you can not exactly see when a loop ends. The worst
> example for this is encapsulatedTokenLexer. It has an outer
> while(true) loop with a nested inner loop, that may return a token,
> terminating both loops.
> I've tried to eliminate those while true loops, but without success.

The outer loop could probably be replaced by

while(( != -1) {

// now check if EOF detected partway through a sequence
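To make the shape of that loop concrete, here is a simplified sketch. The class EncapsulatedLoopSketch and the method readEncapsulated are hypothetical, and the sketch deliberately ignores doubled quotes, escapes and trailing whitespace, all of which the real encapsulatedTokenLexer has to deal with.

```java
import java.io.IOException;
import java.io.Reader;

/** Hypothetical sketch of a read-until-EOF loop; not the real lexer code. */
final class EncapsulatedLoopSketch {

    /** Reads a quoted value; assumes the opening quote was already consumed. */
    static String readEncapsulated(Reader in, char quote) throws IOException {
        StringBuilder sb = new StringBuilder();
        int c;
        // The loop condition states the normal exit: the input has run out.
        while ((c = in.read()) != -1) {
            if (c == quote) {
                return sb.toString(); // closing quote ends the token
            }
            sb.append((char) c);
        }
        // Falling out of the loop means EOF arrived partway through the token.
        throw new IOException("EOF reached before encapsulated token finished");
    }
}
```

The early return is still there, but the loop header now shows when the loop ends, and the EOF error case sits in one place after it.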

> If no one objects, I'd like to create patches for 1. & 2. I leave 3.
> and 4. for discussion...
> Regards,
> Benedikt