commons-dev mailing list archives

From Henri Yandell <>
Subject [csv] feature and performance analysis
Date Sun, 15 Jan 2006 05:31:01 GMT
Spent a little time over the last week doing both performance and
feature set analysis of the 5 open-source CSV libraries that I'm aware
of.

First up, feature sets:

Secondly, performance. The code/data is sitting in:

Run under 1.5 because the Ostermiller library is compiled for 1.5 [grumble].


Take these with a grain of salt; they aren't quite the pattern I was
seeing when running a few days ago on the plane. Ideally this needs to
be run multiple times, taking the median or some such.
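
A minimal sketch of the "run multiple times and take the median" idea,
in case it helps. The MedianTimer class and its method names are
hypothetical and are not part of the benchmark code referred to above;
the task being timed is whatever parse or print call you want to
measure.

```java
import java.util.Arrays;

// Hypothetical helper, not taken from the benchmark code discussed above.
final class MedianTimer {

    /** Median of the samples; for an even count, the midpoint of the two middle values. */
    static long median(long[] samples) {
        if (samples.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        long[] sorted = samples.clone();   // leave the caller's array untouched
        Arrays.sort(sorted);
        int mid = sorted.length / 2;
        if (sorted.length % 2 == 1) {
            return sorted[mid];
        }
        // Written this way to avoid overflow when adding two large values.
        return sorted[mid - 1] + (sorted[mid] - sorted[mid - 1]) / 2;
    }

    /** Runs the task a few times untimed, then times it 'runs' times and returns the median in nanoseconds. */
    static long medianNanos(Runnable task, int warmups, int runs) {
        for (int i = 0; i < warmups; i++) {
            task.run();                    // untimed warm-up so JIT compilation skews the samples less
        }
        long[] samples = new long[runs];
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            task.run();
            samples[i] = System.nanoTime() - start;
        }
        return median(samples);
    }
}
```

Wrapping each library's parse or print call in a Runnable and comparing
the medians should be less sensitive to a single noisy run than one
timing per library.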

Generally, Ostermiller is fastest for parsing, with Skife then Open a
chunk behind. GJ a little behind them and Commons lagging by a lot.

On printing, Open edges Skife, GJ a bit behind, Commons and then
Ostermiller lagging a lot.


What did I learn from this?

* Lots of features out there, no library contains all of them. I don't
think that many of them are mutually exclusive.

* Ostermiller's parser is very quick. Possibly because he's built on
top of JFlex? The printer is very, very slow.

* The current Commons parser is very slow. As is the printer.

* I also did some bug checking. Given that I only had 7 quick lines of
pain, I don't think any parser managed to parse them with much
success.

Mostly, that despite the odd looks I get when I mention having a
commons-csv (people think they're dumb simple things), we all have
lots of room for improvement.


So what next? Are the poor performance stats for Commons-CSV a worry?
Are they offset enough by having more features? Should we look into a
lexical tool approach as it seems to work for Ostermiller?

Class-wise, I'd like to see something like:

Csv         (instead of using String[][] or List of String[])
CsvStrategy  (used by both printer and parser)
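
To make the shape a bit more concrete, here is a rough sketch of how
those pieces might fit together. Everything beyond the names Csv and
CsvStrategy is an assumption for illustration: the fields, the methods,
and the CsvParser and CsvPrinter classes are hypothetical, and the
strategy here covers only a delimiter and a quote character rather than
the full feature list.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical: dialect settings shared by the parser and the printer. */
final class CsvStrategy {
    static final CsvStrategy DEFAULT = new CsvStrategy(',', '"');
    final char delimiter;
    final char quote;
    CsvStrategy(char delimiter, char quote) { this.delimiter = delimiter; this.quote = quote; }
}

/** Hypothetical: a parsed document, used instead of String[][] or a List of String[]. */
final class Csv {
    private final List<List<String>> rows = new ArrayList<List<String>>();
    void addRow(List<String> row) { rows.add(new ArrayList<String>(row)); }
    int rowCount() { return rows.size(); }
    List<String> row(int i) { return Collections.unmodifiableList(rows.get(i)); }
}

final class CsvPrinter {
    /** Writes one line per row, quoting any field that contains the delimiter, the quote or a line break. */
    static String print(Csv csv, CsvStrategy s) {
        StringBuilder out = new StringBuilder();
        for (int r = 0; r < csv.rowCount(); r++) {
            List<String> row = csv.row(r);
            for (int c = 0; c < row.size(); c++) {
                if (c > 0) out.append(s.delimiter);
                String v = row.get(c);
                boolean quoted = v.indexOf(s.delimiter) >= 0 || v.indexOf(s.quote) >= 0
                        || v.indexOf('\n') >= 0 || v.indexOf('\r') >= 0;
                if (!quoted) { out.append(v); continue; }
                out.append(s.quote);
                for (int i = 0; i < v.length(); i++) {
                    char ch = v.charAt(i);
                    if (ch == s.quote) out.append(s.quote);   // escape a quote by doubling it
                    out.append(ch);
                }
                out.append(s.quote);
            }
            out.append('\n');
        }
        return out.toString();
    }
}

final class CsvParser {
    /** Small hand-written state machine; bare '\r' outside quotes is dropped, so "\r\n" ends a row. */
    static Csv parse(String text, CsvStrategy s) {
        Csv csv = new Csv();
        List<String> row = new ArrayList<String>();
        StringBuilder field = new StringBuilder();
        boolean inQuotes = false;
        boolean pending = false;   // true once the current row has seen any input
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (inQuotes) {
                if (c != s.quote) {
                    field.append(c);
                } else if (i + 1 < text.length() && text.charAt(i + 1) == s.quote) {
                    field.append(s.quote);   // doubled quote is a literal quote
                    i++;
                } else {
                    inQuotes = false;
                }
            } else if (c == s.quote) {
                inQuotes = true;
                pending = true;
            } else if (c == s.delimiter) {
                row.add(field.toString());
                field.setLength(0);
                pending = true;
            } else if (c == '\n') {
                row.add(field.toString());
                field.setLength(0);
                csv.addRow(row);
                row = new ArrayList<String>();
                pending = false;
            } else if (c != '\r') {
                field.append(c);
                pending = true;
            }
        }
        if (pending) {             // last row without a trailing newline
            row.add(field.toString());
            csv.addRow(row);
        }
        return csv;
    }
}
```

The point of the sketch is only that the parser and the printer take
the same CsvStrategy, so a dialect is described once and a parse/print
round trip stays consistent. A lexer-generated parser could sit behind
the same signature.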

I don't see any reason to not want every feature in the feature file.

That's it for the night. If you're not on commons-dev, mail to join in. Cc's are unlikely
to last too long on a thread. Make sure you keep the [csv] prefix on
replies and future emails; it's a useful way of separating the
components out.

