commons-dev mailing list archives

From Rory Winston <rwins...@eircom.net>
Subject Re: [csv] feature and performance analysis
Date Sun, 15 Jan 2006 10:43:12 GMT
Some good points there. A lexer-based CSV implementation may be a good
approach. There are a couple of helpful pointers/references here:

http://www.ricebridge.com/products/csvman/reference.htm
http://www.boyet.com/Articles/CsvParser.html
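
To make the lexer idea concrete, here is a minimal sketch of a hand-written state machine that splits a single CSV line into fields. It is not taken from any of the libraries discussed, and it is not what a generated (e.g. JFlex) lexer would look like. The class name CsvLineLexerSketch, the split method, and the lenient handling of text after a closing quote are assumptions made purely for illustration. It also ignores embedded newlines and configurable delimiters.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: splits one CSV line into fields with a small state machine. */
final class CsvLineLexerSketch {

    private enum State { FIELD_START, UNQUOTED, QUOTED, QUOTE_IN_QUOTED }

    /** Assumes ',' as delimiter and '"' as quote; a doubled quote inside quotes is a literal quote. */
    static List<String> split(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        State state = State.FIELD_START;

        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            switch (state) {
                case FIELD_START:
                    if (c == '"') {
                        state = State.QUOTED;          // opening quote
                    } else if (c == ',') {
                        fields.add("");                // empty field
                    } else {
                        current.append(c);
                        state = State.UNQUOTED;
                    }
                    break;
                case UNQUOTED:
                    if (c == ',') {
                        fields.add(current.toString());
                        current.setLength(0);
                        state = State.FIELD_START;
                    } else {
                        current.append(c);
                    }
                    break;
                case QUOTED:
                    if (c == '"') {
                        state = State.QUOTE_IN_QUOTED; // closing quote or first half of ""
                    } else {
                        current.append(c);             // delimiters are plain text here
                    }
                    break;
                case QUOTE_IN_QUOTED:
                    if (c == '"') {
                        current.append('"');           // "" inside quotes -> one literal quote
                        state = State.QUOTED;
                    } else if (c == ',') {
                        fields.add(current.toString());
                        current.setLength(0);
                        state = State.FIELD_START;
                    } else {
                        current.append(c);             // lenient: keep text after a closing quote
                        state = State.UNQUOTED;
                    }
                    break;
            }
        }
        if (state == State.QUOTED) {
            throw new IllegalArgumentException("unterminated quoted field");
        }
        fields.add(current.toString());                // flush the last field
        return fields;
    }
}
```

Calling CsvLineLexerSketch.split on the line a,"x,y", should yield three fields: a, then x,y, then an empty string. A real implementation would read from a Reader and make the delimiter and quote characters configurable.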

Did you run the Commons::CSV component through a profiling process?

Rory

Henri Yandell wrote:

>Spent a little time over the last week doing both performance and
>feature set analysis of the 5 open-source CSV libraries that I'm aware
>of.
>
>First up, feature sets:
>
>http://people.apache.org/~bayard/commons-csv/csv-features.xhtml
>
>Secondly, performance. The code/data is sitting in:
>
>http://people.apache.org/~bayard/commons-csv/csv-perf/
>
>Run under 1.5 because the Ostermiller library is compiled for 1.5 [grumble].
>
>Results
>
>http://people.apache.org/~bayard/commons-csv/csv-perf/results.csv
>
>Take these with a grain of salt; they aren't quite the pattern I was
>seeing when running a few days ago on the plane. Ideally this needs to
>run multiple times and take the median or some such.
>
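On the "run multiple times and take the median" point, a minimal sketch of what that could look like follows. It is not the harness in csv-perf; the class MedianTimerSketch, its method names, and the warm-up parameter are hypothetical, and the task being timed (for example, one full parse of a test file) is left to the caller.

```java
import java.util.Arrays;

/** Hypothetical sketch: time a task several times and report the median run. */
final class MedianTimerSketch {

    /** Median of the samples; for an even count, the midpoint of the two middle values (rounded down). */
    static long median(long[] samples) {
        if (samples.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        long[] sorted = samples.clone();   // leave the caller's array untouched
        Arrays.sort(sorted);
        int mid = sorted.length / 2;
        if (sorted.length % 2 == 1) {
            return sorted[mid];
        }
        return sorted[mid - 1] + (sorted[mid] - sorted[mid - 1]) / 2;
    }

    /** Runs the task warmupRuns times untimed, then timedRuns times, and returns the median in nanoseconds. */
    static long medianNanos(Runnable task, int warmupRuns, int timedRuns) {
        for (int i = 0; i < warmupRuns; i++) {
            task.run();                    // give the JIT a chance to compile hot paths
        }
        long[] samples = new long[timedRuns];
        for (int i = 0; i < timedRuns; i++) {
            long start = System.nanoTime();
            task.run();
            samples[i] = System.nanoTime() - start;
        }
        return median(samples);
    }
}
```

The median is less sensitive than the mean to a single slow run caused by garbage collection or other activity on the machine, which may be part of why the numbers shifted between runs.
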
>Generally, Ostermiller is fastest for parsing, with Skife then Open a
>chunk behind. GJ a little behind them and Commons lagging by a lot.
>
>On printing, Open edges Skife, GJ a bit behind, Commons and then
>Ostermiller lagging a lot.
>
>-=-=-=-=-=-=-=-=-=-=-=-=-=-
>
>What did I learn from this?
>
>* Lots of features out there, no library contains all of them. I don't
>think that many of them are mutually exclusive.
>
>* Ostermiller's parser is very quick. Possibly because he's built on
>top of JFlex? The printer is very, very slow.
>
>* The current Commons parser is very slow. As is the printer.
>
>* I also did some bug checking. Even with only 7 quick lines of
>painful input, I don't think any parser managed to parse them with much
>success: http://people.apache.org/~bayard/commons-csv/csv-perf/dependability.csv
>
>Mostly, that despite the odd looks I get when I mention having a
>commons-csv (people think they're dumb simple things), we all have
>lots of room for improvement.
>
>-----
>
>So what next? Are the poor performance stats for Commons-CSV a worry?
>Are they offset enough by having more features? Should we look into a
>lexical tool approach as it seems to work for Ostermiller?
>
>Class-wise, I'd like to see something like:
>
>Csv         (instead of using String[][] or List of String[])
>CsvPrinter
>CsvParser
>CsvException
>CsvStrategy  (used by both printer and parser)
>
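One possible reading of the class list above, as a rough sketch only. Nothing here is the actual Commons CSV API: every field, constructor, and method signature is an assumption, the parser is left as an interface, and the printer handles only the basic quoting cases.

```java
import java.io.IOException;
import java.io.Reader;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical: settings shared by both the parser and the printer. */
final class CsvStrategy {
    static final CsvStrategy DEFAULT = new CsvStrategy(',', '"');
    final char delimiter;
    final char quote;

    CsvStrategy(char delimiter, char quote) {
        this.delimiter = delimiter;
        this.quote = quote;
    }
}

/** Hypothetical: raised for malformed input. */
class CsvException extends Exception {
    CsvException(String message) {
        super(message);
    }
}

/** Hypothetical: row container used instead of String[][] or a List of String[]. */
final class Csv {
    private final List<List<String>> rows = new ArrayList<>();

    void addRow(List<String> row) {
        rows.add(new ArrayList<>(row));   // defensive copy
    }

    List<List<String>> rows() {
        return Collections.unmodifiableList(rows);
    }
}

/** Hypothetical: parsing is left unimplemented in this sketch. */
interface CsvParser {
    Csv parse(Reader in, CsvStrategy strategy) throws IOException, CsvException;
}

/** Hypothetical: writes a Csv back out, quoting only the fields that need it. */
final class CsvPrinter {
    private final CsvStrategy strategy;

    CsvPrinter(CsvStrategy strategy) {
        this.strategy = strategy;
    }

    String print(Csv csv) {
        StringBuilder out = new StringBuilder();
        for (List<String> row : csv.rows()) {
            for (int i = 0; i < row.size(); i++) {
                if (i > 0) {
                    out.append(strategy.delimiter);
                }
                out.append(escape(row.get(i)));
            }
            out.append('\n');
        }
        return out.toString();
    }

    private String escape(String field) {
        boolean needsQuotes = field.indexOf(strategy.delimiter) >= 0
                || field.indexOf(strategy.quote) >= 0
                || field.indexOf('\n') >= 0
                || field.indexOf('\r') >= 0;
        if (!needsQuotes) {
            return field;
        }
        String q = String.valueOf(strategy.quote);
        return q + field.replace(q, q + q) + q;   // double any embedded quote characters
    }
}
```

Sharing one CsvStrategy between the printer and the parser is what would let a file written with one set of delimiter and quote settings be read back with the same settings.
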
>I don't see any reason to not want every feature in the feature file.
>
>That's it for the night. If you're not on commons-dev, mail
>commons-dev-subscribe@jakarta.apache.org to join in. Cc's are unlikely
>to last too long on a thread. Make sure you keep the [csv] tag on these
>emails and future ones; it's a useful way of separating the components out.
>
>Hen
>
>---------------------------------------------------------------------
>To unsubscribe, e-mail: commons-dev-unsubscribe@jakarta.apache.org
>For additional commands, e-mail: commons-dev-help@jakarta.apache.org

