Message-ID: <31cc37360601142131h6e6b0afavd37743eb5d4ae926@mail.gmail.com>
Date: Sun, 15 Jan 2006 00:31:01 -0500
From: Henri Yandell
To: Jakarta Commons Developers List
Subject: [csv] feature and performance analysis
Cc: "Sean C. Sullivan", Steven Caswell, Brian McCallister, Glen Smith, Stefan Rufer, Urs Hardegger

Spent a little time over the last week doing both performance and feature set analysis of the 5 open-source CSV libraries that I'm aware of.

First up, feature sets:

http://people.apache.org/~bayard/commons-csv/csv-features.xhtml

Secondly, performance. The code/data is sitting in:

http://people.apache.org/~bayard/commons-csv/csv-perf/

Run under Java 1.5 because the Ostermiller library is compiled for 1.5 [grumble].

Results:

http://people.apache.org/~bayard/commons-csv/csv-perf/results.csv

Take with a grain of salt; these aren't quite the pattern I was seeing when running a few days ago on the plane. Ideally this needs to run multiple times and take the median or some such.

Generally, Ostermiller is fastest for parsing, with Skife then Open a chunk behind, GJ a little behind them, and Commons lagging by a lot. On printing, Open edges Skife, GJ is a bit behind, and Commons and then Ostermiller lag a lot.

-=-=-=-=-=-=-=-=-=-=-=-=-=-

What did I learn from this?

* Lots of features out there; no library contains all of them. I don't think that many of them are mutually exclusive.

* Ostermiller's parser is very quick. Possibly because he's built on top of JFlex? The printer is very, very slow.

* The current Commons parser is very slow. As is the printer.

* I also did some bug checking.
  Given that I only had 7 quick lines of pain, I don't think any parser managed to parse them with much success:

http://people.apache.org/~bayard/commons-csv/csv-perf/dependability.csv

Mostly, that despite the odd looks I get when I mention having a commons-csv (people think they're dumb simple things), we all have lots of room for improvement.

-----

So what next?

Are the poor performance stats for Commons-CSV a worry? Are they offset enough by having more features?

Should we look into a lexical tool approach, as it seems to work for Ostermiller?

Class-wise, I'd like to see something like:

Csv (instead of using String[][] or List of String[])
CsvPrinter
CsvParser
CsvException
CsvStrategy (used by both printer and parser)

I don't see any reason to not want every feature in the feature file.

That's it for the night. If you're not on commons-dev, mail commons-dev-subscribe@jakarta.apache.org to join in. Cc's are unlikely to last too long on a thread. Make sure you keep the [csv] on the emails and future ones; it's a useful way of separating the components out.

Hen
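
On the benchmarking remark above (run multiple times and take the median): a minimal sketch of what that could look like, assuming each library's parse or print call can be wrapped in a Runnable. MedianTimer and its method names are hypothetical; they are not taken from the csv-perf code or from any of the libraries compared. It uses only System.nanoTime and Arrays.sort, so it should compile under 1.5.

```java
import java.util.Arrays;

/** Hypothetical timing helper; not part of csv-perf or any CSV library. */
final class MedianTimer {

    private MedianTimer() {}

    /**
     * Returns the median of the samples. For an even number of samples
     * this is the midpoint of the two middle values, rounded down.
     */
    static long median(long[] samples) {
        if (samples.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        long[] sorted = samples.clone(); // leave the caller's array untouched
        Arrays.sort(sorted);
        int mid = sorted.length / 2;
        if (sorted.length % 2 == 1) {
            return sorted[mid];
        }
        // Midpoint written this way to avoid overflow on large values.
        return sorted[mid - 1] + (sorted[mid] - sorted[mid - 1]) / 2;
    }

    /** Runs the task the given number of times; returns the median elapsed nanoseconds. */
    static long medianNanos(Runnable task, int runs) {
        if (runs <= 0) {
            throw new IllegalArgumentException("runs must be positive");
        }
        long[] elapsed = new long[runs];
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            task.run();
            elapsed[i] = System.nanoTime() - start;
        }
        return median(elapsed);
    }
}
```

Each library would get its own Runnable and be measured with something like MedianTimer.medianNanos(task, 11). The median is less sensitive than the mean to a single slow run (JIT warm-up, GC pause), which is presumably why it was suggested; a few untimed warm-up runs before measuring would probably help as well.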