commons-user mailing list archives

From Henri Yandell <flame...@gmail.com>
Subject [csv] Creating a CSV component
Date Thu, 09 Jun 2005 01:46:01 GMT
Thought I'd summarise the CSV threads on the user list. Due to the
high level of user activity in those threads, I've mailed this to both
user and dev lists. Interested users should probably hop over to the
dev list at some point as I imagine future threads will consolidate
there.

Basic idea is for a parser library for csv files. There is a lot of
interest and various pieces of code have been offered:

Netcetera's csvparser. 
 - configurable delimiter (delimiter may appear in complex values)
 - complex values (including newlines -> multiline values)
 - unicode escapes
 - empty line skipping support
 - comment support
 - hardcoded record separators (\n or \r\n)
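To make the feature list above concrete, here is a minimal sketch of a parser with a configurable delimiter, quoted values that may contain the delimiter or newlines, comment lines, empty line skipping, and hardcoded \n or \r\n record separators. The class name SimpleCsvParser and its constructor parameters are made up for illustration; this is not Netcetera's code or API, and unicode escapes are left out.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch; not the API of any library mentioned in this thread. */
public class SimpleCsvParser {
    private final char delimiter;
    private final char commentStart;
    private final boolean skipEmptyLines;

    public SimpleCsvParser(char delimiter, char commentStart, boolean skipEmptyLines) {
        this.delimiter = delimiter;
        this.commentStart = commentStart;
        this.skipEmptyLines = skipEmptyLines;
    }

    /** Parses the whole input into records; each record is an array of field values. */
    public List<String[]> parse(String input) {
        List<String[]> records = new ArrayList<>();
        List<String> fields = new ArrayList<>();
        StringBuilder field = new StringBuilder();
        boolean inQuotes = false;
        boolean atRecordStart = true; // nothing consumed yet for the current record
        int n = input.length();
        for (int i = 0; i < n; i++) {
            char c = input.charAt(i);
            if (inQuotes) {
                if (c != '"') {
                    field.append(c);          // delimiters and newlines are plain data here
                } else if (i + 1 < n && input.charAt(i + 1) == '"') {
                    field.append('"');        // doubled quote is an escaped quote
                    i++;
                } else {
                    inQuotes = false;         // closing quote
                }
            } else if (atRecordStart && c == commentStart) {
                // Comment line: advance to the newline; the loop increment steps past it.
                while (i < n && input.charAt(i) != '\n') {
                    i++;
                }
            } else if (c == '"') {
                inQuotes = true;
                atRecordStart = false;
            } else if (c == delimiter) {
                fields.add(field.toString());
                field.setLength(0);
                atRecordStart = false;
            } else if (c == '\r' && i + 1 < n && input.charAt(i + 1) == '\n') {
                // First half of \r\n: ignore it, the following \n ends the record.
            } else if (c == '\n') {
                endRecord(records, fields, field, atRecordStart);
                atRecordStart = true;
            } else {
                field.append(c);
                atRecordStart = false;
            }
        }
        if (!atRecordStart) {
            endRecord(records, fields, field, false); // last record without trailing newline
        }
        return records;
    }

    private void endRecord(List<String[]> records, List<String> fields,
                           StringBuilder field, boolean emptyLine) {
        if (emptyLine && skipEmptyLines) {
            return;
        }
        fields.add(field.toString());
        field.setLength(0);
        records.add(fields.toArray(new String[0]));
        fields.clear();
    }
}
```

Usage would look something like new SimpleCsvParser(';', '#', true).parse(text). A real component would presumably read from a Reader and hand back one record at a time instead of holding everything in memory.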

OSJava's gj-csv (http://www.osjava.org/genjava/multiproject/gj-csv/)
 - configurable field/block delimiters
 - reader/writer, Map and 'DOM' apis
 - BSD licenced.
 - NOTE: My library, so I'm biased either for or against it.

Ostermiller's csv parser has been around for ages, but is GPL.
Still, a good thing to be feature competitive with.

Brian McCallister has a new CSV library - http://kasparov.skife.org/csv/. 
 - Apache licenced.

There are alternative ways to do CSV parsing:

  - JDBC API (http://csvjdbc.sourceforge.net/)
  - using ANTLR
(http://supportweb.cs.bham.ac.uk/documentation/tutorials/docsystem/build/tutorials/antlr/antlr.html#ANTLR-Translation-Example)
  - XML API (http://www.dpawson.co.uk/java/csv2xml.html)

Another question is where the code should go. [lang] and [io] have
been suggested, as has a [csv] component. I'll go as far as to say
that [csv] is the direction we should go and see if anybody disagrees
:)

There are important issues to remember:

* include a precise reference to a spec, if available, or to an
implementation (e.g. excel,
outlook, filemaker, ...)
* release early (i.e. get the basics out).
* Excel and others can be weird, it might need special support.

Wish-list features:

* bridge with Jelly (whatever Paul meant here)
* configurable column selection.
* Hibernate / struts property driven CSV read configuration. (Here I
am talking about referencing third party xml elements as target
references.)
* xsl driven CSV conversions (CSV to XML, CSV to HTML, CSV to EDI, CSV
to *new format*)
* CSVFilter, analogous to FileFilter -> column range, column width
range, row range
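As a rough illustration of the column selection and CSVFilter items, here is one possible shape, assuming records have already been parsed into String[] arrays. The names CsvFilter, rowRange, selectColumns and apply are invented for this sketch and do not come from any of the libraries above.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of record filtering and column selection. */
public class CsvFilterSketch {

    /** Plays the role java.io.FileFilter plays for files, but for CSV records. */
    public interface CsvFilter {
        boolean accept(int rowIndex, String[] record);
    }

    /** Accepts rows whose zero-based index lies in [first, last]. */
    public static CsvFilter rowRange(int first, int last) {
        return (rowIndex, record) -> rowIndex >= first && rowIndex <= last;
    }

    /** Picks the given zero-based columns, in the given order; missing columns become "". */
    public static String[] selectColumns(String[] record, int... columns) {
        String[] out = new String[columns.length];
        for (int i = 0; i < columns.length; i++) {
            int col = columns[i];
            out[i] = (col >= 0 && col < record.length) ? record[col] : "";
        }
        return out;
    }

    /** Keeps the accepted rows and projects each one onto the selected columns. */
    public static List<String[]> apply(List<String[]> records, CsvFilter filter, int... columns) {
        List<String[]> result = new ArrayList<>();
        for (int row = 0; row < records.size(); row++) {
            String[] record = records.get(row);
            if (filter.accept(row, record)) {
                result.add(selectColumns(record, columns));
            }
        }
        return result;
    }
}
```

A column range or column width filter could be written against the same interface, since accept sees the whole record.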


I'm prepared to help on a commons csv component. I obviously have the
itch/need for a basic csv library.

Opinions on where to start seem like the best next step.

* Netcetera
* OSJava
* Ostermiller (aka, people could ask for a licence change :) )
* Skife
* Start afresh

----

Looking at the Netcetera source, it looks to be nicely polished. Lots
of options to handle the CSV variations, fully javadoc'd and probably
100% test coverage (which is better than mine).

Src jar at:  
  ftp://ftp.netcetera.ch/pub/csvparser.jar
(not licenced for use currently, please don't use unless it becomes an
ASF codebase)

I'm interested in hearing any criticism of the Netcetera source.
API-wise, it's pretty much the way I think we'd all go with such a
problem, reader/writer-like. I added the concept of a CSV object to
mine as opposed to Netcetera's Object[][], and the CsvFieldReader is
for reading by column name and not index; but to be honest neither gets
used by me or at work.
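For anyone unsure what reading by column name rather than index means in practice, here is a small sketch. It assumes the first record is a header row, and the class NamedColumnReader is invented for illustration; it is not the actual CsvFieldReader from gj-csv.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: look fields up by header name instead of by index. */
public class NamedColumnReader {
    private final Map<String, Integer> indexByName = new HashMap<>();
    private final List<String[]> rows;

    /** Assumes the first record is the header row naming each column. */
    public NamedColumnReader(List<String[]> records) {
        if (records.isEmpty()) {
            throw new IllegalArgumentException("no header record");
        }
        String[] header = records.get(0);
        for (int i = 0; i < header.length; i++) {
            indexByName.put(header[i], i);
        }
        this.rows = records.subList(1, records.size());
    }

    public int rowCount() {
        return rows.size();
    }

    /** Returns the named field of the given data row, or null if that row is too short. */
    public String get(int row, String columnName) {
        Integer index = indexByName.get(columnName);
        if (index == null) {
            throw new IllegalArgumentException("unknown column: " + columnName);
        }
        String[] record = rows.get(row);
        return index < record.length ? record[index] : null;
    }
}
```

The point of the name lookup is that callers keep working when columns are reordered in the file, which an index-based Object[][] cannot offer.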

If there's a good agreement to start with the Netcetera source, I need
to dig up the legalities on company contributions and guide Stefan and
Netcetera through them. I'm not sure if we'd want to add any features
before an initial release, but I think we'd definitely want to chug
along at adding more tests, if only to get fully inside the API.

Any thoughts?

Hen


