commons-user mailing list archives

From "Chetan Sahasrabudhe" <>
Subject RE: CSV parsing/writing?
Date Thu, 26 May 2005 07:08:00 GMT
For a long time I have been looking for a high-performance solution for processing delimited data.

To illustrate the performance problem, please consider this example:

one,two,three,four,five ---- (',' is used as the delimiter)

While parsing this string, one needs to traverse it char by char to find each delimiter and
then act on the segment.
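
For reference, the char-by-char baseline described above might look like the following minimal Java sketch. The names CharByCharSplit and split are made up for illustration, and the sketch ignores quoting and escaped delimiters, which real CSV data would need.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical baseline: visit every char, cut a field at each delimiter.
final class CharByCharSplit {
    static List<String> split(String line, char delim) {
        List<String> fields = new ArrayList<>();
        int start = 0; // index where the current field begins
        for (int i = 0; i < line.length(); i++) {
            if (line.charAt(i) == delim) {
                fields.add(line.substring(start, i));
                start = i + 1;
            }
        }
        fields.add(line.substring(start)); // last field has no trailing delimiter
        return fields;
    }
}
```

Calling CharByCharSplit.split("one,two,three,four,five", ',') yields the five fields, and every char of the line is inspected exactly once.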

I am trying to figure out a solution that simulates how humans normally read.
Humans change their reading habits as they read more.

Initially we read char by char, then form a word in our brain and attach a meaning to it.
As we grow older we start picking up 2 to 3 words in one read and process them pretty fast.

The point I am trying to make here is: can we make our code more intelligent, so that it takes
a snapshot of the data and identifies a pattern?
I know this sounds pretty hazy, but I am looking for some way to stop parsing char by char and
develop an algorithm that reads the memory block in chunks and identifies whether there are any
delimiters in the chunk. If a delimiter is found, then parse that chunk char by char to get its position.
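
One way this chunk idea could be sketched in Java is to load 8 bytes at a time into a long and use a well-known bit trick to test whether any of those 8 bytes equals the delimiter, falling back to a char-by-char scan only for chunks that contain one. This is an illustration under assumptions, not a measured solution: the names ChunkScan, chunkHasDelimiter and delimiterPositions are hypothetical, the input is assumed to be single-byte (ASCII) data, and whether it actually beats a plain loop would need to be benchmarked, since fields as short as those in the example put a delimiter in almost every chunk.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: test 8 bytes at once, scan byte by byte only on a hit.
final class ChunkScan {
    private static final long ONES  = 0x0101010101010101L;
    private static final long HIGHS = 0x8080808080808080L;

    // True if any of the 8 bytes packed in 'word' equals 'delim'.
    static boolean chunkHasDelimiter(long word, byte delim) {
        long x = word ^ (ONES * (delim & 0xFF)); // bytes equal to delim become 0x00
        return ((x - ONES) & ~x & HIGHS) != 0;   // nonzero iff some byte of x is zero
    }

    // Returns the index of every delimiter byte in 'data'.
    static List<Integer> delimiterPositions(byte[] data, byte delim) {
        List<Integer> positions = new ArrayList<>();
        ByteBuffer buf = ByteBuffer.wrap(data);
        int i = 0;
        for (; i + 8 <= data.length; i += 8) {
            if (!chunkHasDelimiter(buf.getLong(i), delim)) {
                continue; // no delimiter in this chunk: skip all 8 bytes
            }
            for (int k = i; k < i + 8; k++) { // delimiter present: locate it
                if (data[k] == delim) positions.add(k);
            }
        }
        for (; i < data.length; i++) { // leftover tail shorter than 8 bytes
            if (data[i] == delim) positions.add(i);
        }
        return positions;
    }
}
```

The payoff, if any, would come from rows with long fields, where most chunks fail the test and are skipped without being inspected byte by byte.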

Take a small test here: count the number of commas in each row.


While looking at this test data, did you parse it char by char or read it a snapshot at a time?


-----Original Message-----
From: Simon Kitching []
Sent: Thursday, May 26, 2005 12:23 PM
To: Jakarta Commons Users List
Subject: RE: CSV parsing/writing?

If the goal of the project is small, i.e. just a class to parse csv, then
commons-io, commons-codec, commons-lang are the obvious parties. So it's
a matter of seeing if the committers on those projects are interested.

If the goal is larger, i.e. creating a new commons component itself, then
it is likely to be hard work. The way things usually become commons
components is that they are initially a successful part of some other
successful apache project and are spun off into a separate component
here. So one solution might be to find an apache project that would find
csv functionality useful, and then get the developers of that project to
join commons and become the "mentors" of a csv (or more ambitious)
project here.

Projects that might find csv handling useful include
 * workflow projects
 * B2B projects (geronimo?)
 * data import/export: POI?

It seems clear from the mails here that although there is some user
interest in this, there just aren't any existing committers willing to
dedicate the necessary time to mentoring this new project.

As another alternative, a project can be created on SourceForge, using
the Apache License. That way, Apache projects like the ones
listed above can happily use the code if they find a need to process csv
in the future. And at that point, friendly discussions might occur about
moving the project to apache commons.

Apache commons really isn't in the same business as sourceforge. This
means that not every good idea gets a home here. Or to look at it the
other way, if it doesn't find a home here that doesn't mean it isn't a
good idea.

(man, csv is a hard acronym to type. At least half the time it comes out
cvs :-).




