commons-dev mailing list archives

From Emmanuel Bourg <ebo...@apache.org>
Subject Re: [CSV] Performance
Date Thu, 15 Mar 2012 15:11:06 GMT
Thank you for sharing your experience, Ted. Do you have a link to the 
code of your parser? I'd like to take a look.

Currently the data flow in Commons CSV is:

1. Buffer the data in the BufferedReader
2. Accumulate data in a reusable buffer for the current token
3. Turn the token buffer into a String
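
The three stages above can be sketched roughly as follows (hypothetical names, not the actual Commons CSV classes): buffered input, a reusable token buffer, and a String copy created per field.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// Illustrative sketch of the data flow: BufferedReader -> reusable
// StringBuilder -> String. Not the real CSVLexer.
class SimpleLexer {
    private final BufferedReader in;                          // stage 1: buffered input
    private final StringBuilder token = new StringBuilder();  // stage 2: reusable token buffer

    SimpleLexer(Reader reader) {
        this.in = new BufferedReader(reader);
    }

    // Reads one comma-delimited field, or returns null at end of input.
    String nextField() throws IOException {
        token.setLength(0);            // reuse the same buffer for every token
        int c = in.read();
        if (c == -1) {
            return null;
        }
        while (c != -1 && c != ',' && c != '\n') {
            token.append((char) c);
            c = in.read();
        }
        return token.toString();       // stage 3: copy the buffer into a String
    }
}
```

Every call to toString() in stage 3 allocates a fresh String, which is where the copying cost discussed below comes from.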

I was also thinking of something similar to reduce the string copies. 
The token from the CSVLexer could probably contain a CharSequence 
instead of a String. The CharSequence would be backed by the same array 
for all the fields of the record. Thus if a field isn't read by the user 
we don't pay the cost to convert it into a String. But this prevents the 
reuse of the buffer, and that means more work for the GC.
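
A minimal sketch of that idea, with illustrative names only: each field is a CharSequence view over a shared char[] holding the whole record, so the String copy happens only if the caller actually reads the field.

```java
// Hypothetical field view; not a Commons CSV class. All fields of one
// record share the same backing array, so no per-field copy is made
// until toString() is called.
final class FieldView implements CharSequence {
    private final char[] record;  // backing array shared by all fields of the record
    private final int start;
    private final int end;

    FieldView(char[] record, int start, int end) {
        this.record = record;
        this.start = start;
        this.end = end;
    }

    @Override public int length() { return end - start; }

    @Override public char charAt(int index) { return record[start + index]; }

    @Override public CharSequence subSequence(int from, int to) {
        return new FieldView(record, start + from, start + to);
    }

    // The String conversion (and its copy) is deferred until requested.
    @Override public String toString() { return new String(record, start, end - start); }
}
```

The catch, as noted above, is that the backing array must stay live as long as any view over it does, so the buffer cannot be reused across records.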

Emmanuel Bourg


On 15/03/2012 15:49, Ted Dunning wrote:
> I built a limited CSV package for parsing data in Mahout at one point.  I
> doubt that it was general enough to be helpful here, but the experience
> might be.
>
> The thing that *really* made a big difference in speed was to avoid copies
> and conversions to String.  To do that, I built a state machine that
> operated on bytes to do the parsing from byte arrays.  The parser passed
> around offsets only.  Then when converting data, I converted directly from
> the original byte array into the target type.  For the most common case (in
> my data) of converting to Integers, this eliminated masses of cons'ing and
> because the conversion was special purpose (I assumed UTF8 encoding and
> assumed that numbers could only use ASCII range digits), the conversion to
> integers was particularly fast.
>
> Overall, this made about a 20x difference in speed.  This is not 20%; the
> final time was 5% of the original.

