commons-dev mailing list archives

From Ted Dunning
Subject Re: [CSV] Performance
Date Fri, 16 Mar 2012 06:00:50 GMT
See the Line and FastLine classes
in org.apache.mahout.classifier.sgd.SimpleCsvExamples in the Mahout
Examples module.

You can see an older version of mahout here.  This class hasn't changed in

On Thu, Mar 15, 2012 at 3:11 PM, Emmanuel Bourg wrote:

> Thank you for sharing your experience, Ted. Do you have a link to the code
> of your parser? I'd like to take a look.
> Currently the data flow in Commons CSV is:
> 1. Buffer the data in the BufferedReader
> 2. Accumulate data in a reusable buffer for the current token
> 3. Turn the token buffer into a String
> I was also thinking of something similar to reduce the string copies. The
> token from the CSVLexer could probably contain a CharSequence instead of a
> String. The CharSequence would be backed by the same array for all the
> fields of the record. Thus if a field isn't read by the user we don't pay
> the cost to convert it into a String. But this prevents the reuse of the
> buffer, and that means more work for the GC.
> Emmanuel Bourg
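
The CharSequence idea described above can be sketched roughly as follows. This is only an illustration, not code from Commons CSV: the FieldView class and its constructor arguments are hypothetical names, and the sketch assumes the whole record is already held in a single char array. A String is created only when toString() is called, so fields that are never read cost no copy. It also makes the trade-off visible: each view keeps a reference to the shared array, so that array cannot be reused for the next record while any view is still in use.

```java
/** Hypothetical read-only view of one field inside a shared record buffer. */
final class FieldView implements CharSequence {
    private final char[] buffer; // shared by every field of the record
    private final int start;     // inclusive offset of the field
    private final int end;       // exclusive offset of the field

    public FieldView(char[] buffer, int start, int end) {
        this.buffer = buffer;
        this.start = start;
        this.end = end;
    }

    @Override
    public int length() {
        return end - start;
    }

    @Override
    public char charAt(int index) {
        if (index < 0 || index >= length()) {
            throw new IndexOutOfBoundsException("index " + index);
        }
        return buffer[start + index];
    }

    @Override
    public CharSequence subSequence(int from, int to) {
        if (from < 0 || to > length() || from > to) {
            throw new IndexOutOfBoundsException("range " + from + ".." + to);
        }
        // Still no copy: the sub-view points into the same array.
        return new FieldView(buffer, start + from, start + to);
    }

    @Override
    public String toString() {
        // The only place where characters are copied into a new String.
        return new String(buffer, start, end - start);
    }

    public static void main(String[] args) {
        char[] record = "abc,de".toCharArray();
        CharSequence first = new FieldView(record, 0, 3);
        CharSequence second = new FieldView(record, 4, 6);
        System.out.println(first.length()); // no String created here
        System.out.println(second);         // String created on demand
    }
}
```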
> On 15/03/2012 15:49, Ted Dunning wrote:
>> I built a limited CSV package for parsing data in Mahout at one point.  I
>> doubt that it was general enough to be helpful here, but the experience
>> might be.
>> The thing that *really* made a big difference in speed was to avoid copies
>> and conversions to String.  To do that, I built a state machine that
>> operated on bytes to do the parsing from byte arrays.  The parser passed
>> around offsets only.  Then when converting data, I converted directly from
>> the original byte array into the target type.  For the most common case
>> (in my data) of converting to Integers, this eliminated masses of cons'ing and
>> because the conversion was special purpose (I assumed UTF8 encoding and
>> assumed that numbers could only use ASCII range digits), the conversion to
>> integers was particularly fast.
>> Overall, this made about a 20x difference in speed.  This is not 20%; the
>> final time was 5% of the original.
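
For readers who want a concrete picture of the offsets-only approach quoted above, here is a minimal sketch. It is not the Mahout code: the class and method names are made up for illustration, and it assumes unquoted fields, a single-byte separator, ASCII digits, and values that fit in an int (there is no overflow check). A real parser would need a fuller state machine to handle quoting and escapes.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical sketch: split on offsets and convert bytes straight to int. */
final class ByteFieldParser {

    /**
     * Records the start offset of each field in line[0..length) and returns
     * the field count. The caller must size starts for the expected fields.
     */
    public static int splitOffsets(byte[] line, int length, byte separator, int[] starts) {
        int count = 0;
        starts[count++] = 0;
        for (int i = 0; i < length; i++) {
            if (line[i] == separator) {
                starts[count++] = i + 1;
            }
        }
        return count;
    }

    /** Parses an ASCII decimal integer from buf[start..end) without creating a String. */
    public static int parseInt(byte[] buf, int start, int end) {
        boolean negative = start < end && buf[start] == '-';
        int i = negative ? start + 1 : start;
        if (i >= end) {
            throw new NumberFormatException("empty field");
        }
        int value = 0;
        for (; i < end; i++) {
            int digit = buf[i] - '0';
            if (digit < 0 || digit > 9) {
                throw new NumberFormatException("not a digit at offset " + i);
            }
            value = value * 10 + digit;
        }
        return negative ? -value : value;
    }

    public static void main(String[] args) {
        byte[] line = "12,-7,300".getBytes(StandardCharsets.US_ASCII);
        int[] starts = new int[8];
        int n = splitOffsets(line, line.length, (byte) ',', starts);
        for (int f = 0; f < n; f++) {
            // A field ends just before the next separator, or at end of line.
            int end = (f + 1 < n) ? starts[f + 1] - 1 : line.length;
            System.out.println(parseInt(line, starts[f], end));
        }
    }
}
```

The only per-line state is the reusable starts array, so no String or Integer objects are allocated while parsing.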
