commons-dev mailing list archives

From Ted Dunning <ted.dunn...@gmail.com>
Subject Re: [CSV] Performance
Date Fri, 16 Mar 2012 06:00:50 GMT
See the Line and FastLine classes
in org.apache.mahout.classifier.sgd.SimpleCsvExamples in the Mahout
Examples module.

You can see an older version of Mahout at the link below.  This class hasn't
changed in forever.

https://github.com/tdunning/mahout/blob/debian-package/examples/src/main/java/org/apache/mahout/classifier/sgd/SimpleCsvExamples.java

On Thu, Mar 15, 2012 at 3:11 PM, Emmanuel Bourg <ebourg@apache.org> wrote:

> Thank you for sharing your experience Ted. Do you have a link to the code
> of your parser? I'd like to take a look.
>
> Currently the data flow in Commons CSV is:
>
> 1. Buffer the data in the BufferedReader
> 2. Accumulate data in a reusable buffer for the current token
> 3. Turn the token buffer into a String
>
> I was also thinking of something similar to reduce the string copies. The
> token from the CSVLexer could probably contain a CharSequence instead of a
> String. The CharSequence would be backed by the same array for all the
> fields of the record. Thus if a field isn't read by the user we don't pay
> the cost to convert it into a String. But this prevents the reuse of the
> buffer, and that means more work for the GC.
>
> Emmanuel Bourg
>
>
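The CharSequence idea in the message above could look roughly like the sketch below. It is not Commons CSV code: the class name FieldView and its constructor are hypothetical, and it assumes the lexer has already recorded the start and end offsets of each field in a shared char array.

```java
/**
 * Hypothetical read-only view over one field of a record.
 * All fields of the record share the same backing array; a String is
 * only allocated if the caller asks for one via toString().
 */
final class FieldView implements CharSequence {
    private final char[] buf;  // shared by every field of the record
    private final int start;   // inclusive offset of the field
    private final int end;     // exclusive offset of the field

    FieldView(char[] buf, int start, int end) {
        this.buf = buf;
        this.start = start;
        this.end = end;
    }

    @Override
    public int length() {
        return end - start;
    }

    @Override
    public char charAt(int index) {
        if (index < 0 || index >= length()) {
            throw new IndexOutOfBoundsException("index " + index);
        }
        return buf[start + index];
    }

    @Override
    public CharSequence subSequence(int from, int to) {
        if (from < 0 || to > length() || from > to) {
            throw new IndexOutOfBoundsException(from + ".." + to);
        }
        // Still no copy: the sub-view points into the same array.
        return new FieldView(buf, start + from, start + to);
    }

    @Override
    public String toString() {
        // The only place where characters are copied.
        return new String(buf, start, end - start);
    }
}
```

The trade-off raised in the message shows up directly: as long as any FieldView is reachable, its backing array cannot be refilled with the next record, so each record needs a fresh array.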
> On 15/03/2012 15:49, Ted Dunning wrote:
>
>> I built a limited CSV package for parsing data in Mahout at one point.  I
>> doubt that it was general enough to be helpful here, but the experience
>> might be.
>>
>> The thing that *really* made a big difference in speed was to avoid copies
>> and conversions to String.  To do that, I built a state machine that
>> operated on bytes to do the parsing from byte arrays.  The parser passed
>> around offsets only.  Then when converting data, I converted directly from
>> the original byte array into the target type.  For the most common case
>> (in my data) of converting to Integers, this eliminated masses of cons'ing and
>> because the conversion was special purpose (I assumed UTF8 encoding and
>> assumed that numbers could only use ASCII range digits), the conversion to
>> integers was particularly fast.
>>
>> Overall, this made about a 20x difference in speed.  This is not 20%; the
>> final time was 5% of the original.
>>
>
>
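For readers who want a feel for the approach in the quoted message, here is a minimal sketch of converting fields to integers straight from a byte array using offsets only. It is not the Mahout code linked above: the class and method names are hypothetical, it handles only unquoted comma-separated fields, and it assumes ASCII digits with an optional leading minus sign and no overflow checking.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical sketch: integer fields parsed from bytes without creating Strings. */
class ByteFieldSketch {

    /** Parses a decimal int from buf[start, end). Assumes ASCII digits; no overflow check. */
    static int parseInt(byte[] buf, int start, int end) {
        if (start >= end) {
            throw new NumberFormatException("empty field");
        }
        boolean negative = buf[start] == '-';
        int i = negative ? start + 1 : start;
        if (i >= end) {
            throw new NumberFormatException("sign without digits");
        }
        int value = 0;
        for (; i < end; i++) {
            int digit = buf[i] - '0';
            if (digit < 0 || digit > 9) {
                throw new NumberFormatException("non-digit byte at offset " + i);
            }
            value = value * 10 + digit;
        }
        return negative ? -value : value;
    }

    /** Sums the integer fields found in line[from, to), passing only offsets around. */
    static long sumFields(byte[] line, int from, int to) {
        long sum = 0;
        int fieldStart = from;
        for (int i = from; i <= to; i++) {
            // A comma or the end of the range closes the current field.
            if (i == to || line[i] == ',') {
                sum += parseInt(line, fieldStart, i);
                fieldStart = i + 1;
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        byte[] line = "12,-7,30".getBytes(StandardCharsets.US_ASCII);
        System.out.println(sumFields(line, 0, line.length)); // prints 35
    }
}
```

No String or Integer objects are allocated per field here, which is the kind of saving the message attributes the speedup to. A real parser would also need quoting, escapes and line handling.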
