commons-dev mailing list archives

From:    Ted Dunning <ted.dunn...@gmail.com>
Subject: Re: [CSV] Performance
Date:    Fri, 16 Mar 2012 06:04:42 GMT
On Thu, Mar 15, 2012 at 3:11 PM, Emmanuel Bourg <ebourg@apache.org> wrote:

> ...
> 1. Buffer the data in the BufferedReader
> 2. Accumulate data in a reusable buffer for the current token
>

Reusable buffers are usually death in terms of subtle bugs, and they rarely
help that much anyway.  The key is to avoid copying.  Cons'ing up and then
collecting small pointer-based structures isn't that expensive.
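To make that concrete, the kind of structure I have in mind is just a tiny
object that records offsets into the shared read buffer.  The names below
are made up; this is only a sketch, not actual Commons CSV code:

    // Illustrative only: an ephemeral token that points into the lexer's
    // read buffer instead of copying characters into a reusable buffer.
    final class Token {
        final char[] buf;   // the shared read buffer
        final int start;    // index of the first character of the token
        final int end;      // index one past the last character

        Token(char[] buf, int start, int end) {
            this.buf = buf;
            this.start = start;
            this.end = end;
        }

        // Only pay for a String if the caller really asks for one.
        String materialize() {
            return new String(buf, start, end - start);
        }
    }

Allocating one of these per field is exactly the kind of ephemeral,
lightweight structure that the collector handles essentially for free.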

> 3. Turn the token buffer into a String
>

Also EVIL.  Mustn't convert to a String unless that is really what you want.


> I was also thinking of something similar to reduce the string copies. The
> token from the CSVLexer could probably contain a CharSequence instead of a
> String. The CharSequence would be backed by the same array for all the
> fields of the record. That way, if a field isn't read by the user, we don't
> pay the cost of converting it into a String. But this prevents the reuse of
> the buffer, and that means more work for the GC.
>

Just moving around chars costs twice as much as moving around bytes for most
CSV data, since a Java char is two bytes and most CSV content is ASCII.  I
would avoid that if possible.
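For reference, the field-as-a-view idea you describe would look something
like the following over chars.  This is purely illustrative; a real version
needs bounds checks:

    // Illustrative sketch of a field exposed as a CharSequence backed by
    // the record's shared buffer; a String is created only on toString().
    final class FieldView implements CharSequence {
        private final char[] recordBuf;
        private final int offset;
        private final int length;

        FieldView(char[] recordBuf, int offset, int length) {
            this.recordBuf = recordBuf;
            this.offset = offset;
            this.length = length;
        }

        @Override public int length() { return length; }

        @Override public char charAt(int index) {
            return recordBuf[offset + index];
        }

        @Override public CharSequence subSequence(int start, int end) {
            return new FieldView(recordBuf, offset + start, end - start);
        }

        @Override public String toString() {
            return new String(recordBuf, offset, length);
        }
    }

The same shape works over a byte[] if you drop CharSequence and convert
directly to the target type instead.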

I wouldn't worry about the GC.  The experience in Hadoop and Lucene is that
the effort made to avoid allocating lightweight structures was very
misguided.  My own experiments have never shown a big benefit unless you
conflate cons'ing the structures with copying lots of data.  If you avoid
the copy, the construction and collection of ephemeral structures turns out
to be very nearly free.


> Emmanuel Bourg
>
>
> On 15/03/2012 15:49, Ted Dunning wrote:
>
>> I built a limited CSV package for parsing data in Mahout at one point.  I
>> doubt that it was general enough to be helpful here, but the experience
>> might be.
>>
>> The thing that *really* made a big difference in speed was to avoid copies
>> and conversions to String.  To do that, I built a state machine that
>> operated on bytes to do the parsing from byte arrays.  The parser passed
>> around offsets only.  Then, when converting data, I converted directly from
>> the original byte array into the target type.  For the most common case in
>> my data, converting to Integers, this eliminated masses of cons'ing, and
>> because the conversion was special-purpose (I assumed UTF-8 encoding and
>> that numbers could only use ASCII-range digits), the conversion to
>> integers was particularly fast.
>>
>> Overall, this made about a 20x difference in speed.  This is not 20%; the
>> final time was 5% of the original.
>>
>
>
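For illustration, the special-purpose conversion described above is roughly
the following.  This is a sketch, not the Mahout code; it assumes ASCII
digits with an optional leading minus sign and does no overflow or
validation handling:

    // Illustrative only: parse an int directly from a byte range,
    // without materializing an intermediate String.
    static int parseInt(byte[] buf, int start, int end) {
        int i = start;
        boolean negative = false;
        if (i < end && buf[i] == '-') {
            negative = true;
            i++;
        }
        int value = 0;
        for (; i < end; i++) {
            value = value * 10 + (buf[i] - '0');
        }
        return negative ? -value : value;
    }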
