incubator-cassandra-user mailing list archives

From Benoit Perroud <ben...@noisette.ch>
Subject Re: SSTableSimpleUnsortedWriter take long time when inserting big rows
Date Fri, 02 Sep 2011 09:17:08 GMT
Thanks for your answer.

2011/9/2 Sylvain Lebresne <sylvain@datastax.com>:
> On Fri, Sep 2, 2011 at 10:29 AM, Benoit Perroud <benoit@noisette.ch> wrote:
>> Hi All,
>>
>> I started using SSTableSimpleUnsortedWriter to load data, and my data
>> has few rows but a lot of column names in each row.
>>
>> I call SSTableSimpleUnsortedWriter.newRow every 10'000 columns inserted.
>>
>> But the time taken to insert columns increases as the column
>> family grows. The problem appears because every time we call
>> newRow, all the columns of the previous CF are added to the new CF.
>
> If I understand correctly, each row has way more than 10 000 columns, but
> you call newRow every 10 000 columns, right?

Yes. I call newRow every 10 000 columns to be sure to flush as soon as possible.

> Note that you have the possibility to decrease the frequency of the calls to
> newRow.
>
> But anyway, I agree that the code shouldn't suck like that.
>
>> Attached is a small patch that checks which CF is the smallest, and adds
>> the smallest CF to the biggest one.
>>
>> Should I open a bug for that?
>
> Please do. I'm actually thinking of a slightly different fix: we should not have
> to add all the previous columns to the new column family, we should just
> directly reuse the previous column family when adding the new column.
> But the JIRA ticket will be a better place to discuss this.

Opened : https://issues.apache.org/jira/browse/CASSANDRA-3122
Let's discuss there.
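
For readers of the archive, here is a minimal sketch of the difference between the behaviour described above and the reuse approach Sylvain suggests. It is a toy model, not the Cassandra code: the class RowBufferSketch, its methods, and the use of a TreeMap to stand in for a column family are all assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

/** Toy model of a writer that buffers rows in memory (hypothetical, not the Cassandra classes). */
class RowBufferSketch {
    // row key -> sorted columns (name -> value); the TreeMap stands in for a column family
    private final Map<String, TreeMap<String, String>> buffer = new HashMap<>();
    private TreeMap<String, String> current;

    // Number of columns copied from an old map into a new one, to make the cost visible.
    long copiedColumns = 0;

    /** Behaviour described in the thread: a fresh map on each call, previous columns copied in. */
    void newRowCopying(String key) {
        TreeMap<String, String> fresh = new TreeMap<>();
        TreeMap<String, String> previous = buffer.get(key);
        if (previous != null) {
            fresh.putAll(previous);            // cost grows with the size of the row so far
            copiedColumns += previous.size();
        }
        buffer.put(key, fresh);
        current = fresh;
    }

    /** Reuse approach: keep appending to the map already buffered for this key. */
    void newRowReusing(String key) {
        current = buffer.computeIfAbsent(key, k -> new TreeMap<>());
    }

    /** Adds a column to the row selected by the last newRow* call. */
    void addColumn(String name, String value) {
        current.put(name, value);
    }

    int columnCount(String key) {
        TreeMap<String, String> cf = buffer.get(key);
        return cf == null ? 0 : cf.size();
    }
}
```

In this model, calling newRowCopying for the same key once per batch copies every column buffered so far each time, so the total copying work grows quadratically with the row size, while newRowReusing copies nothing.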

Thanks !

Benoit.

> --
> Sylvain
>
