incubator-cassandra-user mailing list archives

From David Boxenhorn <da...@lookin2.com>
Subject Re: What is the optimal size of batch mutate batches?
Date Tue, 11 May 2010 12:31:14 GMT
Thanks a lot! 25,000 is a number I can work with.

Any other suggestions?

On Tue, May 11, 2010 at 3:21 PM, Ben Browning <ben324@gmail.com> wrote:

> I like to base my batch sizes off of the total number of columns
> instead of the number of rows. This effectively means counting the
> number of Mutation objects in your mutation map and submitting the
> batch once it reaches a certain size. For my data, batch sizes of
> about 25,000 columns work best. You'll need to adjust this up or down
> depending on the size of your column names / values and available
> memory.
>
> With this strategy the "bushiness" of your rows shouldn't be a problem.
>
> Ben
>
>
> On Tue, May 11, 2010 at 7:54 AM, David Boxenhorn <david@lookin2.com> wrote:
> > I am saving a large amount of data to Cassandra using batch mutate. I have
> > found that my speed is proportional to the size of the batch. It was very
> > slow when I was inserting one row at a time, but when I created batches of
> > 100 rows and mutated them together, it went 100 times faster. (OK, I didn't
> > measure it, but it was MUCH faster.)
> >
> > My problem is that my rows are of very varying degrees of bushiness (i.e.
> > number of supercolumns and columns per row). I inserted 592,500 rows
> > successfully, in a few minutes, and then I hit a batch of exceptionally
> > bushy rows and ran out of memory.
> >
> > Does anyone have any suggestions about how to deal with this problem? I can
> > make my algorithm smarter by taking into account the size of the rows and
> > not just blindly do 100 at a time, but I want to solve this problem as
> > generally as possible, and not depend on trial and error, and on the
> > specific configuration of the machine I happen to be working on right now. I
> > don't even know if the critical parameter is the total size of the values,
> > or the number of columns, or what? Or maybe there's some optimal batch size,
> > and that's what I should use always?
> >
> > Thanks.
> >
>
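
The approach Ben describes in the quoted message (count mutations rather than rows, and submit once a threshold is reached) could be sketched roughly as below. This is only an illustration, not the Cassandra Thrift client API: the class `ColumnCountBatcher`, its `send` callback, and the nested-dict layout of the mutation map are all assumptions, and `send` stands in for whatever code actually issues the batch mutate call. The 25,000 default is simply the figure quoted for Ben's data.

```python
from collections import defaultdict
from typing import Any, Callable, Dict, List

# Assumed shape for illustration: {row_key: {column_family: [mutation, ...]}}
MutationMap = Dict[str, Dict[str, List[Any]]]


class ColumnCountBatcher:
    """Buffers mutations and flushes once a column-count threshold is hit."""

    def __init__(self, send: Callable[[MutationMap], None],
                 max_columns: int = 25_000) -> None:
        self._send = send                # hypothetical: wraps the real batch mutate call
        self._max_columns = max_columns  # tune for column name/value size and memory
        self._pending = defaultdict(lambda: defaultdict(list))
        self._count = 0                  # total mutations buffered, across all rows

    def add(self, row_key: str, column_family: str, mutation: Any) -> None:
        """Buffer one mutation; flush automatically when the threshold is reached."""
        self._pending[row_key][column_family].append(mutation)
        self._count += 1
        if self._count >= self._max_columns:
            self.flush()

    def flush(self) -> None:
        """Send whatever is buffered as one batch, then start a new buffer."""
        if self._count == 0:
            return
        batch = {key: dict(cfs) for key, cfs in self._pending.items()}
        self._send(batch)
        self._pending = defaultdict(lambda: defaultdict(list))
        self._count = 0
```

Because the threshold is on buffered mutations, one very bushy row can be split across several batches, while many sparse rows are packed together into one. Call `flush()` once at the end of the load to send the final partial batch.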
