cassandra-user mailing list archives

From Robert Coli <rc...@eventbrite.com>
Subject Re: Performance Difference between Batch Insert and Bulk Load
Date Mon, 01 Dec 2014 22:27:46 GMT
On Mon, Dec 1, 2014 at 12:10 PM, Dong Dai <daidongly@gmail.com> wrote:

> I guess you mean that BulkLoader is done by streaming whole SSTable to
> remote servers, so it is faster?
>

Well, it's not exactly "whole SSTable" but yes, that's the sort of
statement I'm making. [1]


> The documentation says that all the rows in the SSTable will be inserted
> into the new cluster conforming to the replication strategy of that
> cluster. This gives me a feeling that the BulkLoader was done by calling
> insertion after being transmitted to coordinators.
>

A good slide deck from pgorla, here:

http://www.slideshare.net/DataStax/bulk-loading-data-into-cassandra

General background.

http://www.palominodb.com/blog/2012/09/25/bulk-loading-options-cassandra

But briefly, no. It uses the streaming interface, not the client interface.
Data that arrives over the streaming interface bypasses the whole
commitlog/memtable process.
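
To make the difference between the two paths concrete, here is a toy model in
Python. It is only a sketch and not Cassandra code: the class ToyReplica and
its methods are made-up names, and it assumes every client write is appended
to the commitlog and later flushed from the memtable to an SSTable, while
streamed data is written to an SSTable directly.

```python
from dataclasses import dataclass, field


@dataclass
class ToyReplica:
    """Hypothetical stand-in for a replica; counts bytes written to disk."""
    commitlog_bytes: int = 0
    sstable_bytes: int = 0
    memtable: list = field(default_factory=list)

    def client_write(self, row: bytes) -> None:
        # Client path: append to the commitlog, then buffer in the memtable.
        self.commitlog_bytes += len(row)
        self.memtable.append(row)

    def flush(self) -> None:
        # Memtable flush: buffered rows are written out as an SSTable.
        self.sstable_bytes += sum(len(r) for r in self.memtable)
        self.memtable.clear()

    def stream_in(self, rows: list) -> None:
        # Streaming path: incoming data goes straight to an SSTable,
        # with no commitlog append and no memtable buffering.
        self.sstable_bytes += sum(len(r) for r in rows)

    @property
    def disk_bytes(self) -> int:
        return self.commitlog_bytes + self.sstable_bytes


rows = [b"x" * 100 for _ in range(1000)]

via_client = ToyReplica()
for r in rows:
    via_client.client_write(r)
via_client.flush()

via_stream = ToyReplica()
via_stream.stream_in(rows)

print(via_client.disk_bytes, via_stream.disk_bytes)  # 200000 100000
```

The model counts only bytes written to disk, so it says nothing about the
other costs of a client write.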

> I have this question because I tried batch insertion. It is too fast and
> makes me think that BulkLoader can not beat it.
>

Turn off writes to the commitlog with durable_writes:false and you can
simulate how much faster it would be without the double-write to the
commitlog. That said, the double-write to the commitlog is one of the most
significant overheads of doing a write from the client, but it is far from
the only overhead.
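
For reference, durable_writes is a keyspace-level option, so the experiment
comes down to one ALTER KEYSPACE statement before the test and one after. A
minimal sketch in Python that only builds the CQL text; the helper name
durable_writes_cql and the keyspace name bulk_test are made up for
illustration, and you would run the output through cqlsh or your driver:

```python
import re


def durable_writes_cql(keyspace: str, enabled: bool) -> str:
    """Build an ALTER KEYSPACE statement that toggles durable_writes.

    Hypothetical helper: it returns CQL text and does not talk to a cluster.
    """
    # Accept only plain unquoted identifiers so the name can be inlined safely.
    if not re.fullmatch(r"[A-Za-z][A-Za-z0-9_]*", keyspace):
        raise ValueError(f"not a simple keyspace name: {keyspace!r}")
    flag = "true" if enabled else "false"
    return f"ALTER KEYSPACE {keyspace} WITH durable_writes = {flag};"


# Disable commitlog writes for the benchmark, then restore them afterwards.
print(durable_writes_cql("bulk_test", False))
print(durable_writes_cql("bulk_test", True))
```

Remember to turn it back on afterwards; with durable_writes off, anything
still in a memtable is lost if the node dies before a flush.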

=Rob

[1] http://www.datastax.com/dev/blog/streaming-in-cassandra-2-0
