cassandra-user mailing list archives

From Graham Sanderson <gra...@vast.com>
Subject Re: To batch or not to batch: A question for fast inserts
Date Sun, 27 Sep 2015 17:06:16 GMT
We are about to prototype upgrading our batch inserts, so I’m really glad about this thread… We are able to saturate our dedicated network links from Hadoop when inserting via the Thrift API (Astyanax); at the time we wrote that code, CQL wasn’t there.

Reasons to replace our current solution:

1) We don’t have GC problems, but Thrift’s lack of streaming (or the C* decision not to have Quaid turn it on)… definitely means we allocate a lot more memory on the server than we should.
2) Lack of token awareness.
3) Thrift is not given any love.

Tuning the batch size is key. Apart from finding the performance sweet spot, we had a bug at first where we capped batches by byte size rather than by number of mutations: we store “key frames” and then deltas of our data, so we were batching up to about 50k mutations in the delta case, which might be OK for throughput, but is horrible if one mutation fails!
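
For the prototype, something like this count-capped sketch with the DataStax Java driver is what we have in mind (maxMutations is an illustrative knob, not a tested recommendation):

import com.datastax.driver.core.{BatchStatement, Session, Statement}

// Cap each unlogged batch by mutation count, not byte size, so a
// single failed mutation invalidates at most maxMutations of them.
def submitInBatches(session: Session, statements: Seq[Statement],
                    maxMutations: Int = 100): Unit =
  statements.grouped(maxMutations).foreach { chunk =>
    val batch = new BatchStatement(BatchStatement.Type.UNLOGGED)
    chunk.foreach(batch.add)
    session.execute(batch)
  }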

> On Sep 27, 2015, at 11:45 AM, Gerard Maas <gerard.maas@gmail.com> wrote:
> 
> Hi Eric, Ryan,
> 
> Thanks a lot for your insights. I got more than I hoped for in this discussion.
> I'll further improve our code to include the replica-awareness and will compare that
to the previous tests.
> 
> That snippet of code is really helpful. Thanks.
> 
> I have not been on the list long enough to have read the discussion you mention, and search delivers too many results. Any pointer to that?
> What are the doom scenarios that unlogged batch would bring along? 
> 
> When we ported the results of our tests to our application logic, the Cassandra cluster CPU load dropped 66% compared to the same ingest rate using single-statement async loading. We are certainly a step further in this area.
> 
> I'll drop our findings in a blog post. When I did my initial research, I didn't find any material supporting unlogged batches as a viable alternative for fast insertion. The blogosphere seems to endorse async as the only approach. As Ryan mentioned, that is case-dependent, and I think devs should be exposed to the pros and cons of every alternative so they can evaluate the best approach for their particular scenario.
> 
> Thanks a ton!
> 
> Kind regards, Gerard.
> 
> On Fri, Sep 25, 2015 at 10:04 PM, Eric Stevens <mightye@gmail.com> wrote:
> Yep, my approach is definitely naive to hotspotting.  If someone had that trouble, they
could exhaust the iterator out of getReplicas() and distribute their writes more evenly (which
might result in better statement distribution, but wouldn't change the workload on the cluster).
 In the end they're going to get in trouble with hotspotting regardless of async single statements
or batches.  The single statement async code prefers the first replica returned, so this logic
is consistent with the default model.
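> 
> A rough sketch of that spreading (untested; the round-robin counter and groupByAnyReplica are illustrative variants of the grouping code quoted further down):
> 
> import java.util.concurrent.atomic.AtomicLong
> import scala.collection.JavaConverters._
> import scala.util.control.NonFatal
> 
> val rr = new AtomicLong()
> 
> // Like groupByFirstReplica below, but cycles through all replicas
> // round-robin so a stupidly fat partition doesn't hammer one node.
> def groupByAnyReplica()(implicit session: CQLSession): Map[Host, CQLBatch] = {
>   val meta = session.getCluster.getMetadata
>   statements.groupBy { st =>
>     try {
>       val replicas = meta.getReplicas(st.getKeyspace, st.getRoutingKey).asScala.toIndexedSeq
>       if (replicas.isEmpty) null
>       else replicas((rr.getAndIncrement() % replicas.size).toInt)
>     } catch { case NonFatal(e) => null }
>   } mapValues { st => CQLBatch(st) }
> }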
> 
> > Lots of folks are still stuck on maximum utilization; ironically, these same people tend to focus on using spindles for storage and so will ultimately end up having to throttle ingest to allow compaction to catch up
> 
> Yeah, otherwise known as cost sensitivity, with the unfortunate side effect of making it easy for a new operator to accidentally overwhelm a cluster, since the warning signs look different than they do for most other data stores.
> 
> Straying a bit far afield here, but I actually think it would be a nice feature if, by default, Cassandra artificially throttled writes as compaction starts getting behind, as an early warning sign (a feature you could turn off with a flag). Cassandra does a great job of absorbing bursty writes, but unfortunately that masks (for the new operator) the warning signs that your sustained write rate is more than the cluster can handle. Writes are still fast, so you assume the cluster is healthy, and by the time there's backpressure to the client, you're possibly already past the point of simple recovery (e.g. you no longer have enough excess I/O to support bootstrapping new nodes). Throttling would also free up some I/O to keep the cluster from tipping over so hard.
> 
> On Fri, Sep 25, 2015 at 12:14 PM, Ryan Svihla <rs@foundev.pro> wrote:
> 
> I think my main point is still: unlogged token-aware batches are great, but if your writes are large enough, they may actually hurt rather than help, and likewise if your writes are too small, async-only is likely going to hurt. I’d say the average user I’ve had to help (with my selection bias) has individual writes already on the large side of optimal, so batching frequently hurts them. Also they tend not to do async in the first place.
> 
> In summary, batch-or-not is IMO the wrong area to focus on; total write payload sizing for your cluster is the factor to focus on, and however you get there is fantastic. More replies inline:
> 
>> On Sep 25, 2015, at 1:24 PM, Eric Stevens <mightye@gmail.com> wrote:
>> 
>> > compaction usually is the limiter for most clusters, so the difference between async versus unlogged batch ends up being minor or worse, non-existent, because the hardware and data model combination results in compaction being the main throttle.
>> 
>> If your number of records to load per second is predetermined (as would be the case in any production use case), then batching vs. single statements makes no difference to compaction; your cluster needs to support the same number and shape of mutates either way.
> 
> Not everyone is as grown up about their cluster sizing. Lots of folks are still stuck on maximum utilization; ironically, these same people tend to focus on using spindles for storage and so will ultimately end up having to throttle ingest to allow compaction to catch up. Anyway, in these admittedly awful situations, throttling of ingest is all too common, as the commit log can easily outstrip compaction.
> 
>> 
>> > if you add in token awareness to your batch, you’ve basically eliminated the primary complaint of using unlogged batches, so why not do that.
>> 
>> This is absolutely the right idea if your driver supports it, but the gain is smaller
than I would have expected based on the warnings of imminent doom when we've had this conversation
before.  If your driver supports token awareness, use that to group statements by primary
replica and concurrently execute those that way.  Here's the code we're using (in Scala using
the Java driver):
>> import scala.util.control.NonFatal  // for the NonFatal extractor below
>> 
>> def groupByFirstReplica()(implicit session: CQLSession): Map[Host, CQLBatch] = {
>>   val meta = session.getCluster.getMetadata
>>   statements.groupBy { st =>
>>     try {
>>       // The first replica returned is the one the token-aware load
>>       // balancing policy would prefer for a single statement too.
>>       meta.getReplicas(st.getKeyspace, st.getRoutingKey).iterator().next
>>     } catch { case NonFatal(e) =>
>>       // Routing key couldn't be resolved; these statements land in a
>>       // null group and get batched without token awareness.
>>       null
>>     }
>>   } mapValues { st => CQLBatch(st) }
>> }
>> We now have a map of primary host to sub-batch for all the statements in our logical
batch.  We can now do either of these (depending on how greedy we want to be in our client;
Future.traverse is preferred and nicer, Future.sequence is greedier and more resource intensive):
>> Future.sequence(groupByFirstReplica().values.map(_.execute())).map(_.flatten)
>> Future.traverse(groupByFirstReplica().values) { _.execute() }.map(_.flatten)
>> We get back Future[Iterable[ResultSet]] - this future completes when the logical
batch's sub-batches have all completed.
>> 
>> Note that with the DataStax Java driver, for the above to succeed in its intent, the statements need to be prepared statements (for st.getRoutingKey to return non-null), and either the keyspace has to be fully qualified in the CQL, or you have to have set the correct keyspace when you created the connection (for st.getKeyspace to return non-null). Otherwise the values given to meta.getReplicas will fail to resolve a primary host, which results in doing token-unaware batches (i.e. you'll get back a Map(null -> allStatements)). However, those same criteria are required for single statements to be token aware.
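>> 
>> A minimal sketch of meeting those criteria (the contact point, keyspace, and table are made up for illustration):
>> 
>> import com.datastax.driver.core.Cluster
>> 
>> val cluster = Cluster.builder().addContactPoint("127.0.0.1").build()
>> // Connecting with an explicit keyspace lets st.getKeyspace resolve
>> // even when the CQL string doesn't qualify the table.
>> val session = cluster.connect("my_keyspace")
>> 
>> // Preparing (rather than executing a raw string) is what lets
>> // st.getRoutingKey return non-null once the partition key is bound.
>> val ps = session.prepare("INSERT INTO events (id, payload) VALUES (?, ?)")
>> val bound = ps.bind("some-id", "some-payload")
>> assert(bound.getRoutingKey != null && bound.getKeyspace != null)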
>> 
> 
> This is excellent stuff; my only concern with primary replicas is for people with uneven partitions, and the occasional stupidly fat one. I’d rather spread those writes around the other replicas instead of beating up the primary one. However, for a well modeled partition key the approach you outline is probably optimal.
> 
>> 
>> 
>> 
>> On Fri, Sep 25, 2015 at 7:30 AM, Ryan Svihla <rs@foundev.pro> wrote:
>> Generally this is all correct, but I cannot emphasize enough how much this “just depends”; today I generally move people to async inserts first before trying to micro-optimize. Some things to keep in mind:
>> 
>> - compaction usually is the limiter for most clusters, so the difference between async versus unlogged batch ends up being minor or worse, non-existent, because the hardware and data model combination results in compaction being the main throttle.
>> 
>> - if you add in token awareness to your batch, you’ve basically eliminated the primary complaint of using unlogged batches, so why not do that. When I was at DataStax I made some similar suggestions for token-aware batching after seeing the perf improvements with Spark writes using unlogged batch. Several others did as well, so I’m not the first one with this idea.
>> 
>> - write size makes, in my experience, the largest difference BY FAR in which is faster, and the statement count is largely irrelevant compared to the total payload size. Depending on the hardware etc., a good rule of thumb is that writes below 1k bytes tend to get really inefficient and writes over 100k tend to slow down total throughput. I’ll re-emphasize: this magic number has been different on almost every cluster I’ve tuned.
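>> 
>> To make the write-size point concrete, here is a rough byte-budget chunker; sizeOf is a hypothetical helper standing in for however you estimate a statement’s serialized payload, and the 50k default is an arbitrary point between those two thresholds:
>> 
>> import com.datastax.driver.core.Statement
>> import scala.collection.mutable
>> 
>> def chunkByPayload(statements: Seq[Statement], sizeOf: Statement => Long,
>>                    targetBytes: Long = 50 * 1024): Seq[Seq[Statement]] = {
>>   val chunks = mutable.ArrayBuffer(mutable.ArrayBuffer.empty[Statement])
>>   var running = 0L
>>   for (st <- statements) {
>>     val sz = sizeOf(st)
>>     // start a new chunk once the byte budget would be exceeded
>>     if (running + sz > targetBytes && chunks.last.nonEmpty) {
>>       chunks += mutable.ArrayBuffer.empty[Statement]
>>       running = 0L
>>     }
>>     chunks.last += st
>>     running += sz
>>   }
>>   chunks.map(_.toList).toList
>> }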
>> 
>> In summary, all this means is: too small or too large writes are slow, and unlogged batches may involve some extra hops; if you eliminate the extra hops through token awareness, then it just comes down to write size optimization.
>> 
>>> On Sep 24, 2015, at 5:18 PM, Eric Stevens <mightye@gmail.com> wrote:
>>> 
>>> > I got side-tracked into some quick benchmarks and stumbled on the observation of unlogged inserts being *A LOT* faster than the async counterparts.
>>> 
>>> My own testing agrees very strongly with this.  When this topic came up on this
list before, there was a concern that batch coordination produces GC pressure in your cluster
because you're involving nodes which aren't strictly speaking necessary to be involved.  
>>> 
>>> Our own testing shows some small impact on this front, but really lightweight GC tuning mitigated the effects by putting a little more room in Xmn (if you're still on the CMS garbage collector). On G1GC (which is what we run in production) we weren't able to measure a difference.
>>> 
>>> Our testing shows data loads being as much as 5x to 8x faster when using small
concurrent batches over using single statements concurrently.  We tried three different concurrency
models.
>>> 
>>> To save on coordinator overhead, we group the statements in our "batch" by replica
(using the functionality exposed by the DataStax Java driver), and do essentially token aware
batching.  This still has a small amount of additional coordinator overhead (since the data
size of the unit of work is larger, and sits in memory in the coordinator longer).  We've
been running this way successfully for months with sustained rates north of 50,000 mutates
per second.  We burst much higher.
>>> 
>>> Through trial and error we determined we got diminishing returns in the realm
of 100 statements per token-aware batch.  It looks like your own data bears that out as well.
 I'm sure that's workload dependent though.
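>>> 
>>> As a sketch of that cap (perReplica standing in for statements already grouped by primary replica, and CQLBatch for the batch wrapper quoted above), it is just one more grouped() pass:
>>> 
>>> // perReplica (hypothetical): Map[Host, Seq[Statement]] of statements
>>> // already grouped by primary replica; cap each group at ~100.
>>> val cappedBatches: Iterable[CQLBatch] =
>>>   perReplica.values.flatMap(_.grouped(100).map(CQLBatch(_)))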
>>> 
>>> I've been disagreed with on this topic in this list in the past despite the numbers
I was able to post.  Nobody has shown me numbers (nor anything else concrete) that contradict
my position though, so I stand by it.  There's no question in my mind, if your mutates are
of any significant volume and you care about the performance of them, token aware unlogged
batching is the right strategy.  When we reduce our batch sizes or switch to single async
statements, we fall over immediately.  
>>> 
>>> On Tue, Sep 22, 2015 at 7:54 AM, Gerard Maas <gerard.maas@gmail.com> wrote:
>>> General advice advocates for individual async inserts as the fastest way to insert
data into Cassandra. Our insertion mechanism is based on that model and recently we have been
evaluating performance, looking to measure and optimize our ingestion rate.
>>> 
>>> I got side-tracked into some quick benchmarks and stumbled on the observation of unlogged inserts being *A LOT* faster than the async counterparts.
>>> 
>>> In our tests, unlogged batch shows increased throughput and lower cluster CPU
usage, so I'm wondering where the tradeoff might be.
>>> 
>>> I compiled those observations in this document that I'm sharing and opening up for comments. Are we observing some artifact, or should we set the record straight on unlogged batches as a way to achieve better insertion throughput?
>>> 
>>> https://docs.google.com/document/d/1qSIJ46cmjKggxm1yxboI-KhYJh1gnA6RK-FkfUg6FrI
>>> 
>>> Let me know.
>>> 
>>> Kind regards, 
>>> 
>>> Gerard.
>>> 
>> 
>> Regards,
>> 
>> Ryan Svihla
>> 
>> 
> 
> 
> 

