cassandra-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Jonathan Haddad <...@jonhaddad.com>
Subject Re: batch_size_warn_threshold_in_kb
Date Sat, 13 Dec 2014 18:07:11 GMT
On Sat Dec 13 2014 at 10:00:16 AM Eric Stevens <mightye@gmail.com> wrote:

> Isn't the net effect of coordination overhead incurred by batches
> basically the same as the overhead incurred by RoundRobin or other
> non-token-aware request routing?  As the cluster size increases, each node
> would coordinate the same percentage of writes in batches under token
> awareness as they would under a more naive single statement routing
> strategy.  If write volume per time unit is the same in both approaches,
> each node ends up coordinating the majority of writes under either strategy
> as the cluster grows.
>

If you're not token aware, there's extra coordinator overhead, yes.  If you
are token aware, not the case.  I'm operating under the assumption that
you'd want to be token aware, since I don't see a point in not doing so :)

Unfortunately my Scala isn't the best so I'm going to have to take a little
bit to wade through the code.

It may be useful to run cassandra-stress (it doesn't seem to have a mode
for batches) to get a baseline on non-batches.  I'm curious to know if you
get different numbers than the scala profiler.



>
> GC pressure in the cluster is a concern of course, as you observe.  But
> delta performance is *substantial* from what I can see.  As in the case
> where you're bumping up against retries, this will cause you to fall over
> much more rapidly as you approach your tipping point, but in a healthy
> cluster, it's the same write volume, just a longer tenancy in eden.  If
> reasonable sized batches are causing survivors, you're not far off from
> falling over anyway.
>
> On Sat, Dec 13, 2014 at 10:04 AM, Jonathan Haddad <jon@jonhaddad.com>
> wrote:
>
>> One thing to keep in mind is the overhead of a batch goes up as the
>> number of servers increases.  Talking to 3 is going to have a much
>> different performance profile than talking to 20.  Keep in mind that the
>> coordinator is going to be talking to every server in the cluster with a
>> big batch.  The amount of local writes will decrease as it owns a smaller
>> portion of the ring.  All you've done is add an extra network hop between
>> your client and where the data should actually be.  You also start to have
>> an impact on GC in a very negative way.
>>
>> Your point is valid about topology changes, but that's a relatively rare
>> occurrence, and the driver is notified pretty quickly, so I wouldn't
>> optimize for that case.
>>
>> Can you post your test code in a gist or something?  I can't really talk
>> about your benchmark without seeing it and you're basing your stance on the
>> premise that it is correct, which it may not be.
>>
>>
>>
>> On Sat Dec 13 2014 at 8:45:21 AM Eric Stevens <mightye@gmail.com> wrote:
>>
>>> You can seen what the partition key strategies are for each of the
>>> tables, test5 shows the least improvement.  The set (aid, end) should be
>>> unique, and bckt is derived from end.  Some of these layouts result in
>>> clustering on the same partition keys, that's actually tunable with the
>>> "~15 per bucket" reported (exact number of entries per bucket will vary but
>>> should have a mean of 15 in that run - it's an input parameter to my
>>> tests).  "test5" obviously ends up being exclusively unique partitions for
>>> each record.
>>>
>>> Your points about:
>>> 1) Failed batches having a higher cost than failed single statements
>>> 2) In my test, every node was a replica for all data.
>>>
>>> These are both very good points.
>>>
>>> For #1, since the worst case scenario is nearly twice fast in batches as
>>> its single statement equivalent, in terms of impact on the client, you'd
>>> have to be retrying half your batches before you broke even there (but of
>>> course those retries are not free to the cluster, so you probably make the
>>> performance tipping point approach a lot faster).  This alone may be cause
>>> to justify avoiding batches, or at least severely limiting their size (hey,
>>> that's what this discussion is about!).
>>>
>>> For #2, that's certainly a good point, for this test cluster, I should
>>> at least re-run with RF=1 so that proxying times start to matter.  If
>>> you're not using a token aware client or not using a token aware policy for
>>> whatever reason, this should even out though, no?  Each node will end up
>>> coordinating 1/(nodecount-rf+1) mutations, regardless of whether they are
>>> batched or single statements.  The DS driver is very careful to caution
>>> that the topology map it maintains makes no guarantees on freshness, so you
>>> may see a significant performance penalty in your client when the topology
>>> changes if you're depending on token aware routing as part of your
>>> performance requirements.
>>>
>>>
>>> I'm curious what your thoughts are on grouping statements by primary
>>> replica according to the routing policy, and executing unlogged batches
>>> that way (so that for token aware routing, all statements are executed on a
>>> replica, for others it'd make no difference).  Retries are still more
>>> expensive, but token aware proxying avoidance is still had.  It's pretty
>>> easy to do in Scala:
>>>
>>>   def groupByFirstReplica(statements: Iterable[Statement])(implicit
>>> session: Session): Map[Host, Seq[Statement]] = {
>>>     val meta = session.getCluster.getMetadata
>>>     statements.groupBy { st =>
>>>       meta.getReplicas(st.getKeyspace, st.getRoutingKey).iterator().next
>>>     }
>>>   }
>>>   val result =
>>> Future.traverse(groupByFirstReplica(statements).values).map(st =>
>>> newBatch(st).executeAsync())
>>>
>>>
>>> Let me get together my test code, it depends on some existing utilities
>>> we use elsewhere, such as implicit conversions between Google and Scala
>>> native futures.  I'll try to put this together in a format that's runnable
>>> for you in a Scala REPL console without having to resolve our internal
>>> dependencies.  This may not be today though.
>>>
>>> Also, @Ryan, I don't think that shuffling would make a difference for my
>>> above tests since as Jon observed, all my nodes were already replicas there.
>>>
>>>
>>> On Sat, Dec 13, 2014 at 7:37 AM, Ryan Svihla <rsvihla@datastax.com>
>>> wrote:
>>>
>>>> Also..what happens when you turn on shuffle with token aware?
>>>> http://www.datastax.com/drivers/java/2.1/com/datastax/driver/core/policies/TokenAwarePolicy.html
>>>>
>>>> On Sat, Dec 13, 2014 at 8:21 AM, Jonathan Haddad <jon@jonhaddad.com>
>>>> wrote:
>>>>>
>>>>> To add to Ryan's (extremely valid!) point, your test works because the
>>>>> coordinator is always a replica.  Try again using 20 (or 50) nodes.
>>>>> Batching works great at RF=N=3 because it always gets to write to local
and
>>>>> talk to exactly 2 other servers on every request.  Consider what happens
>>>>> when the coordinator needs to talk to 100 servers.  It's unnecessary
>>>>> overhead on the server side.
>>>>>
>>>>> To save network overhead, Cassandra 2.1 added support for response
>>>>> grouping (see
>>>>> http://www.datastax.com/dev/blog/cassandra-2-1-now-over-50-faster)
>>>>> which massively helps performance.  It provides the benefit of batches
but
>>>>> without the coordinator overhead.
>>>>>
>>>>> Can you post your benchmark code?
>>>>>
>>>>> On Sat Dec 13 2014 at 6:10:36 AM Jonathan Haddad <jon@jonhaddad.com>
>>>>> wrote:
>>>>>
>>>>>> There are cases where it can.  For instance, if you batch multiple
>>>>>> mutations to the same partition (and talk to a replica for that partition)
>>>>>> they can reduce network overhead because they're effectively a single
>>>>>> mutation in the eye of the cluster.  However, if you're not doing
that (and
>>>>>> most people aren't!) you end up putting additional pressure on the
>>>>>> coordinator because now it has to talk to several other servers.
 If you
>>>>>> have 100 servers, and perform a mutation on 100 partitions, you could
have
>>>>>> a coordinator that's
>>>>>>
>>>>>> 1) talking to every machine in the cluster and
>>>>>> b) waiting on a response from a significant portion of them
>>>>>>
>>>>>> before it can respond success or fail.  Any delay, from GC to a bad
>>>>>> disk, can affect the performance of the entire batch.
>>>>>>
>>>>>>
>>>>>> On Sat Dec 13 2014 at 4:17:33 AM Jack Krupansky <
>>>>>> jack@basetechnology.com> wrote:
>>>>>>
>>>>>>>   Jonathan and Ryan,
>>>>>>>
>>>>>>> Jonathan says “It is absolutely not going to help you if you're
>>>>>>> trying to lump queries together to reduce network & server
overhead - in
>>>>>>> fact it'll do the opposite”, but I would note that the CQL3
spec says “
>>>>>>> The BATCH statement ... serves several purposes: 1. It saves
>>>>>>> network round-trips between the client and the server (and sometimes
>>>>>>> between the server coordinator and the replicas) when batching
multiple
>>>>>>> updates.” Is the spec inaccurate? I mean, it seems in conflict
with your
>>>>>>> statement.
>>>>>>>
>>>>>>> See:
>>>>>>> https://cassandra.apache.org/doc/cql3/CQL.html
>>>>>>>
>>>>>>> I see the spec as gospel – if it’s not accurate, let’s
propose a
>>>>>>> change to make it accurate.
>>>>>>>
>>>>>>> The DataStax CQL doc is more nuanced: “Batching multiple statements
>>>>>>> can save network exchanges between the client/server and server
>>>>>>> coordinator/replicas. However, because of the distributed nature
of
>>>>>>> Cassandra, spread requests across nearby nodes as much as possible
to
>>>>>>> optimize performance. Using batches to optimize performance is
usually not
>>>>>>> successful, as described in Using and misusing batches section.
For
>>>>>>> information about the fastest way to load data, see "Cassandra:
Batch
>>>>>>> loading without the Batch keyword."”
>>>>>>>
>>>>>>> Maybe what we really need is a “client/driver-side batch”,
which is
>>>>>>> simply a way to collect “batches” of operations in the client/driver
and
>>>>>>> then let the driver determine what degree of batching and asynchronous
>>>>>>> operation is appropriate.
>>>>>>>
>>>>>>> It might also be nice to have an inquiry for the cluster as to
what
>>>>>>> batch size is most optimal for the cluster, like number of mutations
in a
>>>>>>> batch and number of simultaneous connections, and to have that
be dynamic
>>>>>>> based on overall cluster load.
>>>>>>>
>>>>>>> I would also note that the example in the spec has multiple inserts
>>>>>>> with different partition key values, which flies in the face
of the
>>>>>>> admonition to to refrain from using server-side distribution
of requests.
>>>>>>>
>>>>>>> At a minimum the CQL spec should make a more clear statement
of
>>>>>>> intent and non-intent for BATCH.
>>>>>>>
>>>>>>> -- Jack Krupansky
>>>>>>>
>>>>>>>  *From:* Jonathan Haddad <jon@jonhaddad.com>
>>>>>>> *Sent:* Friday, December 12, 2014 12:58 PM
>>>>>>> *To:* user@cassandra.apache.org ; Ryan Svihla <rsvihla@datastax.com>
>>>>>>> *Subject:* Re: batch_size_warn_threshold_in_kb
>>>>>>>
>>>>>>> The really important thing to really take away from Ryan's original
>>>>>>> post is that batches are not there for performance.  The only
case I
>>>>>>> consider batches to be useful for is when you absolutely need
to know that
>>>>>>> several tables all get a mutation (via logged batches).  The
use case for
>>>>>>> this is when you've got multiple tables that are serving as different
views
>>>>>>> for data.  It is absolutely not going to help you if you're trying
to lump
>>>>>>> queries together to reduce network & server overhead - in
fact it'll do the
>>>>>>> opposite.  If you're trying to do that, instead perform many
async
>>>>>>> queries.  The overhead of batches in cassandra is significant
and you're
>>>>>>> going to hit a lot of problems if you use them excessively (timeouts
/
>>>>>>> failures).
>>>>>>>
>>>>>>> tl;dr: you probably don't want batch, you most likely want many
>>>>>>> async calls
>>>>>>>
>>>>>>> On Thu Dec 11 2014 at 11:15:00 PM Mohammed Guller <
>>>>>>> mohammed@glassbeam.com> wrote:
>>>>>>>
>>>>>>>>  Ryan,
>>>>>>>>
>>>>>>>> Thanks for the quick response.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> I did see that jira before posting my question on this list.
>>>>>>>> However, I didn’t see any information about why 5kb+ data
will cause
>>>>>>>> instability. 5kb or even 50kb seems too small. For example,
if each
>>>>>>>> mutation is 1000+ bytes, then with just 5 mutations, you
will hit that
>>>>>>>> threshold.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> In addition, Patrick is saying that he does not recommend
more than
>>>>>>>> 100 mutations per batch. So why not warn users just on the
# of mutations
>>>>>>>> in a batch?
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Mohammed
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> *From:* Ryan Svihla [mailto:rsvihla@datastax.com]
>>>>>>>> *Sent:* Thursday, December 11, 2014 12:56 PM
>>>>>>>> *To:* user@cassandra.apache.org
>>>>>>>> *Subject:* Re: batch_size_warn_threshold_in_kb
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Nothing magic, just put in there based on experience. You
can find
>>>>>>>> the story behind the original recommendation here
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> https://issues.apache.org/jira/browse/CASSANDRA-6487
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Key reasoning for the desire comes from Patrick McFadden:
>>>>>>>>
>>>>>>>>
>>>>>>>> "Yes that was in bytes. Just in my own experience, I don't
>>>>>>>> recommend more than ~100 mutations per batch. Doing some
quick math I came
>>>>>>>> up with 5k as 100 x 50 byte mutations.
>>>>>>>>
>>>>>>>> Totally up for debate."
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> It's totally changeable, however, it's there in no small
part
>>>>>>>> because so many people confuse the BATCH keyword as a performance
>>>>>>>> optimization, this helps flag those cases of misuse.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, Dec 11, 2014 at 2:43 PM, Mohammed Guller <
>>>>>>>> mohammed@glassbeam.com> wrote:
>>>>>>>>
>>>>>>>> Hi –
>>>>>>>>
>>>>>>>> The cassandra.yaml file has property called *batch_size_warn_threshold_in_kb.
>>>>>>>> *
>>>>>>>>
>>>>>>>> The default size is 5kb and according to the comments in
the yaml
>>>>>>>> file, it is used to log WARN on any batch size exceeding
this value in
>>>>>>>> kilobytes. It says caution should be taken on increasing
the size of this
>>>>>>>> threshold as it can lead to node instability.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Does anybody know the significance of this magic number 5kb?
Why
>>>>>>>> would a higher number (say 10kb) lead to node instability?
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Mohammed
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>>
>>>>>>>> [image: datastax_logo.png] <http://www.datastax.com/>
>>>>>>>>
>>>>>>>> Ryan Svihla
>>>>>>>>
>>>>>>>> Solution Architect
>>>>>>>>
>>>>>>>>
>>>>>>>> [image: twitter.png] <https://twitter.com/foundev>[image:
>>>>>>>> linkedin.png] <http://www.linkedin.com/pub/ryan-svihla/12/621/727/>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> DataStax is the fastest, most scalable distributed database
>>>>>>>> technology, delivering Apache Cassandra to the world’s
most innovative
>>>>>>>> enterprises. Datastax is built to be agile, always-on, and
predictably
>>>>>>>> scalable to any size. With more than 500 customers in 45
countries, DataStax
>>>>>>>> is the database technology and transactional backbone of
choice for the
>>>>>>>> worlds most innovative companies such as Netflix, Adobe,
Intuit, and eBay.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>
>>>> --
>>>>
>>>> [image: datastax_logo.png] <http://www.datastax.com/>
>>>>
>>>> Ryan Svihla
>>>>
>>>> Solution Architect
>>>>
>>>> [image: twitter.png] <https://twitter.com/foundev> [image:
>>>> linkedin.png] <http://www.linkedin.com/pub/ryan-svihla/12/621/727/>
>>>>
>>>> DataStax is the fastest, most scalable distributed database technology,
>>>> delivering Apache Cassandra to the world’s most innovative enterprises.
>>>> Datastax is built to be agile, always-on, and predictably scalable to any
>>>> size. With more than 500 customers in 45 countries, DataStax is the
>>>> database technology and transactional backbone of choice for the worlds
>>>> most innovative companies such as Netflix, Adobe, Intuit, and eBay.
>>>>
>>>>
>>>
>

Mime
View raw message