cassandra-commits mailing list archives

From "Benedict (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CASSANDRA-8894) Our default buffer size for (uncompressed) buffered reads should be smaller, and based on the expected record size
Date Wed, 02 Sep 2015 07:55:45 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-8894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14726943#comment-14726943 ]

Benedict commented on CASSANDRA-8894:
-------------------------------------

bq. very overcomplicated 

This may well be, but I doubt it.

* Your tests only operate on SSDs; we still have many users on spinning rust, and we cannot harm them.
* Your tests only operate on very, very tiny partitions. That is no longer the norm, and as partitions grow larger than 4KB the performance of your simple approach will likely suffer.
* You must (by calculation over your results) still have readahead enabled, probably of ~16KB, so the test you've performed is somewhat arbitrary. Really we should be disabling readahead entirely on these systems, and Cassandra should be making sensible decisions about how much to read (a sketch of one way to hint this per file descriptor follows this list). Note, however, that I expect this modification would simply change the results of the simple test you ran, further reducing wasted IO.
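
To make the readahead point concrete: the device-level setting can be cleared on Linux with, e.g., {{blockdev --setra 0 <device>}}, and the same hint can be given per file descriptor from Java. Below is a minimal sketch of the latter, assuming JNA and a Linux libc; the class name is illustrative, and the constant value is Linux's POSIX_FADV_RANDOM from fcntl.h.

{code:java}
import com.sun.jna.Native;

// Illustrative sketch: bind posix_fadvise(2) via JNA and tell the kernel
// to expect random access on a file descriptor, which disables readahead
// for that fd and leaves read sizing entirely up to Cassandra.
final class ReadaheadHint
{
    static { Native.register("c"); } // bind native methods to libc

    private static final int POSIX_FADV_RANDOM = 1; // from fcntl.h on Linux

    private static native int posix_fadvise(int fd, long offset, long len, int advice);

    static void disableReadahead(int fd)
    {
        // offset = 0, len = 0 applies the advice to the whole file
        posix_fadvise(fd, 0, 0, POSIX_FADV_RANDOM);
    }
}
{code}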

bq. on a simple -stress test 

Generally speaking, if you want to challenge something as overcomplicated, you need to test it in a multitude of complex scenarios, to see whether the issues that complexity is intended to solve actually warrant it. This ticket is left open to do exactly that through further testing, and hopefully to extend the improvements to those cases where we can safely do so (_without damaging those we cannot_). Right now we're blocked on features being provided to CStar (specifically readahead configuration; [~EnigmaCurry]: any movement on that?).

Either way, our ethos is to try to surprise our operators as little as possible. There will undoubtedly be operators harmed by your patch, although I agree it is likely there are as many (perhaps more; I don't have any numbers) that would benefit. But this patch aims to deliver the benefit to those yours would, without harming those it wouldn't.

bq. and instead issue precise reads wherever it's possible.

We cannot issue precise reads; we don't know precisely how big anything is.

bq. This over-read is causing performance problems on every Cassandra 2.1 cluster that isn't 100% writes

Thankfully this is unlikely. Most users performing uncompressed reads use mmap, and those
performing compressed reads cannot be helped here. They need CASSANDRA-8895. That said, I
personally don't see a great deal of harm in constructing some version of your simple patch
to backport, but I don't want to get involved in the specifics of that, since without extensive
_real world_ testing the decision will be quite arbitrary.

> Our default buffer size for (uncompressed) buffered reads should be smaller, and based on the expected record size
> ------------------------------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-8894
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-8894
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Benedict
>            Assignee: Stefania
>              Labels: benedict-to-commit
>             Fix For: 3.0 alpha 1
>
>         Attachments: 8894_25pct.yaml, 8894_5pct.yaml, 8894_tiny.yaml
>
>
> A large contributor to buffered reads being slower than mmapped ones is likely that we read a full 64KB at once, when average record sizes may be as low as 140 bytes in our stress tests. The TLB has only 128 entries on a modern core, and each read will touch 32 of these (a 64KB read spans 16 4KB pages, for both the source and the target of the copy), meaning we will almost never hit the TLB, and will incur at least 30 unnecessary misses each time (as well as the other costs of larger-than-necessary accesses). When working with an SSD there is little to no benefit to reading more than 4KB at once, and in either case reading more data than we need is wasteful. So, I propose selecting a buffer size that is the next power of 2 larger than our average record size (with a minimum of 4KB), so that we expect to read each record in one operation. I also propose that we create a pool of these buffers up-front, and that we ensure they are all exactly aligned to a virtual page, so that the source and target operations each touch exactly one virtual page per 4KB of expected record size (see the sketch after this quote).
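
For concreteness, here is a minimal sketch of the two mechanisms proposed in the description. The class and method names are hypothetical, and obtaining a direct buffer's base address relies on the internal {{sun.nio.ch.DirectBuffer}} API:

{code:java}
import java.nio.ByteBuffer;

// Illustrative sketch only: size buffers to the next power of two above
// the average record size (floored at 4KB), and hand out direct buffers
// whose start address falls exactly on a 4KB virtual page boundary.
final class AlignedBuffers
{
    private static final int PAGE_SIZE = 4096;

    static int bufferSize(int avgRecordSize)
    {
        if (avgRecordSize <= PAGE_SIZE)
            return PAGE_SIZE;
        // next power of two >= avgRecordSize
        return Integer.highestOneBit(avgRecordSize - 1) << 1;
    }

    static ByteBuffer allocateAligned(int size)
    {
        // over-allocate by one page, then slice from the first page boundary;
        // reading the raw address requires an internal API on Java 8
        ByteBuffer raw = ByteBuffer.allocateDirect(size + PAGE_SIZE);
        long addr = ((sun.nio.ch.DirectBuffer) raw).address();
        int skip = (int) ((PAGE_SIZE - (addr & (PAGE_SIZE - 1))) & (PAGE_SIZE - 1));
        raw.position(skip);
        raw.limit(skip + size);
        return raw.slice();
    }
}
{code}

A pool built on these would allocate the buffers once up-front and recycle them across reads, rather than paying an allocation per read.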



