cassandra-commits mailing list archives

From "Benedict (JIRA)" <>
Subject [jira] [Commented] (CASSANDRA-8894) Our default buffer size for (uncompressed) buffered reads should be smaller, and based on the expected record size
Date Mon, 06 Jul 2015 11:07:05 GMT


Benedict commented on CASSANDRA-8894:

So, I've been thinking a little more on the maths, and I'm no longer convinced the \* 4 makes
as much sense as I thought at first (when I hadn't thought it through fully).

Half of the multiple was to account for non-average-sized rows; however, we can instead pick,
say, the 95th percentile from the EstimatedHistogram.
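To make the percentile idea concrete, here is a toy sketch of a percentile query over a bucketed size histogram, standing in for Cassandra's EstimatedHistogram (the class name and bucket layout here are illustrative, not the real API): walk the cumulative counts until we pass the target fraction and return that bucket's upper bound.

```java
// Toy stand-in for EstimatedHistogram's percentile query.
// bucketOffsets[i] is the upper bound (in bytes) of bucket i;
// counts[i] is how many records fell into that bucket.
public class PercentileSketch {
    static long percentile(long[] bucketOffsets, long[] counts, double p) {
        long total = 0;
        for (long c : counts)
            total += c;
        // Smallest cumulative count that covers fraction p of records.
        long target = (long) Math.ceil(p * total);
        long seen = 0;
        for (int i = 0; i < counts.length; i++) {
            seen += counts[i];
            if (seen >= target)
                return bucketOffsets[i];
        }
        return bucketOffsets[bucketOffsets.length - 1];
    }

    public static void main(String[] args) {
        // 100 records: most are small, a tail reaches 1024 bytes.
        long[] offsets = {128, 256, 512, 1024};
        long[] counts = {50, 30, 15, 5};
        System.out.println(percentile(offsets, counts, 0.95)); // 512
    }
}
```

Sizing the buffer to the 95th percentile rather than the mean means only ~5% of reads need a second I/O, without paying a \* 4 blowup on every read.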

The other half was to ensure we don't cross a page boundary, but this actually won't achieve
that at all. For this we need to double the _number of pages_. I think here we need to consider
having different strategies for spinning disks vs SSDs, and introducing a yaml property to
tell us which to optimise for. For spinning disks, we probably always want to read at least
one extra page. For SSDs, we probably want to read an extra page only if there is, say, a
>10% chance of our read crossing a page boundary; otherwise we may as well do the extra
read only when it proves necessary.
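The SSD heuristic above can be sketched as follows. Assuming read offsets are uniformly distributed within a page, a record of s bytes (s ≤ 4096) straddles a page boundary with probability (s - 1) / 4096; the helper names and the 10% threshold are illustrative, not actual Cassandra code.

```java
// Sketch of the spinning-disk vs SSD prefetch decision described above.
public class ReadAheadHeuristic {
    static final int PAGE_SIZE = 4096;

    // Probability that a read of recordSize bytes, starting at a
    // uniformly random offset within a page, crosses a page boundary.
    static double crossProbability(int recordSize) {
        if (recordSize >= PAGE_SIZE)
            return 1.0;
        return (recordSize - 1) / (double) PAGE_SIZE;
    }

    static boolean readExtraPage(int recordSize, boolean spinningDisk) {
        if (spinningDisk)
            return true; // seeks dominate: always prefetch one extra page
        // SSD: prefetch only when a boundary crossing is reasonably likely.
        return crossProbability(recordSize) > 0.10;
    }

    public static void main(String[] args) {
        System.out.println(readExtraPage(140, false));  // ~3% chance: false
        System.out.println(readExtraPage(1024, false)); // ~25% chance: true
    }
}
```

For the 140-byte records seen in stress tests, the crossing probability is only ~3%, so on an SSD we would skip the extra page and eat the occasional second read.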

Whether or not we want to expose these knobs to the users is a good question. They're probably
pretty useful knobs, but also pretty unlikely to be tweaked by almost anybody. So if we do,
we should probably not put them in the yaml file by default.

WDYT [~stefania]?

To make matters more unpleasant for you, it would be helpful to run a number of performance
comparisons. But trunk simply is not in a fit state for this kind of comparison right now,
so we may need to rebase this against 2.2 (even though we won't release it there) so that
we can tease out some performance characteristics. We can repeat this work at a later date
once trunk has settled down, if we so desire, as duplicating the cstar benchmarks should be
quite easy; in any case this work should be pretty orthogonal to trunk tuning.

It might be good to see the behaviour of unpatched vs your current patch vs my proposed changes,
for different ranges of row sizes (say, all tiny; exponential distribution with size range
from 5% of a page to a few pages; and from say 25% of a page to a few pages). Particularly
on an SSD machine. We can keep the profiles pretty simple by just making each partition a
blob of random size with the desired range.

> Our default buffer size for (uncompressed) buffered reads should be smaller, and based
on the expected record size
> ------------------------------------------------------------------------------------------------------------------
>                 Key: CASSANDRA-8894
>                 URL:
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Benedict
>            Assignee: Stefania
>              Labels: benedict-to-commit
>             Fix For: 3.x
> A large contributor to slower buffered reads than mmapped is likely that we read a full
64Kb at once, when average record sizes may be as low as 140 bytes on our stress tests. The
TLB has only 128 entries on a modern core, and each read will touch 32 of these, meaning we
will almost never hit the TLB, and will incur at least 30 unnecessary
misses each time (as well as the other costs of larger than necessary accesses). When working
with an SSD there is little to no benefit reading more than 4Kb at once, and in either case
reading more data than we need is wasteful. So, I propose selecting a buffer size that is
the next larger power of 2 than our average record size (with a minimum of 4Kb), so that we
can expect to complete each read in one operation. I also propose that we create a pool of these buffers up-front,
and that we ensure they are all exactly aligned to a virtual page, so that the source and
target operations each touch exactly one virtual page per 4Kb of expected record size.
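The sizing rule in the description above can be sketched as a couple of lines of bit arithmetic, plus a page-aligned allocation. This is an illustration, not the actual Cassandra implementation: the class and method names are hypothetical, and the aligned allocation uses `ByteBuffer.alignedSlice` (JDK 9+) by over-allocating one extra page.

```java
import java.nio.ByteBuffer;

// Sketch of the proposed buffer sizing: next power of 2 above the
// expected record size, floored at one 4Kb virtual page.
public class BufferSizing {
    static final int MIN_SIZE = 4096; // one virtual page

    static int bufferSize(int expectedRecordSize) {
        if (expectedRecordSize <= MIN_SIZE)
            return MIN_SIZE;
        // Smallest power of two >= expectedRecordSize.
        return Integer.highestOneBit(expectedRecordSize - 1) << 1;
    }

    // Page-aligned direct buffer: over-allocate by one page, then slice
    // at the first page boundary. Assumes size is a multiple of MIN_SIZE
    // (bufferSize() always returns one), so the rounded-down limit still
    // leaves at least `size` bytes.
    static ByteBuffer allocateAligned(int size) {
        return ByteBuffer.allocateDirect(size + MIN_SIZE).alignedSlice(MIN_SIZE);
    }

    public static void main(String[] args) {
        System.out.println(bufferSize(140));  // 4096: floor applies
        System.out.println(bufferSize(4097)); // 8192: rounds up
        System.out.println(allocateAligned(bufferSize(140)).capacity() >= 4096);
    }
}
```

With the buffer both sized to a power of two and aligned to a page boundary, each read touches exactly one page per 4Kb of expected record size, which is what keeps the TLB footprint down.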

This message was sent by Atlassian JIRA
