Mailing-List: contact commits-help@cassandra.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@cassandra.apache.org
Date: Wed, 8 Jul 2015 01:05:04 +0000 (UTC)
From: "Stefania (JIRA)" <jira@apache.org>
To: commits@cassandra.apache.org
Message-ID: <JIRA.12779092.1425394853000.125177.1436317504931@Atlassian.JIRA>
In-Reply-To: <JIRA.12779092.1425394853000@Atlassian.JIRA>
References: <JIRA.12779092.1425394853000@Atlassian.JIRA>
 <JIRA.12779092.1425394853347@arcas>
Subject: [jira] [Commented] (CASSANDRA-8894) Our default buffer size for
 (uncompressed) buffered reads should be smaller, and based on the expected
 record size
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable


    [ https://issues.apache.org/jira/browse/CASSANDRA-8894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14617778#comment-14617778 ]

Stefania commented on CASSANDRA-8894:
-------------------------------------

Yes, I assumed a normal distribution of the record size. However, your suggestion of a uniform distribution of the _start position_ within a page is more straightforward. Let's start with that: {{size}} = 95th percentile of the record size, chance of crossing a page boundary = {{(size % 4096) / 4096}}.
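
To make the estimate concrete, here is a minimal Java sketch. It is only an illustration, not the patch: the class name, the {{bufferSize}} helper, and the idea of adding one extra page once the chance reaches a threshold are assumptions for the example. Note that the division has to be done in floating point, otherwise it always yields 0.

```java
// Illustrative sketch only; all names here are hypothetical, not Cassandra API.
final class PageCrossSketch {
    static final int PAGE_SIZE = 4096;

    /** Chance of crossing, per the formula above: (size % 4096) / 4096. */
    static double crossingChance(long size) {
        // Cast to double so the division is not truncated to zero.
        return (size % PAGE_SIZE) / (double) PAGE_SIZE;
    }

    /**
     * Assumed policy: round the record size up to whole pages, then add one
     * more page when the chance of crossing reaches the given threshold.
     */
    static long bufferSize(long size, double threshold) {
        long pages = Math.max(1, (size + PAGE_SIZE - 1) / PAGE_SIZE);
        if (crossingChance(size) >= threshold) {
            pages++;
        }
        return pages * PAGE_SIZE;
    }

    public static void main(String[] args) {
        long[] sizes = {140, 4196, 6144};
        for (long size : sizes) {
            System.out.println(size + " bytes: chance=" + crossingChance(size)
                    + ", buffer=" + bufferSize(size, 0.1));
        }
    }
}
```

With a hypothetical threshold of 0.1, a 140-byte record keeps a single 4096-byte page, while a 6144-byte record (chance 0.5) gets a third page.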

Noted about adding the size percentile and the chance-of-crossing threshold to the config without mentioning them in the yaml. I'll also add a *global* setting to indicate whether the data directories are on SSD or spinning disk; this one will instead be in the yaml.

> Our default buffer size for (uncompressed) buffered reads should be smaller, and based on the expected record size
> ------------------------------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-8894
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-8894
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Benedict
>            Assignee: Stefania
>              Labels: benedict-to-commit
>             Fix For: 3.x
>
>
> A large contributor to slower buffered reads than mmapped is likely that we read a full 64Kb at once, when average record sizes may be as low as 140 bytes on our stress tests. The TLB has only 128 entries on a modern core, and each read will touch 32 of these, meaning we will almost never hit the TLB, and will incur at least 30 unnecessary misses each time (as well as the other costs of larger than necessary accesses). When working with an SSD there is little to no benefit in reading more than 4Kb at once, and in either case reading more data than we need is wasteful. So, I propose selecting a buffer size that is the next larger power of 2 than our average record size (with a minimum of 4Kb), so that we expect to read a record in one operation. I also propose that we create a pool of these buffers up-front, and that we ensure they are all exactly aligned to a virtual page, so that the source and target operations each touch exactly one virtual page per 4Kb of expected record size.
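
The sizing rule in the description can be sketched in Java as follows. This is a hedged illustration rather than the proposed patch: the class and method names are made up, "next larger power of 2" is read as the smallest power of two that is at least the record size, and the 64Kb upper cap is an assumption taken from the current default mentioned in the description.

```java
// Illustrative sketch only; names and the upper cap are assumptions.
final class BufferSizeSketch {
    static final int MIN_BUFFER_SIZE = 4096;   // 4Kb minimum from the description
    static final int MAX_BUFFER_SIZE = 65536;  // assumed cap: the current 64Kb default

    /** Smallest power of two >= recordSize, clamped to [4Kb, 64Kb]. */
    static int bufferSize(int recordSize) {
        if (recordSize <= MIN_BUFFER_SIZE) {
            return MIN_BUFFER_SIZE;
        }
        if (recordSize >= MAX_BUFFER_SIZE) {
            return MAX_BUFFER_SIZE;
        }
        // highestOneBit gives the largest power of two <= recordSize.
        int highest = Integer.highestOneBit(recordSize);
        return highest == recordSize ? recordSize : highest << 1;
    }

    public static void main(String[] args) {
        int[] sizes = {140, 5000, 8192, 8193, 100000};
        for (int size : sizes) {
            System.out.println(size + " -> " + bufferSize(size));
        }
    }
}
```

Under these assumptions a 140-byte average record maps to a single 4Kb buffer instead of 64Kb, which is the case the description is concerned with.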


--
This message was sent by Atlassian JIRA
(v6.3.4#6332)