From: Ariel Weisberg
To: user@cassandra.apache.org
Subject: Re: Best partition type for Cassandra with JBOD
Date: Fri, 16 May 2014 15:46:08 -0400

Hi,
 
Recommending nobarrier (mount option barrier=0) when you don't know whether a non-volatile cache is in play is probably not the way to go. A non-volatile cache will typically ignore write barriers anyway if the block device is configured to cache writes.
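
To see whether barriers are already turned off on a node before changing anything, a minimal Python sketch along these lines may help. It assumes /proc/mounts-style input (device, mount point, filesystem type, options); the sample mount table, device names, and mount points are made up for illustration, and the exact option spelling a kernel reports can differ by filesystem and version.

```python
def barrier_disabled(options):
    """Return True if a comma-separated mount option string turns write barriers off."""
    opts = options.split(",")
    return "nobarrier" in opts or "barrier=0" in opts


def mounts_without_barriers(mounts_text):
    """Return (device, mount_point) pairs for /proc/mounts-style lines with barriers off."""
    found = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        device, mount_point, _fstype, options = fields[:4]
        if barrier_disabled(options):
            found.append((device, mount_point))
    return found


# Hypothetical mount table; on a real host you would read /proc/mounts instead.
SAMPLE = """\
/dev/sda1 / ext4 rw,relatime,data=ordered 0 0
/dev/sdb1 /var/lib/cassandra/data1 ext4 rw,noatime,barrier=0 0 0
/dev/sdc1 /var/lib/cassandra/data2 xfs rw,noatime,nobarrier 0 0
"""

print(mounts_without_barriers(SAMPLE))
```

On a live machine the same function could be fed `open("/proc/mounts").read()`.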
 
I am also skeptical you will see a boost in performance. Applications that want to defer and batch writes won't emit write barriers frequently, and when they do it's because the data has to be there. Filesystems depend on write barriers, although it is surprisingly hard to get a reordering that is really bad because of the way journals are managed.
 
Cassandra uses log-structured storage and supports asynchronous periodic group commit, so it doesn't need to emit write barriers frequently.
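
To make the group commit point concrete, here is a toy Python model, not Cassandra's actual implementation. The class name, the 10-second period, and the 25 ms write interval are all assumptions chosen for illustration, and the sync is triggered from the append path rather than from a background timer as a real system would likely do. The counter stands in for the number of fsync calls (and hence write barriers) the log would issue.

```python
class GroupCommitLog:
    """Toy append-only log that batches many appends into one simulated sync."""

    def __init__(self, sync_period_ms):
        self.sync_period_ms = sync_period_ms
        self.pending = []       # appended but not yet durable
        self.durable = []       # covered by a completed sync
        self.sync_count = 0     # stands in for the number of fsyncs / barriers
        self.last_sync_ms = 0

    def append(self, record, now_ms):
        # The write is acknowledged right away; durability waits for the next sync.
        self.pending.append(record)
        if now_ms - self.last_sync_ms >= self.sync_period_ms:
            self.sync(now_ms)

    def sync(self, now_ms):
        # One sync makes every pending record durable at once.
        if self.pending:
            self.durable.extend(self.pending)
            self.pending.clear()
            self.sync_count += 1
        self.last_sync_ms = now_ms


log = GroupCommitLog(sync_period_ms=10_000)
for i in range(1000):
    log.append(f"mutation-{i}", now_ms=i * 25)  # one write every 25 ms

print(log.sync_count, len(log.durable), len(log.pending))  # prints: 2 801 199
```

A thousand writes cost two syncs in this model, which is why removing barriers is unlikely to change much for this kind of workload.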
 
Setting read ahead to zero on an SSD is necessary to get the maximum number of random reads, but it will also disable prefetching for sequential reads. You need a lot less prefetching with an SSD due to the much faster response time, but it's still many microseconds.
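
For checking the current read ahead before changing it, a small Python sketch follows. It assumes the Linux sysfs layout (/sys/block/DEVICE/queue/read_ahead_kb) and relies on blockdev --setra counting in 512-byte sectors. The device name sdb and the 128 KiB value are placeholders, and the demo runs against a temporary directory that mimics sysfs so it does not touch real devices.

```python
import os
import tempfile

SECTOR_BYTES = 512  # blockdev --setra / --getra count in 512-byte sectors


def sectors_to_kib(sectors):
    """Convert a blockdev-style read-ahead value (sectors) to KiB."""
    return sectors * SECTOR_BYTES // 1024


def read_ahead_kib(device, sys_block="/sys/block"):
    """Read the current read-ahead setting for a block device, in KiB."""
    path = os.path.join(sys_block, device, "queue", "read_ahead_kb")
    with open(path) as f:
        return int(f.read().strip())


# Demo against a throwaway directory laid out like sysfs (hypothetical device "sdb").
with tempfile.TemporaryDirectory() as fake_sys:
    queue_dir = os.path.join(fake_sys, "sdb", "queue")
    os.makedirs(queue_dir)
    with open(os.path.join(queue_dir, "read_ahead_kb"), "w") as f:
        f.write("128\n")
    print(read_ahead_kib("sdb", sys_block=fake_sys))  # 128
    print(sectors_to_kib(256))                        # 128
```

On a real host, calling read_ahead_kib("sdb") with the default sys_block would read the live value.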
 
Someone with more Cassandra-specific knowledge can probably give better advice as to when a non-zero read ahead makes sense with Cassandra. This is something that may be workload specific as well.
 
Regards,
Ariel
 
On Fri, May 16, 2014, at 01:55 PM, Kevin Burton wrote:
That and nobarrier… and probably noop for the scheduler if using SSD and setting readahead to zero...
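
A quick way to see which I/O scheduler a device is using is to parse /sys/block/DEVICE/queue/scheduler, where the kernel lists the available schedulers and marks the active one in square brackets. The Python sketch below only parses that text; the function name and the sample strings are assumptions for illustration.

```python
def active_scheduler(scheduler_text):
    """Return the scheduler name shown in [brackets], or None if none is marked."""
    for token in scheduler_text.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    return None


# Sample contents in the format the kernel uses for queue/scheduler.
print(active_scheduler("noop deadline [cfq]"))  # cfq
print(active_scheduler("[noop] deadline cfq"))  # noop
```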
 
 
On Fri, May 16, 2014 at 10:29 AM, James Campbell <james@breachintelligence.com> wrote:

Hi all,

What partition type is best/most commonly used for a multi-disk JBOD setup running Cassandra on CentOS 64bit?

 

The DataStax production server guidelines recommend XFS for data partitions, saying, “Because Cassandra can use almost half your disk space for a single file, use XFS when using large disks, particularly if using a 32-bit kernel. XFS file size limits are 16TB max on a 32-bit kernel, and essentially unlimited on 64-bit.”

 

However, the same document also notes that “Maximum recommended capacity for Cassandra 1.2 and later is 3 to 5TB per node,” which makes me think >16TB file sizes would be irrelevant (especially when not using RAID to create a single large volume). What has been the experience of this group?

 

I also noted that the guidelines don’t mention setting noatime and nodiratime flags in the fstab for data volumes, but I wonder if that’s a common practice.
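
For reference, a JBOD layout with one fstab entry per data disk could be generated with a Python sketch like the one below. The device names, mount points, and the choice of XFS with noatime are assumptions for illustration, not a recommendation from the guidelines. On Linux, mount(8) documents nodiratime as implied by noatime, so listing noatime alone should cover both.

```python
def fstab_entry(device, mount_point, fstype="xfs", options=("defaults", "noatime")):
    """Format one fstab line: device, mount point, type, options, dump, pass."""
    return f"{device} {mount_point} {fstype} {','.join(options)} 0 0"


# Hypothetical JBOD disks, one Cassandra data directory per disk.
disks = [
    ("/dev/sdb1", "/var/lib/cassandra/data1"),
    ("/dev/sdc1", "/var/lib/cassandra/data2"),
]

for device, mount_point in disks:
    print(fstab_entry(device, mount_point))
```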

 
James
 
 
 
--


Founder/CEO Spinn3r.com
Location: San Francisco, CA
Skype: burtonator
… or check out my Google+ profile

War is peace. Freedom is slavery. Ignorance is strength. Corporations are people.

