From: Bryan Talbot
To: "user@cassandra.apache.org"
Date: Mon, 19 May 2014 11:53:51 -0700
Subject: Re: Best partition type for Cassandra with JBOD

For XFS, using noatime and nodiratime isn't really useful either.

http://xfs.org/index.php/XFS_FAQ#Q:_Is_using_noatime_or.2Fand_nodiratime_at_mount_time_giving_any_performance_benefits_in_xfs_.28or_not_using_them_performance_decrease.29.3F

On Sat, May 17, 2014 at 7:52 AM, James Campbell <james@breachintelligence.com> wrote:

> Thanks for the thoughts!
>
> On May 16, 2014 4:23 PM, Ariel Weisberg wrote:
>
> Hi,
>
> Recommending nobarrier (mount option barrier=0) when you don't know if a
> non-volatile cache is in play is probably not the way to go. A non-volatile
> cache will typically ignore write barriers if a given block device is
> configured to cache writes anyway.
>
> I am also skeptical you will see a boost in performance. Applications that
> want to defer and batch writes won't emit write barriers frequently, and
> when they do it's because the data has to be there. Filesystems depend on
> write barriers, although it is surprisingly hard to get a reordering that is
> really bad because of the way journals are managed.
>
> Cassandra uses log-structured storage and supports asynchronous periodic
> group commit, so it doesn't need to emit write barriers frequently.
>
> Setting read ahead to zero on an SSD is necessary to get the maximum
> number of random reads, but will also disable prefetching for sequential
> reads. You need a lot less prefetching with an SSD due to the much faster
> response time, but it's still many microseconds.
>
> Someone with more Cassandra-specific knowledge can probably give better
> advice as to when a non-zero read ahead makes sense with Cassandra. This
> may be workload-specific as well.
>
> Regards,
> Ariel
>
> On Fri, May 16, 2014, at 01:55 PM, Kevin Burton wrote:
>
> That and nobarrier… and probably noop for the scheduler if using SSD and
> setting readahead to zero...
>
> On Fri, May 16, 2014 at 10:29 AM, James Campbell <james@breachintelligence.com> wrote:
>
> Hi all,
>
> What partition type is best/most commonly used for a multi-disk JBOD setup
> running Cassandra on CentOS 64bit?
>
> The DataStax production server guidelines recommend XFS for data
> partitions, saying, “Because Cassandra can use almost half your disk space
> for a single file, use XFS when using large disks, particularly if using a
> 32-bit kernel. XFS file size limits are 16TB max on a 32-bit kernel, and
> essentially unlimited on 64-bit.”
>
> However, the same document also notes that “Maximum recommended capacity
> for Cassandra 1.2 and later is 3 to 5TB per node,” which makes me think
> >16TB file sizes would be irrelevant (especially when not using RAID to
> create a single large volume). What has been the experience of this group?
>
> I also noted that the guidelines don’t mention setting noatime and
> nodiratime flags in the fstab for data volumes, but I wonder if that’s a
> common practice.
>
> James
>
> --
> Founder/CEO Spinn3r.com
> Location: *San Francisco, CA*
> Skype: *burtonator*
> blog: http://burtonator.wordpress.com
> … or check out my Google+ profile
>
> War is peace. Freedom is slavery. Ignorance is strength. Corporations
> are people.

-- 
Bryan Talbot
Architect / Platform team lead, Aeria Games and Entertainment
Silicon Valley | Berlin | Tokyo | Sao Paulo
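
Before changing any of the settings discussed in this thread (noatime/nodiratime, nobarrier or barrier=0, the I/O scheduler), it can help to see what a host is using today. The following is only a rough sketch and not something posted by anyone in the thread: the function names, the sample mount points (/data1, /data2) and the sample devices are hypothetical, and it only parses text in the format Linux exposes through /proc/mounts and /sys/block/<device>/queue/scheduler.

```python
"""Report the tuning knobs discussed in this thread for a set of data volumes.

Assumptions (not from the thread): every function name, mount point, and
device name below is hypothetical and chosen only for illustration.
"""

# Mount options mentioned in the thread.
FLAGS_OF_INTEREST = ("noatime", "nodiratime", "nobarrier", "barrier=0")


def parse_mount_options(mounts_text):
    """Map mount point -> (fs type, set of options) from /proc/mounts-style text."""
    result = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        _device, mount_point, fs_type, options = fields[:4]
        result[mount_point] = (fs_type, set(options.split(",")))
    return result


def flags_set(mounts_text, mount_point):
    """Return which of the flags discussed in the thread are set on mount_point."""
    _fs_type, options = parse_mount_options(mounts_text)[mount_point]
    return [flag for flag in FLAGS_OF_INTEREST if flag in options]


def active_scheduler(scheduler_text):
    """Pick the bracketed entry from text such as 'noop deadline [cfq]'."""
    for token in scheduler_text.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    return None  # no bracketed entry found


# Hypothetical sample in /proc/mounts format: device, mount point, type, options.
SAMPLE_MOUNTS = """\
/dev/sdb1 /data1 xfs rw,noatime,nodiratime,attr2,inode64,noquota 0 0
/dev/sdc1 /data2 xfs rw,relatime,attr2,inode64,nobarrier,noquota 0 0
"""

if __name__ == "__main__":
    for mount_point in ("/data1", "/data2"):
        print(mount_point, flags_set(SAMPLE_MOUNTS, mount_point))
    print(active_scheduler("noop deadline [cfq]"))
```

On a real host, the same functions could be fed the contents of /proc/mounts and of /sys/block/<device>/queue/scheduler; the current readahead in KiB sits next to the latter in /sys/block/<device>/queue/read_ahead_kb. Whether changing any of these helps is, as noted above, likely workload-specific.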
