From: aaron morton <aaron@thelastpickle.com>
Subject: Re: Hinted Handoff storage inflation
Date: Thu, 25 Oct 2012 14:05:13 +1300
To: user@cassandra.apache.org

Hints store the columns, row key, KS name and CF id(s) for each mutation, for each target node. An executed mutation, by contrast, stores only the most recent columns, collated with others under the same row key. So depending on the type of mutation, hints can take up considerably more space.

The worst case would be lots of overwrites. After that, writing a small amount of data to many rows results in a lot of the serialised space being devoted to row keys, the KS name and the CF id.

16GB is a lot though. What was the write workload like?

You can get an estimate of the number of keys in the Hints CF using nodetool cfstats. Some metrics exposed over JMX will also tell you how many hints are stored.
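For example, something along these lines will pull out the estimated key count for the hints CF (just a sketch, assuming the 1.1-era cfstats output; adjust the grep patterns if the field names differ on your build):

$ nodetool -h localhost cfstats | grep -A 20 "Column Family: HintsColumnFamily" | grep "Number of Keys"
# each row key in HintsColumnFamily should roughly correspond to a node
# that hints are currently being held for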
> This has a huge impact on write performance as well.
Yup. Hints are added to the same Mutation thread pool as normal mutations. They are processed asynchronously to the mutation request, but they still take resources to store.

You can adjust how long hints are collected for with max_hint_window_in_ms in the yaml file (a quick way to check the current settings is sketched after the quoted message below).

How long did the test run for?

Cheers

-----------------
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 25/10/2012, at 11:26 AM, Mattias Larsson <mlarsson@yahoo-inc.com> wrote:

> I'm testing various scenarios in a multi data center configuration. The setup is 10 Cassandra 1.1.5 nodes configured into two data centers, 5 nodes in each DC (RF DC1:3,DC2:3, write consistency LOCAL_QUORUM). I have a synthetic random data generator that I can run, and each run adds roughly 1GiB of data to each node:
>
> DC    Rack   Status  State    Load        Effective-Ownership
>
> DC1   RAC1   Up      Normal   1010.71 MB  60.00%
> DC2   RAC1   Up      Normal   1009.08 MB  60.00%
> DC1   RAC1   Up      Normal   1.01 GB     60.00%
> DC2   RAC1   Up      Normal   1 GB        60.00%
> DC1   RAC1   Up      Normal   1.01 GB     60.00%
> DC2   RAC1   Up      Normal   1014.45 MB  60.00%
> DC1   RAC1   Up      Normal   1.01 GB     60.00%
> DC2   RAC1   Up      Normal   1.01 GB     60.00%
> DC1   RAC1   Up      Normal   1.01 GB     60.00%
> DC2   RAC1   Up      Normal   1.01 GB     60.00%
>
> Now, if I kill all the nodes in DC2 and run the data generator again, I would expect roughly 2GiB to be added to each node in DC1 (local replicas + hints to the other data center). Instead I get this:
>
> DC    Rack   Status  State    Load        Effective-Ownership
>
> DC1   RAC1   Up      Normal   17.56 GB    60.00%
> DC2   RAC1   Down    Normal   1009.08 MB  60.00%
> DC1   RAC1   Up      Normal   17.47 GB    60.00%
> DC2   RAC1   Down    Normal   1 GB        60.00%
> DC1   RAC1   Up      Normal   17.22 GB    60.00%
> DC2   RAC1   Down    Normal   1014.45 MB  60.00%
> DC1   RAC1   Up      Normal   16.94 GB    60.00%
> DC2   RAC1   Down    Normal   1.01 GB     60.00%
> DC1   RAC1   Up      Normal   17.26 GB    60.00%
> DC2   RAC1   Down    Normal   1.01 GB     60.00%
>
> Checking the sstables on a node reveals this:
>
> -bash-3.2$ du -hs HintsColumnFamily/
> 16G     HintsColumnFamily/
> -bash-3.2$
>
> So it seems that what I would have expected to be 1GiB of hints is much larger in reality, a 15x-16x inflation. This has a huge impact on write performance as well.
>
> If I bring DC2 up again, eventually the load will drop down and even out to 2GiB across the entire cluster.
>
> I'm wondering if this inflation is intended, or if it is possibly a bug or something I'm doing wrong? Assuming this inflation is correct, what is the best way to deal with temporary connectivity issues with a second data center? Write performance is paramount in my use case. A 2x-3x overhead is doable, but not 15x-16x.
>
> Thanks,
> /dml
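To check the current hint settings on a node (just a sketch; the yaml path below is an assumption and varies by install and packaging):

$ grep -E "hinted_handoff_enabled|max_hint_window_in_ms" /etc/cassandra/cassandra.yaml
# the window is in milliseconds; lowering it caps how long a node keeps
# accumulating hints for a down peer, at the cost of needing a repair later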