Subject: Re: Worst case #iops to read a row
From: Benjamin Black
To: user@cassandra.apache.org
Date: Tue, 13 Apr 2010 11:31:48 -0700

On Tue, Apr 13, 2010 at 10:48 AM, Time Less wrote:
>
>> > If I have 10B rows in my CF, and I can fit 10k rows per
>> > SStable, and the SStables are spread across 5 nodes, and I have 1 bloom

The error you are making is in thinking the Memtable thresholds are
the SSTable limits. They are not.

>> > filter false positive and 1 tombstone and ask the wrong node for the
>> > key,

Why would I ask the wrong node for the key? I know the tokens for
every node, so I know exactly which nodes have the replicas. If I am
asking the wrong node for a key, there is a bug.

>> > then:
>> >
>> > Mv = (((2B/10k)+1+1)*3)+1 == ((200,000)+2)*3+1 == 300,007 iops to read a
>> > key.
>>
>> This is a nonsensical arrangement. Assuming each SSTable is the size
>> of the default Memtable threshold (128MB), then each row is (128MB /
>> 10k) == 12.8k, and 10B rows == 128TB of raw data. A typical RF of 3
>> takes us to 384TB. The need for enough space for compactions takes us
>> to 768TB. That's not 5 nodes, it's more like 100+, and almost 2
>> orders of magnitude off your estimate,
>
> You started off so well. You laid out a couple of useful points:
>
> (1) For a 10B-row dataset with 12.8KB rows and RF=3, a Cassandra cluster
> requires 768TB. If you have less, you'll run into severe administration
> problems. This is not obvious, but it is critical and extremely useful.

The 384TB implied by RF=3 and 128TB of raw data is obvious. The
additional 384TB of space to permit a worst-case compaction (a
compaction of SSTables containing no tombstones) might not be
immediately obvious, but it does not meaningfully change the
situation: even at 384TB, your 5-node assumption is way off.
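To make that arithmetic concrete, here is a rough back-of-envelope
sketch in Python (the ~6TB of usable disk per node is my own
assumption, not a number from this thread):

# Back-of-envelope cluster sizing; per-node capacity is an assumption.
rows = 10_000_000_000              # 10B rows
row_size_bytes = 12_800            # ~12.8KB per row, from 128MB / 10k rows
rf = 3                             # replication factor

raw_tb = rows * row_size_bytes / 1e12      # ~128TB of raw data
replicated_tb = raw_tb * rf                # ~384TB after replication
with_compaction_tb = replicated_tb * 2     # ~768TB with compaction headroom

usable_per_node_tb = 6                     # ASSUMPTION: usable disk per node
print(with_compaction_tb / usable_per_node_tb)   # => ~128 nodes, not 5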
> (2) 12.8KB rowsize wants a >128MB memtable threshold. Is there a rule of
> thumb for this? memTableThreshold = rowsize * 100,000?

How frequently do you want to write SSTables? How much memory do you
want Memtables to consume? How long do you want to wait between
Memtable flushes? There is an entire wiki page on Memtable tuning:
http://wiki.apache.org/cassandra/MemtableThresholds
and a thorough discussion of the various tuning parameters around
buffering and writing here:
http://wiki.apache.org/cassandra/StorageConfiguration

Do you understand that you are assuming there have been no compactions,
which would be extremely bad practice given this number of SSTables?
A major compaction, as would be best practice at this volume, would
result in 1 SSTable per CF per node. One. Similarly, you are assuming
the update is only on the last replica checked, but the system is
going to read and write the first replica (the node that actually owns
that range based on its token) first in almost all situations.

Not worst case? If 'we' are coming up with arbitrarily bad situations,
why not assume 1 row per SSTable and lots of tombstones, in addition
to no compactions? Why not assume RF=100? Why not assume node failures
right in the middle of your query? The interesting question is not
'how bad can this get if you configure and operate things really
badly?', but 'how bad can this get if you configure and operate things
according to best practices?'.


b
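P.S. If you want to turn those questions into an actual number rather
than a one-size rule of thumb, the arithmetic is simple enough to
sketch in Python. The write rate, flush interval, and RAM budget below
are made-up illustrations, not recommendations:

# Rough sketch: derive a Memtable threshold from write rate and desired
# flush interval, then cap it by the memory you are willing to spend.
write_rate_mb_per_sec = 5         # ASSUMPTION: sustained write rate into this CF
target_flush_interval_sec = 300   # ASSUMPTION: flush roughly every 5 minutes
memtable_ram_budget_mb = 2048     # ASSUMPTION: RAM you can dedicate to this Memtable

threshold_mb = write_rate_mb_per_sec * target_flush_interval_sec  # 1500MB
threshold_mb = min(threshold_mb, memtable_ram_budget_mb)          # stay inside the RAM budget
print(threshold_mb)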