Subject: Re: 0.7 memory usage problem
From: Peter Schuller
To: user@cassandra.apache.org
Date: Mon, 27 Sep 2010 22:13:39 +0200

> You are saying I am doing 36000 inserts per second, when I am inserting 600
> rows. I thought that every row goes into one Node, so the work is done for a
> row, not a column. So my assumption is NOT true, the work is done on a column
> level? So if I reduce the number of columns I will get a "substantial"
> improvement in performance?

I am not sure about actual numbers, but in general there is a cost on a
per-column basis. I'm pretty sure it's lower than the per-row cost, but it
is still not free and should be on the same order of magnitude as the
per-row cost (I haven't benchmarked with this in mind; someone correct me
if I'm wrong).

I.e., you can't expect to put an arbitrary number of columns into a batch
mutation, even if they all affect just one row, and not have it affect
performance. Consider that individual columns are subject to conflict
resolution, consistency guarantees and indexed access.

Regarding reducing the number of columns: you will definitely get a
substantial improvement in terms of "row mutations per second" by
decreasing the number of columns mutated in each row; however, that is
comparing apples and oranges, since you would be inserting less data.
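To put rough numbers on it (the 60 columns per row here is only inferred
from the 36000 / 600 figures in your mail, so treat it as an assumption):

    # rough back-of-the-envelope only
    rows_per_sec = 600
    cols_per_row = 60                                     # inferred: 36000 / 600
    col_mutations_per_sec = rows_per_sec * cols_per_row   # = 36000

    # Halving cols_per_row should raise rows_per_sec, but only because each
    # row mutation then carries half the data; columns (or bytes) written
    # per second is the fairer throughput metric to compare.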
> Also, what do you mean by "distributing the client load across the
> cluster"? I am doing the writing on Node1, the reading on Node2, and the
> maintenance on Node3 (disabled the maintenance for now).
> Do you think it's better if I do writes on all 3 Nodes and reads on all 3
> Nodes as well?

The typical approach in real production use is to distribute requests to
all nodes in some kind of pseudo-random/round-robin fashion. Smart clients
might keep track of the up/down status of nodes, and possibly even weight
requests based on the performance characteristics of each node. But the
general idea is: spread the load out across all nodes.

Note, however, that depending on the actual situation it need not matter
much. For example, if you are nowhere near CPU bound or network bound on
the Cassandra side, evening out the RPC traffic does not make much of a
difference. But as a general recommendation, distributing the load across
machines is a good default choice of behavior (a rough sketch is appended
below my signature). (Of course, not doing so should not cause the stack
overflow you're seeing, so this is a separate issue.)

-- 
/ Peter Schuller
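P.S. To make "round robin" a bit more concrete, here is a very rough,
library-agnostic Python sketch of client-side node selection. It is not
tied to any particular Cassandra client; how you actually open connections
and issue requests against the chosen node is up to whichever client you use.

    import itertools
    import random

    class RoundRobinNodePicker(object):
        """Cycle through nodes, skipping ones currently marked down."""

        def __init__(self, nodes):
            nodes = list(nodes)
            random.shuffle(nodes)   # avoid all clients starting on node 1
            self._nodes = nodes
            self._cycle = itertools.cycle(nodes)
            self._down = set()

        def pick(self):
            for _ in range(len(self._nodes)):
                node = next(self._cycle)
                if node not in self._down:
                    return node
            # Everything is marked down; return something anyway so the
            # caller can still attempt the request.
            return next(self._cycle)

        def mark_down(self, node):
            self._down.add(node)

        def mark_up(self, node):
            self._down.discard(node)

    # Spread both reads and writes over all three nodes instead of
    # dedicating one node to writes and another to reads:
    picker = RoundRobinNodePicker(['node1:9160', 'node2:9160', 'node3:9160'])
    node = picker.pick()   # connect to this node for the next request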