Mailing-List: contact user-help@cassandra.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@cassandra.apache.org
Received-SPF: pass (athena.apache.org: domain of me@matthiasb.com designates
 209.85.215.44 as permitted sender)
MIME-Version: 1.0
From: Matthias Broecheler <me@matthiasb.com>
Date: Mon, 15 Oct 2012 13:42:59 -0700
Message-ID: 
 <CAEsQWxoesptSP3+M5R32cUVXQNDYq7QnOSzHFA1Rpu54y90o_A@mail.gmail.com>
Subject: RF update
To: user@cassandra.apache.org
Content-Type: multipart/alternative; boundary=f46d042dfc957249de04cc1f167c

--f46d042dfc957249de04cc1f167c
Content-Type: text/plain; charset=ISO-8859-1

Hey,

we are writing a lot of data into a Cassandra cluster for a batch loading
use case. We cannot use the sstable batch loader, so in order to speed up
the loading process we are using RF=1 while the data is loading. After the
load is complete, we want to increase the RF. To do that, we update the
RF in the schema and then run the node repair tool on each Cassandra
instance to stream the data over. However, we are noticing that this
process is slowed down by a lot of compactions (the actual streaming of
data only takes a couple of minutes).
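
For concreteness, here is a minimal sketch of the procedure described above.
It only builds the commands and does not run them. The keyspace name
`my_keyspace`, the target RF of 3, the host addresses, and the use of
SimpleStrategy are all assumptions for illustration, and the exact
ALTER KEYSPACE syntax depends on the Cassandra/CQL version in use:

```python
from typing import List


def alter_keyspace_cql(keyspace: str, rf: int) -> str:
    """Build a CQL statement that raises the replication factor.

    Assumes SimpleStrategy and the map-style `replication` syntax;
    older CQL versions use a different form.
    """
    return (
        f"ALTER KEYSPACE {keyspace} WITH replication = "
        f"{{'class': 'SimpleStrategy', 'replication_factor': {rf}}};"
    )


def repair_commands(keyspace: str, hosts: List[str]) -> List[List[str]]:
    """Build one `nodetool repair` invocation per node, in the order given."""
    return [["nodetool", "-h", host, "repair", keyspace] for host in hosts]


if __name__ == "__main__":
    hosts = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]  # hypothetical node addresses
    # Step 1: raise the RF in the schema (run once, e.g. via cqlsh).
    print(alter_keyspace_cql("my_keyspace", 3))
    # Step 2: repair each node so the new replicas receive their data.
    for cmd in repair_commands("my_keyspace", hosts):
        print(" ".join(cmd))
```

The repairs are listed per node because the repair tool is run on each
instance in turn after the schema change.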

Cassandra is already running a major compaction after the data loading
process has completed. But then there seem to be two more compactions (one
on the sender and one on the receiver), and those take a very long
time even on AWS high I/O instances with no compaction throttling.

Question: These additional compactions seem redundant, since there are no
reads or writes on the cluster after the first major compaction
(immediately after the data load). Is that right? And if so, what can we do
to avoid them? We are currently waiting multiple days for them to finish.

Thank you very much for your help,
Matthias

--f46d042dfc957249de04cc1f167c--