From: Sylvain Lebresne <sylvain@datastax.com>
To: user@cassandra.apache.org
Date: Mon, 23 May 2011 19:48:26 +0200
Subject: Re: repair question

On Mon, May 23, 2011 at 7:17 PM, Daniel
Doubleday wrote:
> Hi all
>
> I'm a bit lost: I tried a repair yesterday with only one CF and that didn't really work the way I expected, but I thought that was a bug which only affects that special case.
>
> So I tried again for all CFs.
>
> I started with a nicely compacted machine with around 320GB of load. Total disc space on this node was 1.1TB.
>
> After it ran out of disc space (meaning I received around 700GB of data) I had a very brief look at the repair code again, and it seems to me that the repairing node will get all data for its range from all its neighbors.

The repaired node is only supposed to get data from its neighbors for rows it is not in sync with. How much is transferred depends on how far the node is out of sync compared to the other nodes. Now, there are a number of things that can make repair transfer more than what you would hope. For instance:

1) Even if only one column differs for a row, the full row is repaired. If you have a small number of huge rows, that can amount to quite some data uselessly transferred.

2) The merkle tree (which allows us to say whether 2 rows are in sync) doesn't necessarily have one hash per row, so in theory one out-of-sync column may imply the repair of more than one row.

3) https://issues.apache.org/jira/browse/CASSANDRA-2324 (which is fixed in 0.8)

Fortunately, the chance of getting hit by 1) is inversely proportional to the chance of getting hit by 2), and vice versa.

Anyway, the kind of excess data you're seeing is not something I would expect unless the node is really completely out of sync with all the other nodes. So in light of this, do you have more info on your own case? (Do you have lots of small rows, or a few large ones? Did you expect the node to be widely out of sync with the other nodes? Etc.)

--
Sylvain

> Is that true, and if so, is it the intended behavior?
> If so, one would rather need 5-6 times the disc space, given that the compactions that need to run after the sstable rebuild also need temp disc space.
>
> Cheers,
> Daniel
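[The granularity effect in point 2) above can be illustrated with a toy sketch. This is hypothetical illustration code, not Cassandra's actual Merkle tree implementation: when one leaf of the tree covers several rows, a single out-of-sync row forces every row under that leaf to be streamed, even the ones that already match.]

```python
import hashlib

def leaf_hash(rows):
    """Hash all rows covered by one Merkle leaf together."""
    h = hashlib.sha256()
    for key, value in sorted(rows.items()):
        h.update(key.encode())
        h.update(value.encode())
    return h.hexdigest()

def rows_to_repair(local, remote, leaf_size):
    """Compare leaf hashes; a mismatched leaf repairs ALL rows under it."""
    keys = sorted(set(local) | set(remote))
    to_repair = []
    for i in range(0, len(keys), leaf_size):
        chunk = keys[i:i + leaf_size]
        lh = leaf_hash({k: local.get(k, "") for k in chunk})
        rh = leaf_hash({k: remote.get(k, "") for k in chunk})
        if lh != rh:
            to_repair.extend(chunk)  # the whole leaf range gets streamed
    return to_repair

local = {f"row{i:02d}": "v" for i in range(8)}
remote = dict(local)
remote["row03"] = "different"  # only one row is actually out of sync

# With 4 rows per leaf, the single differing row drags 3 in-sync
# rows along with it; with 1 row per leaf, only row03 is repaired.
print(rows_to_repair(local, remote, leaf_size=4))
print(rows_to_repair(local, remote, leaf_size=1))
```

[This is also why 1) and 2) trade off against each other: lots of small rows means many rows share a leaf (granularity loss), while a few huge rows get fine-grained hashes but repair a lot of data per mismatched row.]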