Mailing-List: contact user-help@cassandra.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@cassandra.apache.org
Received-SPF: pass (nike.apache.org: domain of scode@scode.org designates
 74.125.82.44 as permitted sender)
MIME-Version: 1.0
Sender: scode@scode.org
In-Reply-To: 
 <CAAnh3_97NGkMcB5nAtLaVsH2ROgkAry-b1A8bO0qSs2-TUypzg@mail.gmail.com>
References: 
 <CAAnh3_97NGkMcB5nAtLaVsH2ROgkAry-b1A8bO0qSs2-TUypzg@mail.gmail.com>
Date: Sun, 25 Sep 2011 12:59:19 +0200
Message-ID: 
 <CAO5xsd00XNt7y384vJCsmyqDjqieSFQofxCuQxe=Xj4v7+sFkA@mail.gmail.com>
Subject: Re: CMS GC initial-mark taking 6 seconds , bad?
From: Peter Schuller <peter.schuller@infidyne.com>
To: user@cassandra.apache.org
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable

> I see the following in my GC log
>
> 1910.513: [GC [1 CMS-initial-mark: 2598619K(26214400K)]
> 13749939K(49807360K), 6.0696680 secs] [Times: user=3D6.10 sys=3D0.00,
> real=3D6.07 secs]
>
> so there is a stop-the-world period of 6 seconds. does this sound bad
> ? or 6 seconds is OK =C2=A0and we should expect the built-in
> fault-tolerance of Cassandra handle this?

initial-mark pauses are stop-the-world, so a 6 second initial-mark
would have paused the node for those 6 seconds.

The initial mark is essentially marking roots for old-gen; that should
include thread stacks and such, but will also include younger
generations. You might read [1] which talks a bit about it; a
recommendation there is to make sure that initial marks happen right
after a young-gen collection, and they advise increasing heap size
sufficiently to allow an ininitial mark to trigger (I suppose by
heuristics) after the young gen collection, prior to the CMS trigger.
It makes sense, especially given that initial-mark is single-threaded,
to try do to that (and leave the young-gen smaller, collected by the
parallel collector). However I'm not entirely clear on what VM options
are required for this. I had a brief look at the code but it wasn't
obvious at cursory glance under what circumstances an initial mark is
triggered right after young-gen vs. not. In your case you clearly have
enough heap.

Can you correlate with ParNew collections and see if the initial mark
pauses seem to happen immediately after a ParNew, or somewhere in
between, in the cases where they take this long?

Also, as a mitigationg: What's your young generation size? One way to
mitigate the problem, if it is indeed the young gen marking that is
taking time, is to decrease the size of the young generation to leave
less work for initial marking. Normally the young gen is sized based
on expected pause times given parallel ParNew ollections, but if a
non-parallel initial-mark is having to do marking of the same contents
the pause time could be higher (hence the discussion above).

Also, is each initial mark this long, or is that something that
happens once in a while?

As for Cassandra dealing with it: It is definitely not a good thing to
have 6 second pauses. Even with all other nodes up, it takes time for
the dynamic snitch to realize what's going on and you will tend to see
a subset of requests to the cluster get 'stuck' in circumstances like
that. Also, if you're e.g. doing QUORUM at RF=3D3, if a node is down for
legitimate reasons, another node having a 6 second pause will by
necessity cause high latency for requests during that period.

[1] http://answerpot.com/showthread.php?1558705-CMS+initial+mark+pauses


--=20
/ Peter Schuller (@scode on twitter)