incubator-cassandra-user mailing list archives

From Alain RODRIGUEZ <arodr...@gmail.com>
Subject Re: are counters stable enough for production?
Date Tue, 18 Sep 2012 09:48:56 GMT
Hi

@Robin, about the log message:

"Sometimes you can see log messages that indicate that counters are out of
sync in the cluster and they get "repaired". My guess would be that the
repairs actually destroys it, however I have no knowledge of the underlying
techniques. "

Here is an answer from Sylvain Lebresne who, if I understood correctly,
is in charge of Cassandra counters and many other things:

http://grokbase.com/t/cassandra/user/125zr6n1q9/invalid-counter-shard-errors/1296as2dpw#1296as2dpw


If you don't have much time to read it, just know that this error appears
with low frequency but regularly, seemingly at random, and nobody knows
yet why it happens. Also, you need to know that it's repaired by taking
the highest of the two inconsistent values.
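
To make the "take the highest" rule concrete, here is a minimal sketch in
Python. It is not Cassandra's code: the Shard type, its field names and the
reconcile function are all hypothetical, and the "newer clock wins" branch is
my assumption about the normal case. Only the last branch (same clock,
different counts, keep the highest) is what the linked thread describes.

```python
from typing import NamedTuple


class Shard(NamedTuple):
    """Hypothetical, simplified view of one counter shard."""
    node_id: str
    clock: int
    count: int


def reconcile(a: Shard, b: Shard) -> Shard:
    """Pick one of two copies of the same shard.

    Assumption: a higher clock means a newer copy and simply wins.
    If the clocks are equal but the counts differ, the copies are
    inconsistent and the highest count is kept.
    """
    if a.node_id != b.node_id:
        raise ValueError("shards belong to different nodes")
    if a.clock != b.clock:
        return a if a.clock > b.clock else b
    return a if a.count >= b.count else b


left = Shard("node-1", clock=5, count=10)
right = Shard("node-1", clock=5, count=12)
print(reconcile(left, right))
```

Picking the highest can never lose increments, but it can keep a value that
is too high.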

@Bartek, about the counter failures:

We are in a similar situation: we can't afford wrong values, all the more
so because we track the same information in different ways and we can't
afford to show different values for the same thing to our customers...

We had a lot of trouble with counters at the start (counts for a period
vanishing, or increasing by x2, x3 or some random amount). We finally
managed to get something stable. We still have some over-counts but it's
nothing big (around a 0.01% error, which we can afford). Moreover, we could
replay some logs to rebuild our counters, but we don't do it yet, maybe
someday...

We had trouble at the start because of 2 things:

- Hardware not powerful enough (we started with t1.micro from Amazon, then
small, medium, and now we use m1.large)
- Wrong configuration of Cassandra/phpCassa (mostly Cassandra...). We
learnt a lot before finally getting a stable cluster.

Get JNA installed, give Cassandra enough heap memory, increase the timeouts
in your Cassandra client (overcounts are often due to timeouts, themselves
often caused by a heavily loaded CPU), and get your CPU load down (by adding
memory or configuring it well, possibly tuning
compaction_throughput_mb_per_sec and disabling multithreaded_compaction)...
I can't tell you much more about config because there are a lot of
different things you can do to improve Cassandra performance and I'm not
an expert :/.
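
Here is a small sketch of why timeouts lead to overcounts, as I understand
it: an increment is not idempotent, so if the server applies the write but
the client gives up waiting and retries, the increment is applied twice.
This is a toy simulation in Python, not phpCassa or Cassandra code;
FlakyCounter and increment_with_retry are made-up names.

```python
class FlakyCounter:
    """Toy stand-in for a counter column (hypothetical, not a client API)."""

    def __init__(self, timeout_on_calls):
        self.value = 0
        self.calls = 0
        self.timeout_on_calls = set(timeout_on_calls)

    def increment(self, delta=1):
        self.calls += 1
        # The write is applied on the "server" side first...
        self.value += delta
        # ...but the acknowledgement may arrive too late for the client.
        if self.calls in self.timeout_on_calls:
            raise TimeoutError("client gave up waiting for the ack")


def increment_with_retry(counter, delta=1, max_attempts=3):
    """Naive retry: repeats the increment whenever a timeout is seen."""
    for _ in range(max_attempts):
        try:
            counter.increment(delta)
            return
        except TimeoutError:
            continue
    raise TimeoutError("all attempts timed out")


counter = FlakyCounter(timeout_on_calls={1})  # first call times out
increment_with_retry(counter)
print(counter.value)  # 2, although only one increment was intended
```

This is why a longer client timeout and a lower CPU load help: fewer
timeouts means fewer blind retries.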

Hope it may help somehow and you'll find out what's wrong.

Alain

2012/9/18 rohit bhatia <rohit2412@gmail.com>

> @Robin
> I'm pretty sure the GC issue is due to counters only. Since we have
> only write-heavy counter incrementing traffic.
> GC Frequency also increases linearly with write load.
>
> @Bartlomiej
> On Stress Testing, we see GC frequency and consequently write latency
> increase to several milliseconds.
> At 50k qps we had GC running every 1-2 second. And since each Parnew
> takes around 100ms, we were spending 10% of each server's time GCing.
>
> Also, we don't have persistent connections, but testing with
> persistent connections give roughly the same result.
>
> At a traffic of roughly 20k qps for 8 nodes with RF 2, we have Young
> Gen GC running on each node every 4 seconds (approximately).
> We have a young gen heap size of 3200M which is already too big by any
> standards.
>
> Also decreasing Replication factor from 2 to 1 reduced the GC
> frequency 5-6 times.
>
> Any Advice?
>
> Also, our traffic is evenly
> On Tue, Sep 18, 2012 at 1:36 PM, Robin Verlangen <robin@us2.nl> wrote:
> > We've not been trying to create inconsistencies as you describe above. But
> > it seems legit that those situations cause problems.
> >
> > Sometimes you can see log messages that indicate that counters are out of
> > sync in the cluster and they get "repaired". My guess would be that the
> > repairs actually destroys it, however I have no knowledge of the underlying
> > techniques. I think this because of the fact that those read repairs happen
> > a lot (as you mention: lots of reads) and might get over-repaired or
> > something? However, this is all just a guess. I hope someone with a lot
> > knowledge about Cassandra internals can shed some light on this.
> >
> > Best regards,
> >
> > Robin Verlangen
> > Software engineer
> >
> > W http://www.robinverlangen.nl
> > E robin@us2.nl
> >
> > Disclaimer: The information contained in this message and attachments is
> > intended solely for the attention and use of the named addressee and may be
> > confidential. If you are not the intended recipient, you are reminded that
> > the information remains the property of the sender. You must not use,
> > disclose, distribute, copy, print or rely on this e-mail. If you have
> > received this message in error, please contact the sender immediately and
> > irrevocably delete this message and any copies.
> >
> >
> >
> > 2012/9/18 Bartłomiej Romański <br@sentia.pl>
> >>
> >> Garbage is one more issue we are having with counters. We are
> >> operating under very heavy load. Counters are spread over 7 nodes with
> >> SSD drives and we often seeing CPU usage between 90-100%. We are doing
> >> mostly reads. Latency is very important for us so GC pauses taking
> >> longer than 10ms (often around 50-100ms) are very annoying.
> >>
> >> I don't have actual numbers right now, but we've also got the
> >> impressions that cassandra generates "too much" garbage. Is there a
> >> possible that counters are somehow guilty?
> >>
> >> @Rohit: Did you tried something more stressful? Like sending more
> >> traffic to a node that it can actually handle, turning nodes up and
> >> down, changing the topology (moving/adding nodes)? I believe our
> >> problems comes from very high load and some operations like this
> >> (adding new nodes, replacing dead ones etc...). I was expecting that
> >> cassandra will fail some request, loose consistency temporarily or
> >> something like that in such cases, but generation highly incorrect
> >> values was very disappointing.
> >>
> >> Thanks,
> >> Bartek
> >>
> >>
> >> On Tue, Sep 18, 2012 at 9:30 AM, Robin Verlangen <robin@us2.nl> wrote:
> >> > @Rohit: We also use counters quite a lot (lets say 2000 increments / sec),
> >> > but don't see the 50-100KB of garbage per increment. Are you sure that
> >> > memory is coming from your counters?
> >> >
> >> > Best regards,
> >> >
> >> > Robin Verlangen
> >> > Software engineer
> >> >
> >> > W http://www.robinverlangen.nl
> >> > E robin@us2.nl
> >> >
> >> >
> >> >
> >> >
> >> > 2012/9/18 rohit bhatia <rohit2412@gmail.com>
> >> >>
> >> >> We use counters in a 8 node cluster with RF 2 in cassandra 1.0.5.
> >> >> We use phpcassa and execute cql queries through thrift to work with
> >> >> composite types.
> >> >>
> >> >> We do not have any problem of overcounts as we tally with RDBMS daily.
> >> >>
> >> >> It works fine but we are having some GC pressure for young generation.
> >> >> Per my calculation around 50-100 KB of garbage is generated every
> >> >> counter increment.
> >> >> Is this memory usage expected of counters?
> >> >>
> >> >> On Tue, Sep 18, 2012 at 7:16 AM, Bartłomiej Romański <br@sentia.pl> wrote:
> >> >> > Hi,
> >> >> >
> >> >> > Does anyone have any experience with using Cassandra counters in
> >> >> > production?
> >> >> >
> >> >> > We rely heavily on them and recently we've got a few very serious
> >> >> > problems. Our counters values suddenly became a few times higher than
> >> >> > expected. From the business point of view this is a disaster :/ Also
> >> >> > there a few open major bugs related to them. Some of them for quite
> >> >> > long (months).
> >> >> >
> >> >> > We are seriously considering going back to other solutions (e.g. SQL
> >> >> > databases). We simply cannot afford incorrect counter values. We can
> >> >> > tolerate loosing a few increments from time to time, but we cannot
> >> >> > tolerate having counters suddenly 3 times higher or lower than the
> >> >> > expected values.
> >> >> >
> >> >> > What is the current status of counters? Should I consider them a
> >> >> > production-ready feature and we just have some bad luck? Or should I
> >> >> > rather consider them as a experimental-feature and look for some
> >> >> > other solutions?
> >> >> >
> >> >> > Do you have any experiences with them? Any comments would be very
> >> >> > helpful for us!
> >> >> >
> >> >> > Thanks,
> >> >> > Bartek
> >> >
> >> >
> >
> >
>
