incubator-cassandra-user mailing list archives

From Robin Verlangen <ro...@us2.nl>
Subject Re: are counters stable enough for production?
Date Tue, 18 Sep 2012 10:03:54 GMT
@Alain: " If you don't have much time to read this, just know that it's a
random error, which appear with low frequency, but regularly, seems to
appear quite randomly, and nobody knows the reason why it appears yet.
Also, you need to know that it's repaired by taking the highest of the
two inconsistent values."

I was aware of that. Doesn't repairing by taking the higher of two
inconsistent values risk producing inflated counts? Maybe even much higher
values if such repairs take place multiple times?
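To make my worry concrete, here is a minimal Python sketch (with made-up numbers, a toy model rather than Cassandra's actual internals) of a merge rule that always keeps the highest of two inconsistent values:

```python
# Toy model of the repair rule described above: when two replicas disagree,
# keep the higher of the two values. Numbers are purely illustrative.
def repair(replica_a, replica_b):
    """Resolve an inconsistency by keeping the higher value on both replicas."""
    highest = max(replica_a, replica_b)
    return highest, highest

# Suppose the true count is 100, but one replica was wrongly inflated to 130.
a, b = 100, 130
a, b = repair(a, b)   # both replicas now read 130; the inflation is locked in

# If a second spurious inflation occurs later, it sticks as well.
b += 25
a, b = repair(a, b)   # both replicas now read 155
```

Under this rule a spuriously high value is never revised downward, so repeated repairs can only preserve or compound an overcount, which is exactly what I'm asking about.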

Best regards,

Robin Verlangen
Software engineer
W http://www.robinverlangen.nl
E robin@us2.nl

Disclaimer: The information contained in this message and attachments is
intended solely for the attention and use of the named addressee and may be
confidential. If you are not the intended recipient, you are reminded that
the information remains the property of the sender. You must not use,
disclose, distribute, copy, print or rely on this e-mail. If you have
received this message in error, please contact the sender immediately and
irrevocably delete this message and any copies.



2012/9/18 Alain RODRIGUEZ <arodrime@gmail.com>

> Hi
>
> @Robin, about the log message:
>
> "Sometimes you can see log messages that indicate that counters are out of
> sync in the cluster and they get "repaired". My guess would be that the
> repairs actually destroys it, however I have no knowledge of the underlying
> techniques. "
>
> Here is an answer from Sylvain Lebresne who, if I understood correctly,
> is in charge of Cassandra counters among many other things.
>
>
> http://grokbase.com/t/cassandra/user/125zr6n1q9/invalid-counter-shard-errors/1296as2dpw#1296as2dpw
>
>
> If you don't have much time to read this, just know that it's a random
> error that appears at low frequency, but regularly, seemingly at random,
> and nobody knows yet why it appears. Also, you need to know that it's
> repaired by taking the higher of the two inconsistent values.
>
> @Bartek, about the counter failures:
>
> We are in a similar situation: we can't afford wrong values, especially
> since we track the same information in different ways and can't afford
> showing different values for the same thing to our customers...
>
> We had a lot of trouble at the start using counters (counts for a period
> vanishing, or increasing x2, x3, or randomly...). We finally managed to
> get something stable. We still have some over-counts, but it's nothing big
> (roughly a 0.01% error that we can afford). What's more, we could replay
> some logs to rebuild our counters, but we don't do that yet; maybe someday...
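Such a log replay could look roughly like this (a hypothetical Python sketch; the log format and names are invented for illustration, not Alain's actual setup):

```python
# Hypothetical sketch: rebuilding counter values by replaying an increment
# log. Each entry carries a unique id so that duplicated deliveries (e.g.
# a write retried after a timeout) are only counted once.
def rebuild_counters(log_entries):
    """Recompute all counters from scratch, skipping duplicate log entries."""
    counters = {}
    seen_ids = set()
    for entry_id, counter_key, delta in log_entries:
        if entry_id in seen_ids:   # duplicate delivery: ignore it
            continue
        seen_ids.add(entry_id)
        counters[counter_key] = counters.get(counter_key, 0) + delta
    return counters

# Entry 2 is delivered twice but only counted once.
log = [(1, "page_views", 1), (2, "page_views", 1), (2, "page_views", 1)]
rebuilt = rebuild_counters(log)
```

The unique ids are what make the replay safe against the retried writes that can otherwise cause overcounts.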
>
> We had trouble at the start because of two things:
>
> - Hardware that wasn't powerful enough (we started with t1.micro from
> Amazon, then small, medium, and now we use m1.large)
> - Wrong configuration of Cassandra/phpCassa (mostly Cassandra...); we
> learned a lot before finally getting a stable cluster.
>
> Get JNA installed, allocate enough heap memory, and increase the timeouts
> in your Cassandra client (overcounts are often due to timeouts, themselves
> often produced by a highly loaded CPU). Get your CPU load down (by adding
> memory or configuring it well, possibly tuning compaction_throughput_mb_per_sec
> and disabling multithreaded_compaction)... I can't tell you much more about
> configuration because there are a lot of different things you can do to
> improve Cassandra performance, and I'm not an expert :/.
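For reference, the two compaction settings mentioned above live in cassandra.yaml; a sketch with purely illustrative values (not recommendations, tune for your own hardware):

```yaml
# cassandra.yaml -- illustrative values only
# Throttle compaction I/O so it competes less with live traffic:
compaction_throughput_mb_per_sec: 16
# Keep multithreaded compaction off, as suggested above:
multithreaded_compaction: false
```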
>
> I hope this helps somehow and that you'll find out what's wrong.
>
> Alain
>
> 2012/9/18 rohit bhatia <rohit2412@gmail.com>
>
>> @Robin
>> I'm pretty sure the GC issue is due to counters only, since we have
>> only write-heavy counter-incrementing traffic.
>> GC frequency also increases linearly with write load.
>>
>> @Bartlomiej
>> Under stress testing, we see GC frequency increase and, consequently,
>> write latency rise to several milliseconds.
>> At 50k qps we had GC running every 1-2 seconds, and since each ParNew
>> takes around 100ms, we were spending 10% of each server's time GCing.
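The 10% figure follows directly from those numbers; a quick back-of-the-envelope in Python:

```python
# Back-of-the-envelope: fraction of wall-clock time spent in ParNew pauses,
# using the rough figures quoted above (one ~100ms pause every ~1 second).
pause_ms = 100       # approximate ParNew pause duration
interval_ms = 1000   # one collection every 1-2 seconds; take the worst case
fraction = pause_ms / interval_ms
print(f"{fraction:.0%}")  # -> 10% of each server's time spent GCing
```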
>>
>> Also, we don't have persistent connections, but testing with
>> persistent connections gives roughly the same result.
>>
>> At a traffic of roughly 20k qps for 8 nodes with RF 2, we have young
>> gen GC running on each node every 4 seconds (approximately).
>> We have a young gen heap size of 3200M, which is already too big by any
>> standard.
>>
>> Also, decreasing the replication factor from 2 to 1 reduced the GC
>> frequency 5-6 times.
>>
>> Any Advice?
>>
>> Also, our traffic is evenly distributed.
>> On Tue, Sep 18, 2012 at 1:36 PM, Robin Verlangen <robin@us2.nl> wrote:
>> > We've not been trying to create inconsistencies as you describe above,
>> > but it seems plausible that those situations cause problems.
>> >
>> > Sometimes you can see log messages that indicate that counters are out
>> > of sync in the cluster and get "repaired". My guess would be that the
>> > repairs actually destroy them; however, I have no knowledge of the
>> > underlying techniques. I think this because those read repairs happen
>> > a lot (as you mention: lots of reads) and might get over-repaired or
>> > something? However, this is all just a guess. I hope someone with a lot
>> > of knowledge about Cassandra internals can shed some light on this.
>> >
>> > Best regards,
>> >
>> > Robin Verlangen
>> > Software engineer
>> >
>> > W http://www.robinverlangen.nl
>> > E robin@us2.nl
>> >
>> >
>> > 2012/9/18 Bartłomiej Romański <br@sentia.pl>
>> >>
>> >> Garbage is one more issue we are having with counters. We are
>> >> operating under very heavy load. Counters are spread over 7 nodes with
>> >> SSD drives, and we often see CPU usage between 90-100%. We are doing
>> >> mostly reads. Latency is very important for us, so GC pauses taking
>> >> longer than 10ms (often around 50-100ms) are very annoying.
>> >>
>> >> I don't have actual numbers right now, but we've also got the
>> >> impression that Cassandra generates "too much" garbage. Is it
>> >> possible that counters are somehow to blame?
>> >>
>> >> @Rohit: Did you try something more stressful? Like sending more
>> >> traffic to a node than it can actually handle, turning nodes up and
>> >> down, or changing the topology (moving/adding nodes)? I believe our
>> >> problems come from very high load combined with operations like these
>> >> (adding new nodes, replacing dead ones, etc.). I was expecting that
>> >> Cassandra would fail some requests, lose consistency temporarily, or
>> >> something like that in such cases, but generating highly incorrect
>> >> values was very disappointing.
>> >>
>> >> Thanks,
>> >> Bartek
>> >>
>> >>
>> >> On Tue, Sep 18, 2012 at 9:30 AM, Robin Verlangen <robin@us2.nl> wrote:
>> >> > @Rohit: We also use counters quite a lot (let's say 2,000 increments /
>> >> > sec), but we don't see 50-100KB of garbage per increment. Are you sure
>> >> > that memory is coming from your counters?
>> >> >
>> >> > Best regards,
>> >> >
>> >> > Robin Verlangen
>> >> > Software engineer
>> >> >
>> >> > W http://www.robinverlangen.nl
>> >> > E robin@us2.nl
>> >> >
>> >> >
>> >> > 2012/9/18 rohit bhatia <rohit2412@gmail.com>
>> >> >>
>> >> >> We use counters in an 8-node cluster with RF 2 on Cassandra 1.0.5.
>> >> >> We use phpcassa and execute CQL queries through Thrift to work with
>> >> >> composite types.
>> >> >>
>> >> >> We do not have any problem with overcounts, as we tally against an
>> >> >> RDBMS daily.
>> >> >>
>> >> >> It works fine, but we are having some GC pressure in the young
>> >> >> generation. By my calculation, around 50-100 KB of garbage is
>> >> >> generated on every counter increment.
>> >> >> Is this memory usage expected of counters?
>> >> >>
>> >> >> On Tue, Sep 18, 2012 at 7:16 AM, Bartłomiej Romański <br@sentia.pl>
>> >> >> wrote:
>> >> >> > Hi,
>> >> >> >
>> >> >> > Does anyone have any experience with using Cassandra counters in
>> >> >> > production?
>> >> >> >
>> >> >> > We rely heavily on them, and recently we've had a few very serious
>> >> >> > problems. Our counter values suddenly became a few times higher than
>> >> >> > expected. From the business point of view this is a disaster :/ Also,
>> >> >> > there are a few open major bugs related to them, some of them open
>> >> >> > for quite a long time (months).
>> >> >> >
>> >> >> > We are seriously considering going back to other solutions (e.g. SQL
>> >> >> > databases). We simply cannot afford incorrect counter values. We can
>> >> >> > tolerate losing a few increments from time to time, but we cannot
>> >> >> > tolerate counters suddenly being 3 times higher or lower than the
>> >> >> > expected values.
>> >> >> >
>> >> >> > What is the current status of counters? Should I consider them a
>> >> >> > production-ready feature and assume we just had some bad luck? Or
>> >> >> > should I rather consider them an experimental feature and look for
>> >> >> > some other solutions?
>> >> >> >
>> >> >> > Do you have any experience with them? Any comments would be very
>> >> >> > helpful for us!
>> >> >> >
>> >> >> > Thanks,
>> >> >> > Bartek
>> >> >
>> >> >
>> >
>> >
>>
>
>
