From: aaron morton
To: user@cassandra.apache.org
Subject: Re: CL1 and CLQ with 5 nodes cluster and 3 alives node
Date: Tue, 23 Jul 2013 21:19:18 +1200

>> I really don't think I have more than 500 million rows ... any smart way to
>> count rows number inside the ks?

Use the output from nodetool cfstats, it has a row count and bloom filter size for each CF.

You may also want to upgrade to 1.1 to get global cache management, that can make things easier to manage.

Cheers

-----------------
Aaron Morton
Cassandra Consultant
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 23/07/2013, at 6:26 AM, Nate McCall <zznate.m@gmail.com> wrote:

> Do you have a copy of the specific stack trace? Given the version and
> CL behavior, one thing you may be experiencing is:
> https://issues.apache.org/jira/browse/CASSANDRA-4578
>
> On Mon, Jul 22, 2013 at 7:15 AM, cbertu81@libero.it <cbertu81@libero.it> wrote:
>> Hi Aaron, thanks for your help.
>>
>>> If you have more than 500 million rows you may want to check the
>>> bloom_filter_fp_chance, the old default was 0.000744 and the new (post 1.)
>>> number is 0.01 for size-tiered.
>>
>> I really don't think I have more than 500 million rows ... any smart way to
>> count rows number inside the ks?
>>
>>>> Now a question -- why with 2 nodes offline do all my applications stop
>>>> providing the service, even when a Consistency Level One read is invoked?
>>
>>> What error did the client get and what client are you using?
>>> It also depends on if/how the node fails. The later versions try to shut down
>>> when there is an OOM, not sure what 1.0 does.
>>
>> The exception was a TTransportException -- I am using the Pelops client.
>>
>>> If the node went into a zombie state the clients may have been timing out.
>>> They should then move on to another node.
>>> If it had started shutting down the client should have gotten some immediate
>>> errors.
>>
>> It didn't shut down, it was more like in a zombie state.
>> One more question: I'm experiencing some wrong counters (which are very
>> important in my platform since they are used to keep user points and generate
>> the TopX users) -- could it be related to this problem? The problem is that for
>> some users (not all) the counter column increased its value.
>>
>> After such a crash in 1.0 is there any best practice to follow? (nodetool or
>> something?)
>>
>> Cheers,
>> Carlo
>>
>>>
>>> Cheers
>>>
>>>
>>> -----------------
>>> Aaron Morton
>>> Cassandra Consultant
>>> New Zealand
>>>
>>> @aaronmorton
>>> http://www.thelastpickle.com
>>>
>>> On 19/07/2013, at 5:02 PM, cbertu81@libero.it wrote:
>>>
>>>> Hi all,
>>>> I'm experiencing some problems after 3 years of Cassandra in production (from
>>>> 0.6 to 1.0.6) -- twice in 3 weeks 2 nodes crashed with an OutOfMemory
>>>> exception.
>>>> In the log I can read the warning about the low available heap ... now I'm
>>>> increasing my RAM a little, my Java heap (1/4 of the RAM) and reducing the
>>>> size of rows and memtable thresholds. Other tips?
>>>>
>>>> Now a question -- why with 2 nodes offline do all my applications stop
>>>> providing the service, even when a Consistency Level One read is invoked?
>>>> I'd expected this behaviour:
>>>>
>>>> CL1 operations keep working
>>>> more than 80% of CLQ operations working (the offline nodes were 2 and 5; in a
>>>> clockwise key distribution only writes to the fifth node should impact node 2)
>>>> most of all CLALL operations (that I don't use) failing
>>>>
>>>> The situation instead was that ALL services stopped responding, throwing a
>>>> TTransportException ...
>>>>
>>>> Thanks in advance
>>>>
>>>> Carlo
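For a rough sense of why bloom_filter_fp_chance matters once row counts reach the hundreds of millions, here is a back-of-the-envelope sketch. It uses the textbook sizing formula for an ideal bloom filter, not Cassandra's actual implementation, so treat the numbers as approximate; the function names are made up for illustration. Under that formula the old 0.000744 setting costs about 15 bits per key and 0.01 about 9.6 bits per key. The real per-CF figures are the ones nodetool cfstats reports.

```python
import math


def bloom_bits_per_key(fp_chance: float) -> float:
    """Bits per key for an ideal bloom filter with the given false-positive rate.

    Textbook formula: m/n = -ln(p) / (ln 2)^2. This is an approximation and
    not necessarily what Cassandra allocates internally.
    """
    return -math.log(fp_chance) / (math.log(2) ** 2)


def bloom_size_mb(rows: int, fp_chance: float) -> float:
    """Approximate filter size in MiB for `rows` keys (hypothetical helper)."""
    return rows * bloom_bits_per_key(fp_chance) / 8 / 1024 / 1024


# 500 million rows is the threshold mentioned in the thread.
rows = 500_000_000
for fp in (0.000744, 0.01):
    print(f"fp_chance={fp}: {bloom_bits_per_key(fp):.1f} bits/key, "
          f"~{bloom_size_mb(rows, fp):.0f} MiB for {rows} rows")
```

The point of the comparison is only the ratio: raising the false-positive chance trades some extra disk reads for a noticeably smaller in-heap filter.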
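The availability Carlo expected with nodes 2 and 5 down can be checked with a small sketch. It assumes things the thread does not state: a replication factor of 3, five equally sized token ranges, and replicas placed on the next nodes clockwise around the ring. The function names are hypothetical. It models only which replicas are reachable, not a client timing out against a node in a zombie state, which is what the thread suggests actually happened.

```python
def replicas(primary, nodes, rf):
    """Replica set for the range owned by `primary`: the node itself plus the
    next rf-1 nodes clockwise (assumed simple ring placement)."""
    i = nodes.index(primary)
    return [nodes[(i + k) % len(nodes)] for k in range(rf)]


def available_fraction(nodes, down, rf, required):
    """Fraction of token ranges with at least `required` live replicas."""
    ok = 0
    for n in nodes:
        live = [r for r in replicas(n, nodes, rf) if r not in down]
        if len(live) >= required:
            ok += 1
    return ok / len(nodes)


nodes = [1, 2, 3, 4, 5]
down = {2, 5}            # the two crashed nodes from the thread
rf = 3                   # assumption: replication factor is not given
quorum = rf // 2 + 1     # QUORUM needs a majority of replicas

print("ONE:   ", available_fraction(nodes, down, rf, 1))
print("QUORUM:", available_fraction(nodes, down, rf, quorum))
print("ALL:   ", available_fraction(nodes, down, rf, rf))
```

Under these assumptions every range still has a live replica for ONE, four of the five ranges still reach QUORUM, and no range can satisfy ALL, which matches the behaviour Carlo expected rather than the total outage he saw.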