cassandra-user mailing list archives

From Marcelo Elias Del Valle <marc...@s1mbi0se.com.br>
Subject Re: Batch of prepared statements exceeding specified threshold
Date Mon, 30 Jun 2014 18:52:06 GMT
Hi,

I think it's a bit late for this reply, but anyway...
We hired support from http://thelastpickle.com/ and, thanks to them, we were
able to solve our issue.
What was causing this behavior was a large query being executed by mistake
in our code.
We had to open the Java heap dump alongside the Cassandra source to find the
query that was taking up the large amount of memory.
Lesson learned for me: good Cassandra support is really necessary and
useful; it's not like Oracle's, IBM's, etc...

Best regards,
Marcelo Valle.



2014-06-20 18:15 GMT-03:00 Pavel Kogan <pavel.kogan@cortica.com>:

> Ok, in my case it was straightforward. It is just a warning, which says
> that batches with a large data size (above 5 KB) can sometimes lead to
> node instability (why?). This limit seems to be hard-coded; I didn't find
> any way to configure it externally. Anyway, removing the batch, and giving
> up atomicity, resolved the issue for me.
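[Giving up atomicity isn't the only option: one can also split the work so
that no single batch crosses the warn limit. A minimal sketch, not from the
thread: the helper name is made up, the 5120-byte figure comes from the
warnings quoted further down, and real clients would have to estimate the
serialized mutation size rather than know it exactly.]

```python
# Sketch: greedily group mutation payloads so each batch stays under the
# ~5 KB warn threshold quoted in the log messages in this thread.
# An oversized single mutation gets a group of its own (it will warn anyway).
WARN_THRESHOLD = 5120  # bytes, from "exceeding specified threshold of 5120"

def split_for_threshold(payload_sizes, limit=WARN_THRESHOLD):
    groups, current, current_size = [], [], 0
    for size in payload_sizes:
        if current and current_size + size > limit:
            groups.append(current)      # flush the full group
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        groups.append(current)
    return groups

# Two ~100 KB blobs (as in this thread) can never share a batch under the limit:
print(split_for_threshold([100_000, 100_000]))  # two single-item groups
```

[Note that splitting this way loses atomicity across the resulting groups,
exactly as dropping the batch does.]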
>
>
> http://mail-archives.apache.org/mod_mbox/cassandra-commits/201404.mbox/%3Ceee5dd5bc4794ef0b5c5153fdb583ac0@git.apache.org%3E
>
>
> On Fri, Jun 20, 2014 at 3:55 PM, Pavel Kogan <pavel.kogan@cortica.com>
> wrote:
>
>> Logged batch.
>>
>>
>> On Fri, Jun 20, 2014 at 2:13 PM, DuyHai Doan <doanduyhai@gmail.com>
>> wrote:
>>
>>> I think some figures from "nodetool tpstats" and "nodetool
>>> compactionstats" may help us see things more clearly.
>>>
>>> And Pavel, when you said batch, did you mean a LOGGED batch or an
>>> UNLOGGED batch?
>>>
>>>
>>>
>>>
>>>
>>> On Fri, Jun 20, 2014 at 8:02 PM, Marcelo Elias Del Valle <
>>> marcelo@s1mbi0se.com.br> wrote:
>>>
>>>> If you have 32 GB RAM, the heap is probably 8 GB.
>>>> 200 writes of 100 KB per second would be 20 MB/s in the worst case,
>>>> supposing all writes for a replica go to a single node.
>>>> I really don't see any reason why it should be filling up the heap.
>>>> Anyone else?
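[Marcelo's worst-case figure is easy to check, and it is worth noting how
much smaller the rate is if Pavel's 100-200 writes arrive per 15-second
window rather than per second. A quick back-of-envelope, pure arithmetic,
no Cassandra involved:]

```python
# Back-of-envelope check of the ingest figures quoted in this thread.
# Pavel reported 100-200 writes of up to 100 KB every 15 s.
writes = 200
write_size = 100 * 1024           # 100 KB, worst case

per_second = writes * write_size  # if 200 writes land every single second
per_window = per_second / 15      # if 200 writes land every 15 s

print(per_second / 1e6)  # ~20 MB/s, Marcelo's worst case
print(per_window / 1e6)  # ~1.4 MB/s, spread over the 15 s window
```

[Either reading is modest for a 5-node SSD cluster, which is consistent
with the suspicion that raw write volume is not what fills the heap.]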
>>>>
>>>> But did you check the logs for the GCInspector?
>>>> In my case, nodes are going down because of the heap; in your case,
>>>> maybe it's something else.
>>>> Do you see increasing GC times when you grep for GCInspector in the logs?
>>>>
>>>> []s
>>>>
>>>>
>>>>
>>>> 2014-06-20 14:51 GMT-03:00 Pavel Kogan <pavel.kogan@cortica.com>:
>>>>
>>>> Hi Marcelo,
>>>>>
>>>>> No pending write tasks. I am writing a lot: about 100-200 writes, each
>>>>> up to 100 KB, every 15 s.
>>>>> It is running on a decent cluster of 5 identical nodes, quad-core i7s
>>>>> with 32 GB RAM and 480 GB SSDs.
>>>>>
>>>>> Regards,
>>>>>   Pavel
>>>>>
>>>>>
>>>>> On Fri, Jun 20, 2014 at 12:31 PM, Marcelo Elias Del Valle <
>>>>> marcelo@s1mbi0se.com.br> wrote:
>>>>>
>>>>>> Pavel,
>>>>>>
>>>>>> In my case, the heap was filling up faster than it was draining. I am
>>>>>> still looking for the cause of it, as I should be able to drain really
>>>>>> fast with SSDs.
>>>>>>
>>>>>> However, in your case you could check (AFAIK) nodetool tpstats and
>>>>>> see if there are too many pending write tasks, for instance. Maybe you
>>>>>> really are writing more than the nodes are able to flush to disk.
>>>>>>
>>>>>> How many writes per second are you achieving?
>>>>>>
>>>>>> Also, I would look for GCInspector in the log:
>>>>>>
>>>>>> cat system.log* | grep GCInspector | wc -l
>>>>>> tail -1000 system.log | grep GCInspector
>>>>>>
>>>>>> Do you see it running a lot? Is it taking much more time to run each
>>>>>> time it runs?
>>>>>>
>>>>>> I am no Cassandra expert, but I would try these things first and post
>>>>>> the results here. Maybe other people on the list have more ideas.
>>>>>>
>>>>>> Best regards,
>>>>>> Marcelo.
>>>>>>
>>>>>>
>>>>>> 2014-06-20 8:50 GMT-03:00 Pavel Kogan <pavel.kogan@cortica.com>:
>>>>>>
>>>>>>> The cluster is new, so no updates were done. Version 2.0.8.
>>>>>>> It happened when I did many writes (no reads). Writes are done in
>>>>>>> small batches of 2 inserts (writing to 2 column families). The values
>>>>>>> are big blobs (up to 100 KB).
>>>>>>>
>>>>>>> Any clues?
>>>>>>>
>>>>>>> Pavel
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Jun 19, 2014 at 8:07 PM, Marcelo Elias Del Valle <
>>>>>>> marcelo@s1mbi0se.com.br> wrote:
>>>>>>>
>>>>>>>> Pavel,
>>>>>>>>
>>>>>>>> Out of curiosity, did it start to happen after some update? Which
>>>>>>>> version of Cassandra are you using?
>>>>>>>>
>>>>>>>> []s
>>>>>>>>
>>>>>>>>
>>>>>>>> 2014-06-19 16:10 GMT-03:00 Pavel Kogan <pavel.kogan@cortica.com>:
>>>>>>>>
>>>>>>>>> What a coincidence! Today it happened in my cluster of 7 nodes as
>>>>>>>>> well.
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>>   Pavel
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Wed, Jun 18, 2014 at 11:13 AM, Marcelo Elias Del Valle <
>>>>>>>>> marcelo@s1mbi0se.com.br> wrote:
>>>>>>>>>
>>>>>>>>>> I have a 10 node cluster with cassandra 2.0.8.
>>>>>>>>>>
>>>>>>>>>> I am getting these exceptions in the log when I run my code. What
>>>>>>>>>> my code does is just read data from a CF, and in some cases it
>>>>>>>>>> writes new data.
>>>>>>>>>>
>>>>>>>>>>  WARN [Native-Transport-Requests:553] 2014-06-18 11:04:51,391
>>>>>>>>>> BatchStatement.java (line 228) Batch of prepared statements for
>>>>>>>>>> [identification1.entity, identification1.entity_lookup] is of size
>>>>>>>>>> 6165, exceeding specified threshold of 5120 by 1045.
>>>>>>>>>>  WARN [Native-Transport-Requests:583] 2014-06-18 11:05:01,152
>>>>>>>>>> BatchStatement.java (line 228) Batch of prepared statements for
>>>>>>>>>> [identification1.entity, identification1.entity_lookup] is of size
>>>>>>>>>> 21266, exceeding specified threshold of 5120 by 16146.
>>>>>>>>>>  WARN [Native-Transport-Requests:581] 2014-06-18 11:05:20,229
>>>>>>>>>> BatchStatement.java (line 228) Batch of prepared statements for
>>>>>>>>>> [identification1.entity, identification1.entity_lookup] is of size
>>>>>>>>>> 22978, exceeding specified threshold of 5120 by 17858.
>>>>>>>>>>  INFO [MemoryMeter:1] 2014-06-18 11:05:32,682 Memtable.java (line
>>>>>>>>>> 481) CFS(Keyspace='OpsCenter', ColumnFamily='rollups300') liveRatio
>>>>>>>>>> is 14.249755859375 (just-counted was 9.85302734375). calculation
>>>>>>>>>> took 3ms for 1024 cells
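[As a sanity check, the "exceeding ... by" figure in each warning is simply
the batch size minus the fixed 5120-byte threshold; the batches here are
roughly 1.2x to 4.5x the limit. Pure arithmetic on the numbers already
quoted above:]

```python
# Each WARN line reports a batch size, the fixed threshold, and the excess.
# Verify they are mutually consistent: excess = size - threshold.
THRESHOLD = 5120  # bytes, as logged

for size, excess in [(6165, 1045), (21266, 16146), (22978, 17858)]:
    assert size - THRESHOLD == excess
    print(size, "exceeds the threshold by", size - THRESHOLD, "bytes")
```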
>>>>>>>>>>
>>>>>>>>>> After some time, one node of the cluster goes down. It comes back
>>>>>>>>>> after some seconds and another node goes down. This keeps
>>>>>>>>>> happening, and there is always one node down in the cluster; when
>>>>>>>>>> it comes back, another one falls.
>>>>>>>>>>
>>>>>>>>>> The only exceptions I see in the log are "connection reset by
>>>>>>>>>> peer", which seem to be related to the gossip protocol, when a
>>>>>>>>>> node goes down.
>>>>>>>>>>
>>>>>>>>>> Any hint on what I could do to investigate this problem further?
>>>>>>>>>>
>>>>>>>>>> Best regards,
>>>>>>>>>> Marcelo Valle.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
