cassandra-user mailing list archives

From Brian Spindler <brian.spind...@gmail.com>
Subject Re: Nodes just dieing with OOM
Date Fri, 06 Oct 2017 17:02:45 GMT
Hi Alain, thanks for getting back to me.  I will read through those
articles.

The truncate did solve the problem.
I am using Cassandra 2.1.15.
I'll look at cfstats in more detail. We do have some charting from JVM
metrics.
We're migrating from i2.xl (32 GB RAM, local SSD) to m4.xl (16 GB, gp2), so
we have a mix there. The Cassandra JVM is set to 10 GB.

When I did a truncate, Cassandra did create a snapshot which I'm hoping to
copy over to a developer's machine and find the offending row(s).  If it is
just huge rows, that's probably more of an application leak.

Is 'Compacted partition maximum bytes:' from cfstats the right thing to
look at?
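
For anyone following along, here is a minimal sketch of how that value could be pulled out of saved cfstats output. It assumes the plain-text layout with "Keyspace:", "Table:" and "Compacted partition maximum bytes:" lines; the function name, the 100 MB threshold, and the sample text are made up for illustration and are not real output from our cluster.

```python
def large_partitions(cfstats_text, threshold_bytes=100 * 1024 * 1024):
    """Return (keyspace, table, max_bytes) for each table whose largest
    compacted partition exceeds threshold_bytes, biggest first.

    Hypothetical helper: assumes 'Keyspace:', 'Table:' and
    'Compacted partition maximum bytes:' lines in the cfstats text.
    """
    results = []
    keyspace = None
    table = None
    for raw in cfstats_text.splitlines():
        line = raw.strip()
        if line.startswith("Keyspace:"):
            keyspace = line.split(":", 1)[1].strip()
            table = None
        elif line.startswith("Table:"):
            table = line.split(":", 1)[1].strip()
        elif line.startswith("Compacted partition maximum bytes:"):
            value = int(line.split(":", 1)[1].strip())
            if table is not None and value > threshold_bytes:
                results.append((keyspace, table, value))
    return sorted(results, key=lambda r: r[2], reverse=True)


# Invented sample text, only to show the expected shape of the input.
sample = (
    "Keyspace: app\n"
    "\tTable: events\n"
    "\tCompacted partition maximum bytes: 943127\n"
    "\tTable: sessions\n"
    "\tCompacted partition maximum bytes: 3449259151\n"
)

for ks, tbl, size in large_partitions(sample):
    print(ks, tbl, size)
```

The input could come from something like `nodetool cfstats > cfstats.txt` run on each node, since the value is reported per node.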

Thanks again,
-B

On Fri, Oct 6, 2017 at 10:40 AM Alain RODRIGUEZ <arodrime@gmail.com> wrote:

> Hello Brian.
>
> Sorry to hear that, it sounds like a lot of trouble.
>
> I think we should review this column family design so it doesn't generate
>> so many tombstones?  Could that be the cause?
>
>
> It could be indeed. Did truncating solve the issue?
>
> There are some nicer approaches you can try to handle tombstones correctly
> depending on your use case. I wrote a post and presented a talk about this
> last year, I hope you'll find what you are looking for.
>
> http://thelastpickle.com/blog/2016/07/27/about-deletes-and-tombstones.html
> https://www.youtube.com/watch?v=lReTEcnzl7Y
>
>  What else would you recommend?
>
>
> Well, we don't have much information to go on. But I will try to give you
> relevant clues based on what you gave us so far:
>
> that one column family had some large/corrupt data and causing OOM's
>>
>
> Are you using Cassandra 3.0.x (x < 14)? You might be facing a bug in
> Cassandra corrupting data after schema changes (
> https://issues.apache.org/jira/browse/CASSANDRA-13004).
>
> You can check for large partitions using 'nodetool cfstats' or using
> monitoring and the corresponding metric (per table / column family).
>
> Other than that, what is the memory available, the heap size, and the GC
> type and options in use? Do you see some GC pauses in the logs, or do you
> track this value through a chart using JVM metrics?
>
> C*heers,
>
> -----------------------
> Alain Rodriguez - @arodream - alain@thelastpickle.com
> France / Spain
>
> The Last Pickle - Apache Cassandra Consulting
> http://www.thelastpickle.com
>
>
>
> 2017-10-06 14:48 GMT+01:00 Brian Spindler <brian.spindler@gmail.com>:
>
>> Sorry about that.  We eventually found that one column family had some
>> large/corrupt data that was causing OOMs.
>>
>> Luckily it was a pretty ephemeral data set and we were able to just
>> truncate it.  However, it was a guess based on some log messages about
>> reading a large number of tombstones on that column family.  I think we
>> should review this column family design so it doesn't generate so many
>> tombstones?  Could that be the cause?  What else would you recommend?
>>
>> Thank you in advance.
>>
>> On Fri, Oct 6, 2017 at 6:33 AM Brian Spindler <brian.spindler@gmail.com>
>> wrote:
>>
>>> Hi guys, our cluster - around 18 nodes - just started having nodes die,
>>> and when we restart them they die with OOM.  How can we handle this?
>>> I've tried adding a couple of extra gigs on these machines, but it's not
>>> helping.
>>>
>>> Help!
>>> -B
>>>
>>>
>
