cassandra-user mailing list archives

From DuyHai Doan <doanduy...@gmail.com>
Subject Re: OOM at Bootstrap Time
Date Wed, 29 Oct 2014 19:05:43 GMT
Some ideas:

1) Enable DEBUG logging on the joining node to see in detail what is going on
with the stream of ~1500 files (a minimal sketch follows below)

2) Check the stream ID to see whether it's a new stream or an old one still
pending
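
For example, a minimal sketch of both checks on C* 2.1 (the logger name is the
streaming package; revert the level to INFO once the bootstrap attempt is over):

    nodetool setlogginglevel org.apache.cassandra.streaming DEBUG
    nodetool netstats        # shows the stream session IDs and per-node file counts

The stream ID printed by netstats can be compared against the one in the joining
node's log to tell whether it's a fresh session or an old one still pending.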



On Wed, Oct 29, 2014 at 2:21 AM, Maxime <maximelb@gmail.com> wrote:

> Doan, thanks for the tip, I just read about it this morning; just waiting
> for the new version to pop up on the Debian DataStax repo.
>
> Michael, I do believe you are correct in the general running of the
> cluster and I've reset everything.
>
> So it took me a while to reply, but I finally got the SSTables down, as seen
> in the OpsCenter graphs. I'm stumped, however, because when I bootstrap the
> new node, I still see a very large number of files being streamed (~1500 for
> some nodes) and the bootstrap process is failing exactly as it did before,
> in a flurry of "Enqueuing flush of ..."
>
> Any ideas? I'm reaching the end of what I know I can do. OpsCenter says
> around 32 SSTables per CF, but the bootstrap is still streaming tons of "files". :-/
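>
> As a sanity check on my side (a rough sketch; the keyspace name and data path
> are placeholders for my actual ones), I've been comparing what OpsCenter
> reports with what is actually sitting on disk on each source node:
>
>     nodetool cfstats mykeyspace | grep 'SSTable count'
>     find /var/lib/cassandra/data/mykeyspace -name '*-Data.db' | wc -l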
>
>
> On Mon, Oct 27, 2014 at 1:12 PM, DuyHai Doan <doanduyhai@gmail.com> wrote:
>
>> "Tombstones will be a very important issue for me since the dataset is
>> very much a rolling dataset using TTLs heavily."
>>
>> --> You can try the new DateTiered compaction strategy (
>> https://issues.apache.org/jira/browse/CASSANDRA-6602), released in 2.1.1,
>> if you have a time-series data model; it helps eliminate tombstones
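>>
>> As a sketch, something along these lines (keyspace/table and the age cap are
>> placeholders to tune against your TTLs):
>>
>>     ALTER TABLE mykeyspace.events
>>       WITH compaction = {'class': 'DateTieredCompactionStrategy',
>>                          'max_sstable_age_days': 7};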
>>
>> On Mon, Oct 27, 2014 at 5:47 PM, Laing, Michael <
>> michael.laing@nytimes.com> wrote:
>>
>>> Again, from our experience w 2.0.x:
>>>
>>> Revert to the defaults - you are manually setting heap way too high IMHO.
>>>
>>> On our small nodes we tried LCS - way too much compaction - so we switched
>>> all CFs to STCS.
>>>
>>> We do a major rolling compaction on our small nodes weekly during less
>>> busy hours - works great. Be sure you have enough disk.
>>>
>>> We never explicitly delete and only use TTLs or truncation. You can set GC
>>> grace to 0 in that case (see the sketch below), so tombstones are more
>>> readily expunged. There are a couple of threads on the list that discuss
>>> this... also, normal rolling repair becomes optional, reducing load (still
>>> repair if something unusual happens though...).
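>>>
>>> A minimal sketch of that (table name is a placeholder; only do it on tables
>>> that never see explicit deletes):
>>>
>>>     ALTER TABLE mykeyspace.mytable WITH gc_grace_seconds = 0;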
>>>
>>> In your current situation, you need to kickstart compaction - are there
>>> any CFs you can truncate at least temporarily? Then try compacting a small
>>> CF, then another, etc.
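>>>
>>> For example (names are placeholders, and TRUNCATE is irreversible, so only
>>> on data you can afford to lose or rebuild):
>>>
>>>     TRUNCATE mykeyspace.some_disposable_cf;
>>>     nodetool compact mykeyspace some_small_cf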
>>>
>>> Hopefully you can get enough headroom to add a node.
>>>
>>> ml
>>>
>>>
>>>
>>>
>>> On Sun, Oct 26, 2014 at 6:24 PM, Maxime <maximelb@gmail.com> wrote:
>>>
>>>> Hmm, thanks for the reading.
>>>>
>>>> I initially followed some (perhaps too old) maintenance scripts, which
>>>> included weekly 'nodetool compact'. Is there a way for me to undo the
>>>> damage? Tombstones will be a very important issue for me since the dataset
>>>> is very much a rolling dataset using TTLs heavily.
>>>>
>>>> On Sun, Oct 26, 2014 at 6:04 PM, DuyHai Doan <doanduyhai@gmail.com>
>>>> wrote:
>>>>
>>>>> "Should doing a major compaction on those nodes lead to a restructuration
>>>>> of the SSTables?" --> Beware of the major compaction on SizeTiered,
it will
>>>>> create 2 giant SSTables and the expired/outdated/tombstone columns in
this
>>>>> big file will be never cleaned since the SSTable will never get a chance
to
>>>>> be compacted again
>>>>>
>>>>> Essentially, to reduce the fragmentation of small SSTables you can stay
>>>>> with SizeTiered compaction and play around with the compaction properties
>>>>> (the thresholds) to make C* group a bunch of files each time it compacts,
>>>>> so that the file count shrinks to a reasonable number (see the sketch below)
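>>>>>
>>>>> A sketch of the kind of tuning I mean (table name and values are
>>>>> placeholders; the SizeTiered max_threshold defaults to 32):
>>>>>
>>>>>     ALTER TABLE mykeyspace.events
>>>>>       WITH compaction = {'class': 'SizeTieredCompactionStrategy',
>>>>>                          'min_threshold': 4, 'max_threshold': 64};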
>>>>>
>>>>> Since you're using C* 2.1, where anti-compaction has been introduced, I
>>>>> hesitate to advise you to use Leveled compaction as a work-around to
>>>>> reduce the SSTable count.
>>>>>
>>>>> Things are a little bit more complicated because of the incremental
>>>>> repair process (I don't know whether you're using incremental repair or
>>>>> not in production). The Dev blog says that Leveled compaction is performed
>>>>> only on repaired SSTables, the un-repaired ones still use SizeTiered; more
>>>>> details here:
>>>>> http://www.datastax.com/dev/blog/anticompaction-in-cassandra-2-1
>>>>>
>>>>> Regards
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Sun, Oct 26, 2014 at 9:44 PM, Jonathan Haddad <jon@jonhaddad.com>
>>>>> wrote:
>>>>>
>>>>>> If the issue is related to I/O, you're going to want to determine if
>>>>>> you're saturated.  Take a look at `iostat -dmx 1`; you'll see avgqu-sz
>>>>>> (queue size) and svctm (service time). The higher those numbers are,
>>>>>> the more overwhelmed your disk is.
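>>>>>>
>>>>>> You can also point it at just the device backing the data directory
>>>>>> (device name below is a placeholder):
>>>>>>
>>>>>>     iostat -dmx vda 1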
>>>>>>
>>>>>> On Sun, Oct 26, 2014 at 12:01 PM, DuyHai Doan <doanduyhai@gmail.com>
>>>>>> wrote:
>>>>>> > Hello Maxime
>>>>>> >
>>>>>> > Increasing the flush writers won't help if your disk I/O is not
>>>>>> > keeping up.
>>>>>> >
>>>>>> > I've had a look into the log file, below are some remarks:
>>>>>> >
>>>>>> > 1) There are a lot of SSTables on disk for some tables (events for
>>>>>> > example, but not only). I've seen that some compactions are taking up
>>>>>> > to 32 SSTables (which corresponds to the default max value for
>>>>>> > SizeTiered compaction).
>>>>>> >
>>>>>> > 2) There is a secondary index that I found suspicious: loc.loc_id_idx.
>>>>>> > As its name implies, I have the impression that it's an index on the id
>>>>>> > of the loc, which would lead to almost a 1-1 relationship between the
>>>>>> > indexed value and the original loc. Such an index should be avoided
>>>>>> > because it does not perform well. If it's not an index on the loc_id,
>>>>>> > please disregard my remark
>>>>>> >
>>>>>> > 3) There is a clear imbalance of SSTable count on some nodes. In the
>>>>>> > log, I saw:
>>>>>> >
>>>>>> > INFO  [STREAM-IN-/xxxx.xxxx.xxxx.20] 2014-10-25 02:21:43,360
>>>>>> > StreamResultFuture.java:166 - [Stream #a6e54ea0-5bed-11e4-8df5-f357715e1a79
>>>>>> > ID#0] Prepare completed. Receiving 163 files(4 111 187 195 bytes),
>>>>>> > sending 0 files(0 bytes)
>>>>>> >
>>>>>> > INFO  [STREAM-IN-/xxxx.xxxx.xxxx.81] 2014-10-25 02:21:46,121
>>>>>> > StreamResultFuture.java:166 - [Stream #a6e54ea0-5bed-11e4-8df5-f357715e1a79
>>>>>> > ID#0] Prepare completed. Receiving 154 files(3 332 779 920 bytes),
>>>>>> > sending 0 files(0 bytes)
>>>>>> >
>>>>>> > INFO  [STREAM-IN-/xxxx.xxxx.xxxx.71] 2014-10-25 02:21:50,494
>>>>>> > StreamResultFuture.java:166 - [Stream #a6e54ea0-5bed-11e4-8df5-f357715e1a79
>>>>>> > ID#0] Prepare completed. Receiving 1315 files(4 606 316 933 bytes),
>>>>>> > sending 0 files(0 bytes)
>>>>>> >
>>>>>> > INFO  [STREAM-IN-/xxxx.xxxx.xxxx.217] 2014-10-25 02:21:51,036
>>>>>> > StreamResultFuture.java:166 - [Stream #a6e54ea0-5bed-11e4-8df5-f357715e1a79
>>>>>> > ID#0] Prepare completed. Receiving 1640 files(3 208 023 573 bytes),
>>>>>> > sending 0 files(0 bytes)
>>>>>> >
>>>>>> > As you can see, the existing 4 nodes are streaming data to the new
>>>>>> > node and on average the data set size is about 3.3 - 4.5 GB. However,
>>>>>> > the number of SSTables is around 150 files for nodes xxxx.xxxx.xxxx.20
>>>>>> > and xxxx.xxxx.xxxx.81 but goes through the roof to reach 1315 files for
>>>>>> > xxxx.xxxx.xxxx.71 and 1640 files for xxxx.xxxx.xxxx.217
>>>>>> >
>>>>>> > The total data set size is roughly the same but the file number is
>>>>>> > x10, which means that you'll have a bunch of tiny files.
>>>>>> >
>>>>>> > I guess that upon reception of those files, there will be a massive
>>>>>> > flush to disk, explaining the behaviour you're facing (flush storm)
>>>>>> >
>>>>>> > I would suggest looking at nodes xxxx.xxxx.xxxx.71 and
>>>>>> > xxxx.xxxx.xxxx.217 to check the total SSTable count for each table to
>>>>>> > confirm this intuition
>>>>>> >
>>>>>> > Regards
>>>>>> >
>>>>>> >
>>>>>> > On Sun, Oct 26, 2014 at 4:58 PM, Maxime <maximelb@gmail.com> wrote:
>>>>>> >>
>>>>>> >> I've emailed you a raw log file of an instance of this happening.
>>>>>> >>
>>>>>> >> I've been monitoring more closely the timing of events in tpstats
>>>>>> >> and the logs and I believe this is what is happening:
>>>>>> >>
>>>>>> >> - For some reason, C* decides to provoke a flush storm (I say "some
>>>>>> >> reason" because I'm sure there is one, but I have had difficulty
>>>>>> >> determining how the behaviour changed between 1.* and more recent
>>>>>> >> releases).
>>>>>> >> - So we see ~3000 flushes being enqueued.
>>>>>> >> - This happens so suddenly that even boosting the number of flush
>>>>>> >> writers to 20 does not suffice. I don't even see "all time blocked"
>>>>>> >> numbers for it before C* stops responding. I suspect this is due to
>>>>>> >> the sudden OOM and GC occurring.
>>>>>> >> - The last tpstats that comes back before the node goes down indicates
>>>>>> >> 20 active and 3000 pending and the rest 0. It's by far the anomalous
>>>>>> >> activity.
>>>>>> >>
>>>>>> >> Is there a way to throttle down this generation of flushes? C*
>>>>>> >> complains if I set the queue_size to any value (deprecated now?) and
>>>>>> >> boosting the threads does not seem to help, since even at 20 we're an
>>>>>> >> order of magnitude off.
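>>>>>> >>
>>>>>> >> For reference, these are the 2.1 memtable/flush knobs in cassandra.yaml
>>>>>> >> I've been experimenting with (values shown are my current guesses, not
>>>>>> >> recommendations):
>>>>>> >>
>>>>>> >>     memtable_flush_writers: 20        # bumped from the default
>>>>>> >>     memtable_heap_space_in_mb: 2048   # unset, it defaults to 1/4 of the heap
>>>>>> >>     memtable_cleanup_threshold: 0.11  # defaults to 1/(flush_writers + 1)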
>>>>>> >>
>>>>>> >> Suggestions? Comments?
>>>>>> >>
>>>>>> >>
>>>>>> >> On Sun, Oct 26, 2014 at 2:26 AM, DuyHai Doan <doanduyhai@gmail.com> wrote:
>>>>>> >>>
>>>>>> >>> Hello Maxime
>>>>>> >>>
>>>>>> >>> Can you put the complete logs and config somewhere? It would be
>>>>>> >>> interesting to know what the cause of the OOM is.
>>>>>> >>>
>>>>>> >>>> On Sun, Oct 26, 2014 at 3:15 AM, Maxime <maximelb@gmail.com> wrote:
>>>>>> >>>>
>>>>>> >>>> Thanks a lot, that is comforting. We are also small at the moment,
>>>>>> >>>> so I can definitely relate to the idea of keeping things small and
>>>>>> >>>> simple, at a level where it just works.
>>>>>> >>>>
>>>>>> >>>> I see the new Apache version has a lot of fixes, so I will try to
>>>>>> >>>> upgrade before I look into downgrading.
>>>>>> >>>>
>>>>>> >>>>
>>>>>> >>>> On Saturday, October 25, 2014, Laing, Michael
>>>>>> >>>> <michael.laing@nytimes.com> wrote:
>>>>>> >>>>>
>>>>>> >>>>> Since no one else has stepped in...
>>>>>> >>>>>
>>>>>> >>>>> We have run clusters with ridiculously small nodes - I have a
>>>>>> >>>>> production cluster in AWS with 4GB nodes, each with 1 CPU and
>>>>>> >>>>> disk-based instance storage. It works fine but you can see those
>>>>>> >>>>> little puppies struggle...
>>>>>> >>>>>
>>>>>> >>>>> And I ran into problems such as you observe...
>>>>>> >>>>>
>>>>>> >>>>> Upgrading Java to the latest 1.7 and - most importantly - reverting
>>>>>> >>>>> to the default configuration, esp. for heap, seemed to settle things
>>>>>> >>>>> down completely. Also make sure that you are using the 'recommended
>>>>>> >>>>> production settings' from the docs on your boxen.
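>>>>>> >>>>>
>>>>>> >>>>> For instance, a couple of the OS-level settings usually called out
>>>>>> >>>>> there (just a sketch - the full list is in the docs):
>>>>>> >>>>>
>>>>>> >>>>>     sudo swapoff -a                          # disable swap
>>>>>> >>>>>     sudo sysctl -w vm.max_map_count=131072   # more memory maps for SSTables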
>>>>>> >>>>>
>>>>>> >>>>> However we are running 2.0.x not 2.1.0 so YMMV.
>>>>>> >>>>>
>>>>>> >>>>> And we are switching to 15GB nodes w 2 heftier CPUs each and SSD
>>>>>> >>>>> storage - still a 'small' machine, but much more reasonable for C*.
>>>>>> >>>>>
>>>>>> >>>>> However I can't say I am an expert, since I deliberately keep
>>>>>> >>>>> things so simple that we do not encounter problems - it just works,
>>>>>> >>>>> so I dig into other stuff.
>>>>>> >>>>>
>>>>>> >>>>> ml
>>>>>> >>>>>
>>>>>> >>>>>
>>>>>> >>>>> On Sat, Oct 25, 2014 at 5:22 PM, Maxime <maximelb@gmail.com> wrote:
>>>>>> >>>>>>
>>>>>> >>>>>> Hello, I've been trying to add a new node to my cluster (4 nodes)
>>>>>> >>>>>> for a few days now.
>>>>>> >>>>>>
>>>>>> >>>>>> I started by adding a node similar to my current configuration,
>>>>>> >>>>>> 4 GB of RAM + 2 Cores on DigitalOcean. However, every time I would
>>>>>> >>>>>> end up getting OOM errors after many log entries of the type:
>>>>>> >>>>>>
>>>>>> >>>>>> INFO  [SlabPoolCleaner] 2014-10-25 13:44:57,240
>>>>>> >>>>>> ColumnFamilyStore.java:856 - Enqueuing flush of mycf: 5383 (0%)
>>>>>> >>>>>> on-heap, 0 (0%) off-heap
>>>>>> >>>>>>
>>>>>> >>>>>> leading to:
>>>>>> >>>>>>
>>>>>> >>>>>> ka-120-Data.db (39291 bytes) for commitlog position
>>>>>> >>>>>> ReplayPosition(segmentId=1414243978538, position=23699418)
>>>>>> >>>>>> WARN  [SharedPool-Worker-13] 2014-10-25 13:48:18,032
>>>>>> >>>>>> AbstractTracingAwareExecutorService.java:167 - Uncaught exception
>>>>>> >>>>>> on thread Thread[SharedPool-Worker-13,5,main]: {}
>>>>>> >>>>>> java.lang.OutOfMemoryError: Java heap space
>>>>>> >>>>>>
>>>>>> >>>>>> Thinking it had to do with either compaction somehow or streaming
>>>>>> >>>>>> (2 activities I've had tremendous issues with in the past), I tried
>>>>>> >>>>>> to slow down the setstreamthroughput to extremely low values, all
>>>>>> >>>>>> the way to 5. I also tried setting setcompactionthroughput to 0,
>>>>>> >>>>>> and then, reading that in some cases it might be too fast, down to
>>>>>> >>>>>> 8. Nothing worked; it merely vaguely changed the mean time to OOM,
>>>>>> >>>>>> but not in a way indicating either was anywhere near a solution.
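>>>>>> >>>>>>
>>>>>> >>>>>> For reference, the exact commands I was running on the existing
>>>>>> >>>>>> nodes (stream throughput is in megabits/s, compaction throughput
>>>>>> >>>>>> in MB/s):
>>>>>> >>>>>>
>>>>>> >>>>>>     nodetool setstreamthroughput 5
>>>>>> >>>>>>     nodetool setcompactionthroughput 8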
>>>>>> >>>>>>
>>>>>> >>>>>> The nodes were configured with 2 GB of heap initially; I tried to
>>>>>> >>>>>> crank it up to 3 GB, stressing the host memory to its limit.
>>>>>> >>>>>>
>>>>>> >>>>>> After doing some exploration (I am considering writing some
>>>>>> >>>>>> Cassandra Ops documentation with lessons learned, since there seems
>>>>>> >>>>>> to be little of it in organized fashion), I read that some people
>>>>>> >>>>>> had strange issues on lower-end boxes like that, so I bit the
>>>>>> >>>>>> bullet and upgraded my new node to an 8GB + 4 Core instance, which
>>>>>> >>>>>> was anecdotally better.
>>>>>> >>>>>>
>>>>>> >>>>>> To my complete shock, the exact same issues are present, even
>>>>>> >>>>>> after raising the heap to 6 GB. I figure it can't be a "normal"
>>>>>> >>>>>> situation anymore, but must be a bug somehow.
>>>>>> >>>>>>
>>>>>> >>>>>> My cluster is 4 nodes, RF of 2, about 160 GB of data across all
>>>>>> >>>>>> nodes. About 10 CFs of varying sizes. Runtime writes are between
>>>>>> >>>>>> 300 and 900 / second. Cassandra 2.1.0, nothing too wild.
>>>>>> >>>>>>
>>>>>> >>>>>> Has anyone encountered these kinds of issues before? I would
>>>>>> >>>>>> really enjoy hearing about the experiences of people trying to run
>>>>>> >>>>>> small-sized clusters like mine. From everything I read, Cassandra
>>>>>> >>>>>> operations go very well on large (16 GB + 8 Cores) machines, but
>>>>>> >>>>>> I'm sad to report I've had nothing but trouble trying to run on
>>>>>> >>>>>> smaller machines; perhaps I can learn from others' experience?
>>>>>> >>>>>>
>>>>>> >>>>>> Full logs can be provided to anyone interested.
>>>>>> >>>>>>
>>>>>> >>>>>> Cheers
>>>>>> >>>>>
>>>>>> >>>>>
>>>>>> >>>
>>>>>> >>
>>>>>> >
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Jon Haddad
>>>>>> http://www.rustyrazorblade.com
>>>>>> twitter: rustyrazorblade
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>
