aurora-dev mailing list archives

From: Maxim Khutornenko <ma...@apache.org>
Subject: Re: aurora replica log snapshot interval
Date: Tue, 02 Jun 2015 17:25:25 GMT

Hi Bhuvan,

We have never had to change native_log_write_timeout from its
default value, but we have definitely seen scheduler failovers
related to snapshotting. Snapshotting is indeed an IO-intensive
operation that can and will block all other activity, especially when
it overlaps with backup creation. During snapshot creation an
exclusive write lock is held, making all other mutating operations
impossible. Reads may still be served, though.

I would suggest a more thorough investigation to make sure it was
truly a native_log_write timeout that caused your failover.
Identifying the root cause is crucial here as we have seen two major
causes for failovers: excessive GC activity leading to ZK timeouts and
slow disk IO blocking writes in underlying native log storage. Below
are a few leads:

Excessive GC:
- consider using snapshot de-duplication [1] if you are not already
using it. This has helped us significantly reduce GC activity and
stored snapshot size.
- consider fine-tuning your GC settings. It's not an easy task, but
there are plenty of online resources to help (e.g. [2]).

Excessive IO:
- consider changing your underlying system IO scheduler. By just
switching from cfq to deadline we have virtually eliminated our
failovers due to excessive IO. See AURORA-1211 for details.
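
To check which IO scheduler each block device is currently using, a
minimal sketch along these lines may help. It assumes a Linux host
that exposes /sys/block/<device>/queue/scheduler, where the active
scheduler is shown in square brackets (e.g. "noop deadline [cfq]").
The function names are illustrative only and are not part of Aurora.

```python
from pathlib import Path
from typing import Dict, Optional


def parse_active_scheduler(text: str) -> Optional[str]:
    """Return the scheduler name shown in [brackets], or None if absent."""
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    return None


def report(sys_block: Path = Path("/sys/block")) -> Dict[str, Optional[str]]:
    """Map each block device name to its active IO scheduler.

    Assumes the Linux sysfs layout; returns an empty dict elsewhere.
    """
    result: Dict[str, Optional[str]] = {}
    if not sys_block.is_dir():
        return result
    for dev in sorted(sys_block.iterdir()):
        sched_file = dev / "queue" / "scheduler"
        try:
            result[dev.name] = parse_active_scheduler(sched_file.read_text())
        except OSError:
            # Device has no scheduler file or it is unreadable; skip it.
            continue
    return result


if __name__ == "__main__":
    for name, sched in report().items():
        print(f"{name}: {sched}")
```

Switching is done by writing the scheduler name to the same file as
root, e.g. `echo deadline > /sys/block/sda/queue/scheduler`, where sda
is a placeholder for the device backing the native log. A change made
this way does not persist across reboots.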

Thanks,
Maxim

[1] - https://github.com/apache/aurora/blob/master/docs/scheduler-storage.md
[2] - http://www.cubrid.org/blog/dev-platform/how-to-tune-java-garbage-collection/

On Tue, Jun 2, 2015 at 9:33 AM, Bhuvan Arumugam <bhuvan@apache.org> wrote:
> Hello,
>
> In a 300-node cluster with 5 schedulers in the quorum, the replica log
> writes fail due to timeout (native_log_write_timeout: 3secs),
> especially when 50+ tasks are flapping. The next leader takes around
> 2mins+ to complete the log replay and become active. The service is
> inaccessible to users, as aurora isn't yet listening on the port.
> Users face 503 errors. Why? The snapshot wasn't taken during the last
> few hours because the crash happened within the configured snapshot
> interval (default: 1 hour).
>
> We bumped the log write timeout and in parallel are investigating the
> reason for the timeout, whether it's due to bad hardware, etc. In the
> meantime, we want to reduce service disruption to the users by
> bringing down the replay time. I'd like to know:
>
> a) is reducing the snapshot interval (dlog_snapshot_interval) to
> 30 mins the right thing to do?
> b) is the snapshot event i/o intensive?
> c) it takes 0-6 seconds to snapshot 10k events since the last
> snapshot. does the scheduler block user requests when a snapshot is
> in progress?
>
> Thank you,
> --
> Regards,
> Bhuvan Arumugam
> www.livecipher.com
