cassandra-commits mailing list archives

From "Ben Chan (JIRA)" <>
Subject [jira] [Commented] (CASSANDRA-5483) Repair tracing
Date Thu, 10 Jul 2014 15:01:20 GMT


Ben Chan commented on CASSANDRA-5483:

A few things to note about 5483-16:

Just to emphasize: these are corner cases and deoptimizations that may never matter in
practice. Even if the decision is not to fix them, I wanted to note their existence.

If remote nodes have different config values for {{tracetype_repair_ttl}}, then remote traces
will end up with a different TTL. Having the trace for a given session partially disappear
may be surprising to the user.

The approach 5483-15 took was to use the TTL configured on the node being repaired.
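One way to keep the TTLs consistent would be to ship the coordinator's configured TTL along with the trace session, so replicas expire their trace rows at the same time regardless of their local config. A minimal sketch of the idea; the class and field names here are illustrative, not Cassandra's actual API:

```java
import java.util.UUID;

// Hypothetical sketch: carry the coordinator's configured TTL in the
// repair message so every replica uses the same expiry for trace rows,
// instead of consulting its own tracetype_repair_ttl.
final class TraceSessionHeader {
    private final UUID sessionId;
    private final int traceTtlSeconds; // coordinator's tracetype_repair_ttl

    TraceSessionHeader(UUID sessionId, int traceTtlSeconds) {
        this.sessionId = sessionId;
        this.traceTtlSeconds = traceTtlSeconds;
    }

    // Remote nodes would use this value instead of their local config.
    int ttlSeconds() {
        return traceTtlSeconds;
    }
}
```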

I don't know whether this will ever arise in practice, but I thought I'd note: if server versions
are ever mixed, then an unknown trace type could be sent over the wire, causing an out-of-bounds
array access during deserialization.

The approach 5483-15 took was almost a no-op on the remote end (the tracing session was created,
but nothing was traced). This is still not quite right; better would be to log an error and
not create the session at all.

It might still make sense to create the session anyway in the case of A->B->C
tracestate propagation, but that might be stretching things.

Sending an activity notification for every local repair trace restarts the exponential-backoff
polling cycle more often than necessary.

Of course, I can't say whether this extra resource usage will actually matter in practice
(only practice can tell).
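For reference, the behavior in question is roughly this: each notification resets the poll interval to its minimum, and each quiet poll doubles it up to a cap. A minimal sketch (constants and names are illustrative, not the patch's actual values):

```java
// Minimal sketch of the exponential-backoff polling described above:
// every activity notification resets the wait to the minimum, and each
// poll that finds nothing doubles it up to a cap. Frequent notifications
// keep restarting the cycle at MIN_WAIT_MS, which is the extra work
// noted above.
final class BackoffPoller {
    private static final long MIN_WAIT_MS = 100;
    private static final long MAX_WAIT_MS = 4000;
    private long waitMs = MIN_WAIT_MS;

    // Called on every trace-activity notification.
    void onActivity() {
        waitMs = MIN_WAIT_MS;
    }

    // Called before each poll; returns the current wait and doubles it.
    long nextWaitMs() {
        long current = waitMs;
        waitMs = Math.min(waitMs * 2, MAX_WAIT_MS);
        return current;
    }
}
```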


Okay, so what you meant was to change {{tracetype}} to an actual Enum type (I can be a little
dense sometimes, and code leaves far less room for ambiguity).

One thing I had in mind for bitmasks was flexible subsetting of traces and notifications.
Tracing may never need trace levels or flexible subsetting, but I thought I'd note the
idea for future reference.

// pseudocode; please excuse the bad naming sense.
REPAIR_0 = 0b001 << MIN_SHIFT;
REPAIR_1 = 0b010 << MIN_SHIFT;
REPAIR_2 = 0b100 << MIN_SHIFT;

// read "zero and up", "one and up", etc.
REPAIR_0U = REPAIR_0 | REPAIR_1 | REPAIR_2;
REPAIR_1U =            REPAIR_1 | REPAIR_2;

NOTIFY_0 = 0b001;
NOTIFY_1 = 0b010;
NOTIFY_2 = 0b100;
NOTIFY_0U = NOTIFY_0 | NOTIFY_1 | NOTIFY_2;
NOTIFY_1U =            NOTIFY_1 | NOTIFY_2;

val = config.get_val();

trace(REPAIR_0, "only traced for REPAIR_0; superseded in the more verbose traces");
trace(REPAIR_0U, "always traced for repair");
trace(REPAIR_1U, "traced at REPAIR_1 and up");
trace(REPAIR_2 | val, "only at REPAIR_2, or during the tracetypes specified in val");
trace(REPAIR_1U & ~val, "only at REPAIR_1 and up, excluding the tracetypes specified in val");
trace(REPAIR_2 | REPAIR_0, "for some reason, we don't want to trace this at REPAIR_1");

// you could even have semi-orthogonal subsetting and levels
trace(REPAIR_0U | NOTIFY_1, "notify at NOTIFY_1 only");
trace(REPAIR_1U | NOTIFY_0U, "notify at all NOTIFY levels, but only when tracing REPAIR_1 and up");

awaitNotification(NOTIFY_1U); // notify me for NOTIFY_1 and up.

awaitNotification(NOTIFY_1U); // notify me for NOTIFY_1 and up.

I believe log4j does something similar with what they call markers.

Of course, you can do this with Enum and EnumSet too (though more verbosely).
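The Enum/EnumSet version of the same subsetting idea might look like this (a sketch; the enum and method names are illustrative, not from the patch):

```java
import java.util.EnumSet;

// Sketch of the bitmask subsetting above expressed with Enum/EnumSet:
// "REPAIR_1 and up" becomes a set instead of REPAIR_1 | REPAIR_2.
enum TraceLevel { REPAIR_0, REPAIR_1, REPAIR_2 }

final class Levels {
    // EnumSet.range is inclusive on both ends.
    static final EnumSet<TraceLevel> REPAIR_1U =
            EnumSet.range(TraceLevel.REPAIR_1, TraceLevel.REPAIR_2);
    static final EnumSet<TraceLevel> REPAIR_0U =
            EnumSet.allOf(TraceLevel.class);

    // A trace call fires when the configured level is in the requested set;
    // "excluding val" would be EnumSet.complementOf / removeAll.
    static boolean shouldTrace(EnumSet<TraceLevel> requested, TraceLevel configured) {
        return requested.contains(configured);
    }
}
```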


Aside from lyubent's nit (and possibly the issues in section 1), are there any other issues
that prevent this patch from being minimally useful?

Or even better, is there at least an alpha period where {{trunk}} is still allowed to make
breaking changes to the api, protocol, etc? It's hard to foresee everything.

> Repair tracing
> --------------
>                 Key: CASSANDRA-5483
>                 URL:
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Tools
>            Reporter: Yuki Morishita
>            Assignee: Ben Chan
>            Priority: Minor
>              Labels: repair
>         Attachments: 5483-full-trunk.txt, 5483-v06-04-Allow-tracing-ttl-to-be-configured.patch, 5483-v06-06-Fix-interruption-in-tracestate-propagation.patch,
5483-v07-07-Better-constructor-parameters-for-DebuggableThreadPoolExecutor.patch, 5483-v07-08-Fix-brace-style.patch,
5483-v07-09-Add-trace-option-to-a-more-complete-set-of-repair-functions.patch, 5483-v07-10-Correct-name-of-boolean-repairedAt-to-fullRepair.patch,
5483-v08-11-Shorten-trace-messages.-Use-Tracing-begin.patch, 5483-v08-12-Trace-streaming-in-Differencer-StreamingRepairTask.patch,
5483-v08-15-Limit-trace-notifications.-Add-exponential-backoff.patch, 5483-v09-16-Fix-hang-caused-by-incorrect-exit-code.patch,
5483-v10-17-minor-bugfixes-and-changes.patch, 5483-v10-rebased-and-squashed-471f5cc.patch,
5483-v11-01-squashed.patch, 5483-v11-squashed-nits.patch, 5483-v12-02-cassandra-yaml-ttl-doc.patch,
5483-v13-608fb03-May-14-trace-formatting-changes.patch, 5483-v14-01-squashed.patch, 5483-v15-02-Hook-up-exponential-backoff-functionality.patch,
5483-v15-03-Exact-doubling-for-exponential-backoff.patch, 5483-v15-04-Re-add-old-StorageService-JMX-signatures.patch,
5483-v15-05-Move-command-column-to-system_traces.sessions.patch, 5483-v15.patch, ccm-repair-test,
cqlsh-left-justify-text-columns.patch, prerepair-vs-postbuggedrepair.diff, test-5483-system_traces-events.txt,
trunk@4620823-5483-v02-0001-Trace-filtering-and-tracestate-propagation.patch, trunk@4620823-5483-v02-0002-Put-a-few-traces-parallel-to-the-repair-logging.patch,
trunk@8ebeee1-5483-v01-001-trace-filtering-and-tracestate-propagation.txt, trunk@8ebeee1-5483-v01-002-simple-repair-tracing.txt,
v02p02-5483-v03-0003-Make-repair-tracing-controllable-via-nodetool.patch, v02p02-5483-v04-0003-This-time-use-an-EnumSet-to-pass-boolean-repair-options.patch,
> I think it would be nice to log repair stats and results the way query tracing stores traces
to the system keyspace. With it, you don't have to look up each log file to see what the status
was and how the repair you invoked performed. Instead, you can query the repair log with a session
ID to see the state and stats of all nodes involved in that repair session.

This message was sent by Atlassian JIRA
