cassandra-commits mailing list archives

From "Ben Chan (JIRA)" <>
Subject [jira] [Commented] (CASSANDRA-5483) Repair tracing
Date Wed, 11 Jun 2014 15:46:02 GMT


Ben Chan commented on CASSANDRA-5483:

I'm sorry I can't think this through very deeply at the moment, so please allow a little slack
if I say something incorrect. I'm writing this during a small window of time (where I can't
do anything else) in the middle of a crisis I'm dealing with.

bq. Why is command part of events instead of sessions?

It was a somewhat arbitrary design decision. I can move it over to sessions. The only real
reason was historical (see the Mar 3 comment); it was a proof-of-concept that never got followed
up on until just now.

bq. Also: should use an enum internally. Logging as string representation is fine.

Just to be clear: you mean the Java code should work with an enum, and the actual CQL table
column is fine as a string?

The code actually does use an enum (of sorts; not an Enum proper), the traceType. The traceType
is passed to Tracing#getCommandName to look up the command-name String.
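To make that concrete, here is a minimal sketch of the pattern (illustrative only; these are
not the patch's actual definitions): the Java side passes the enum around, and only its string
form ever reaches the text-typed CQL column.

{code:java}
// Minimal sketch, not the patch's actual code: Java works with an enum;
// only its string representation is written to the CQL column.
public final class Tracing
{
    public enum TraceType { QUERY, REPAIR }

    // Look up the command-name String for a given trace type.
    public static String getCommandName(TraceType type)
    {
        switch (type)
        {
            case QUERY:  return "query";
            case REPAIR: return "repair";
            default:     throw new AssertionError(type);
        }
    }
}
{code}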

bq. It makes people grumpy when we break JMX signatures. Can we add a new overload instead,
preserving the old? This should cut down on some of the code churn in StorageService as well.

I will admit that I didn't really consider the entire ecosystem of tools that use JMX to control
Cassandra (though note that I did mention the JMX API change in a Mar 8 comment: "... the server
isn't going to work with old versions of nodetool. And a newer nodetool still won't work with
older server versions.").
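For what it's worth, here is a sketch of the non-breaking approach (the parameter lists are
assumptions, not the actual StorageServiceMBean signatures): keep the old JMX method intact
and add an overload, so older nodetool binaries keep working against newer servers.

{code:java}
// Sketch only; the real signatures may differ.
public interface StorageServiceMBean
{
    // Old signature, preserved so existing JMX clients keep working.
    int forceRepairAsync(String keyspace, boolean isSequential,
                         boolean isLocal, boolean primaryRange,
                         String... columnFamilies);

    // New overload adding the trace flag; the old method can simply
    // delegate to this one with trace = false, which should also cut
    // down the code churn in StorageService.
    int forceRepairAsync(String keyspace, boolean isSequential,
                         boolean isLocal, boolean primaryRange,
                         boolean trace, String... columnFamilies);
}
{code}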

bq. It's a minor thing to get hung up on, but I'm not wild about all the work needed to propagate
TTLs around. Is it really super important to persist repair traces much longer than queries?
If so, what if we used a separate table and just allowed users to modify the default ttl?
(The original trace code predates default TTLs, I think, or we would have made use of it there.)

I guess the question is how many different types of things (bootstrap, repair, query, decommission,
... anything else?) might eventually end up being traced. If n is small, then having n tables
may be fine.

The logic was this (see the Mar 1-3 discussion): repair can take a long time, so 24 hours may
be too short a TTL.

I recall reading about problematic repairs taking several days, which wouldn't mix well with
a 24-hour TTL.
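To make the separate-table alternative concrete, here is a hypothetical sketch (the table name
and schema are made up for illustration): give repair traces their own table with a generous
default_time_to_live, so users can ALTER the retention without any TTL propagation through
the code.

{code:java}
// Hypothetical schema, not the actual system_traces layout.
final class RepairTraceSchema
{
    static final String REPAIR_EVENTS =
        "CREATE TABLE system_traces.repair_events ("
        + "  session_id uuid,"
        + "  event_id timeuuid,"
        + "  activity text,"
        + "  source inet,"
        + "  thread text,"
        + "  PRIMARY KEY (session_id, event_id)"
        + ") WITH default_time_to_live = 604800"; // 7 days; repairs can run for days

    // Users needing longer retention could then simply run:
    //   ALTER TABLE system_traces.repair_events WITH default_time_to_live = 2592000;
}
{code}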

bq. Also have a nagging feeling that the notify/wait logic in StorageService + TraceState
is more complex than necessary.

If there is guaranteed to be only one consumer of notifications at a time, then the updated
v15 logic seems fine. But if there are ever two traces going on (of the same or different
types; are you allowed to have two simultaneous repairs of different keyspaces?) which
require update notifications, then there could be dropped notifications. The problem (I believe)
is that not all consumers may be in a wait state at the moment notifyAll is signalled,
which means a notification could be missed (see the sketch below). I'm not experienced in Java
concurrency, and this isn't the best time for me to slowly think things through, so it's quite
possible I'm wrong here.
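To illustrate the concern with a generic toy example (this is not the actual TraceState code):
Object.notifyAll() only wakes threads that are already parked in wait(), so unless the event
is also recorded in state that consumers re-check, a consumer that is busy at signal time
misses it entirely.

{code:java}
// Toy illustration of the lost-notification hazard.
class UpdateLatch
{
    private boolean updated = false;

    public synchronized void signalUpdate()
    {
        updated = true;   // record the event so a late-arriving waiter still sees it
        notifyAll();      // wakes only threads currently inside wait()
    }

    public synchronized void awaitUpdate() throws InterruptedException
    {
        while (!updated)  // re-check guards against spurious wakeups
            wait();
        updated = false;  // consume the event
    }
}
{code}

Even with the flag, two consumers racing through awaitUpdate() would contend for a single
event, which is essentially the two-simultaneous-traces scenario above; per-consumer state
would be needed in that case.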

However, if you can be completely sure there will never be concurrent repair traces happening
on the same node, nor any other trace types (whatever types are added in the future) that require
update notifications to implement on-the-fly reporting, then that issue is moot, and v15 should
work fine, as far as my cursory inspection goes.

bq. I should note I've made no attempt to corroborate this behaviour is sensible; I've only
simplified it.

Any feedback would be welcome. As I've said before, heuristics are messy. I talked about the
reasoning behind my design decisions, and a possible alternate implementation (with attendant
tradeoffs), in a Mar 17 comment. I honestly thought I'd get more comments on it at the time,
but it's possible the patch had already gotten into TL;DR territory even then.


Okay, my short break from reality is over. Time to panic.

> Repair tracing
> --------------
>                 Key: CASSANDRA-5483
>                 URL:
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Tools
>            Reporter: Yuki Morishita
>            Assignee: Ben Chan
>            Priority: Minor
>              Labels: repair
>         Attachments: 5483-full-trunk.txt, 5483-v06-04-Allow-tracing-ttl-to-be-configured.patch, 5483-v06-06-Fix-interruption-in-tracestate-propagation.patch,
5483-v07-07-Better-constructor-parameters-for-DebuggableThreadPoolExecutor.patch, 5483-v07-08-Fix-brace-style.patch,
5483-v07-09-Add-trace-option-to-a-more-complete-set-of-repair-functions.patch, 5483-v07-10-Correct-name-of-boolean-repairedAt-to-fullRepair.patch,
5483-v08-11-Shorten-trace-messages.-Use-Tracing-begin.patch, 5483-v08-12-Trace-streaming-in-Differencer-StreamingRepairTask.patch,
5483-v08-15-Limit-trace-notifications.-Add-exponential-backoff.patch, 5483-v09-16-Fix-hang-caused-by-incorrect-exit-code.patch,
5483-v10-17-minor-bugfixes-and-changes.patch, 5483-v10-rebased-and-squashed-471f5cc.patch,
5483-v11-01-squashed.patch, 5483-v11-squashed-nits.patch, 5483-v12-02-cassandra-yaml-ttl-doc.patch,
5483-v13-608fb03-May-14-trace-formatting-changes.patch, 5483-v14-01-squashed.patch, 5483-v15.patch,
ccm-repair-test, cqlsh-left-justify-text-columns.patch, prerepair-vs-postbuggedrepair.diff,
test-5483-system_traces-events.txt, trunk@4620823-5483-v02-0001-Trace-filtering-and-tracestate-propagation.patch,
trunk@4620823-5483-v02-0002-Put-a-few-traces-parallel-to-the-repair-logging.patch, trunk@8ebeee1-5483-v01-001-trace-filtering-and-tracestate-propagation.txt,
trunk@8ebeee1-5483-v01-002-simple-repair-tracing.txt, v02p02-5483-v03-0003-Make-repair-tracing-controllable-via-nodetool.patch,
v02p02-5483-v04-0003-This-time-use-an-EnumSet-to-pass-boolean-repair-options.patch, v02p02-5483-v05-0003-Use-long-instead-of-EnumSet-to-work-with-JMX.patch
> I think it would be nice to log repair stats and results the way query tracing stores traces
to the system keyspace. With it, you don't have to look up each log file to see what the status
was and how the repair you invoked performed. Instead, you can query the repair log with a session
ID to see the state and stats of all nodes involved in that repair session.
