cassandra-commits mailing list archives

From "Ben Chan (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CASSANDRA-5483) Repair tracing
Date Thu, 06 Mar 2014 16:38:43 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-5483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13922721#comment-13922721 ]

Ben Chan commented on CASSANDRA-5483:
-------------------------------------

It was more involved than I thought, partly because of heisenbugs and the trace state mysteriously
not propagating (see {{v06-06}}).

Note: changing JMX can cause mysterious errors unless you do a clean rebuild ({{ant clean && ant}}).
I ran into the same kinds of stack traces as you did. The failure is not consistent: sometimes I can
make a JMX change and run a plain {{ant}} with no problem.

To make applying the patches simpler, I'm posting full repro code. I also tried to simplify the naming.
Unfortunately, all the previous patches are in jumbled order due to a naming convention that
doesn't sort. Fortunately, JIRA seems to have an easter egg where you can choose the attachment's
file name by changing the last component of the URL.
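
The renaming trick can be sketched as a small helper. This is only a sketch: it assumes, as the download loop in the repro code does, that the numeric id alone selects the attachment and that the trailing path component just determines the name {{curl -O}} saves under. {{attachment_url}} is a hypothetical helper name, not part of any patch.

```shell
# Hypothetical helper: build an attachment URL with a chosen file name.
# Assumption: only the numeric id selects the attachment; the last path
# component is free-form and becomes the local file name for `curl -O`.
W=https://issues.apache.org/jira/secure/attachment

attachment_url() {
  # $1 = numeric attachment id, $2 = desired local file name
  printf '%s/%s/%s\n' "$W" "$1" "$2"
}

# Example: the first v02 patch, renamed so that it sorts before later patches.
attachment_url 12630490 5483-v02-01-Trace-filtering-and-tracestate-propagation.patch
```

Choosing names of the form {{5483-vNN-MM-...}} is what lets a single glob apply the patches in order.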

{noformat}
# Uncomment to exactly reproduce state.
#git checkout -b 5483-e30d6dc e30d6dc

# Download all needed patches with consistent names, apply patches, build.
W=https://issues.apache.org/jira/secure/attachment
for url in \
  $W/12630490/5483-v02-01-Trace-filtering-and-tracestate-propagation.patch \
  $W/12630491/5483-v02-02-Put-a-few-traces-parallel-to-the-repair-logging.patch \
  $W/12631967/5483-v03-03-Make-repair-tracing-controllable-via-nodetool.patch \
  $W/12633153/5483-v06-04-Allow-tracing-ttl-to-be-configured.patch \
  $W/12633154/5483-v06-05-Add-a-command-column-to-system_traces.events.patch \
  $W/12633155/5483-v06-06-Fix-interruption-in-tracestate-propagation.patch \
  $W/12633156/ccm-repair-test
do [ -e "$(basename "$url")" ] || curl -sO "$url"; done &&
git apply 5483-v0[236]-*.patch &&
ant clean && ant

# put on a separate line because you should at least minimally inspect
# arbitrary code before running.
chmod +x ./ccm-repair-test && ./ccm-repair-test
{noformat}

{{ccm-repair-test}} has some options for convenience:
{noformat}
-k keep (don't delete) the created cluster after successful exit.
-r repair only
-R don't repair
-t do traced repair only
-T don't do traced repair (if neither, then do both traced and untraced repair)
{noformat}
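
For readers who want to adapt the flags, here is a minimal sketch of how options like these are commonly parsed with the POSIX {{getopts}} builtin. It is not taken from {{ccm-repair-test}}; the function name {{parse_opts}} and the variables {{keep}}, {{repair}}, and {{traced}} are hypothetical.

```shell
# Hypothetical sketch; names are illustrative, not from ccm-repair-test.
parse_opts() {
  keep=0 repair=both traced=both   # defaults: delete cluster, do everything
  OPTIND=1                         # reset so the function can be called again
  while getopts kRrTt opt; do
    case $opt in
      k) keep=1 ;;          # -k keep the created cluster
      r) repair=only ;;     # -r repair only
      R) repair=none ;;     # -R don't repair
      t) traced=only ;;     # -t traced repair only
      T) traced=none ;;     # -T no traced repair
      *) return 2 ;;        # unknown option
    esac
  done
}

parse_opts -k -t
echo "keep=$keep repair=$repair traced=$traced"
```

With neither {{-t}} nor {{-T}} given, {{traced}} stays at {{both}}, matching the "if neither, then do both" rule above.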

The output of a test run:

{noformat}
Current cluster is now: test-5483-QiR
[2014-03-06 10:46:13,617] Nothing to repair for keyspace 'system'
[2014-03-06 10:46:13,646] Starting repair command #1, repairing 2 ranges for keyspace s1 (seq=true, full=true)
[2014-03-06 10:46:16,999] Repair session 72648190-a546-11e3-a5f4-f94811c7b860 for range (-3074457345618258603,3074457345618258602] finished
[2014-03-06 10:46:17,465] Repair session 73ee2ed0-a546-11e3-a5f4-f94811c7b860 for range (3074457345618258602,-9223372036854775808] finished
[2014-03-06 10:46:17,465] Repair command #1 finished
[2014-03-06 10:46:17,485] Starting repair command #2, repairing 2 ranges for keyspace system_traces (seq=true, full=true)
[2014-03-06 10:46:18,782] Repair session 74aaef20-a546-11e3-a5f4-f94811c7b860 for range (-3074457345618258603,3074457345618258602] finished
[2014-03-06 10:46:18,816] Repair session 74ff0290-a546-11e3-a5f4-f94811c7b860 for range (3074457345618258602,-9223372036854775808] finished
[2014-03-06 10:46:18,816] Repair command #2 finished
0 rows exported in 0.015 seconds.
test-5483-QiR-system_traces-events.txt
ok
[2014-03-06 10:46:24,128] Nothing to repair for keyspace 'system'
[2014-03-06 10:46:24,166] Starting repair command #3, repairing 2 ranges for keyspace s1 (seq=true, full=true)
[2014-03-06 10:46:25,366] Repair session 78a6d4e0-a546-11e3-a5f4-f94811c7b860 for range (-3074457345618258603,3074457345618258602] finished
[2014-03-06 10:46:25,415] Repair session 79263e10-a546-11e3-a5f4-f94811c7b860 for range (3074457345618258602,-9223372036854775808] finished
[2014-03-06 10:46:25,415] Repair command #3 finished
[2014-03-06 10:46:25,485] Starting repair command #4, repairing 2 ranges for keyspace system_traces (seq=true, full=true)
[2014-03-06 10:46:27,077] Repair session 796f7c10-a546-11e3-a5f4-f94811c7b860 for range (-3074457345618258603,3074457345618258602] finished
[2014-03-06 10:46:27,120] Repair session 79f240a0-a546-11e3-a5f4-f94811c7b860 for range (3074457345618258602,-9223372036854775808] finished
[2014-03-06 10:46:27,120] Repair command #4 finished
48 rows exported in 0.104 seconds.
test-5483-QiR-system_traces-events-tr.txt
found source: 127.0.0.1
found thread: Thread-15
found thread: AntiEntropySessions:1
found thread: RepairJobTask:1
found source: 127.0.0.2
found thread: AntiEntropyStage:1
found source: 127.0.0.3
found thread: AntiEntropySessions:2
found thread: Thread-16
found thread: AntiEntropySessions:3
found thread: AntiEntropySessions:4
unique sources traced: 3
unique threads traced: 8
All thread categories accounted for
ok
{noformat}
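
The "found source" / "found thread" summary at the end of the run can be approximated as follows. This is a hedged sketch, not the checking code in {{ccm-repair-test}}: it assumes the (source, thread) pairs have already been extracted from the exported events file, one whitespace-separated pair per line, and that thread names are deduplicated globally rather than per source. {{summarize_trace}} is a hypothetical name.

```shell
# Hypothetical summary step. Input on stdin: one "source thread" pair per
# line, already extracted from the exported events (extraction not shown).
summarize_trace() {
  awk '
    NF < 2 { next }                       # skip blank or malformed lines
    !($1 in src) { src[$1] = 1; ns++; print "found source: " $1 }
    !($2 in thr) { thr[$2] = 1; nt++; print "found thread: " $2 }
    END {
      print "unique sources traced: " (ns + 0)
      print "unique threads traced: " (nt + 0)
    }
  '
}

printf '%s\n' \
  '127.0.0.1 Thread-15' \
  '127.0.0.1 AntiEntropySessions:1' \
  '127.0.0.2 AntiEntropyStage:1' \
  '127.0.0.1 AntiEntropyStage:1' |
summarize_trace
```

Because the real export is CSV whose column order is not guaranteed, any extraction step feeding this function would need to be told which columns hold the source and the thread.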

---

Patch comments:

- {{v06-04}} I took an approach similar to {{v03-03}}, with (almost) no refactoring. The implementation
is a little messy architecturally.
- {{v06-05}} This is your suggestion to add a "command" column. I don't know how to
make it the last column. At least on my box, it's column 5 of 7 despite my putting it last
in the CQL. Note that {{ccm-repair-test}}'s checking code will break if the column order changes.
- {{v06-06}} You need to submit {{Runnable}}s, etc. using {{DebuggableThreadPoolExecutor}}
if you want them to inherit tracestate. Tracestate propagation is very easy to break under
concurrency, so this is probably the first thing to check if propagation ever fails again.


> Repair tracing
> --------------
>
>                 Key: CASSANDRA-5483
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-5483
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Tools
>            Reporter: Yuki Morishita
>            Assignee: Ben Chan
>            Priority: Minor
>              Labels: repair
>         Attachments: 5483-v06-04-Allow-tracing-ttl-to-be-configured.patch,
>                      5483-v06-05-Add-a-command-column-to-system_traces.events.patch,
>                      5483-v06-06-Fix-interruption-in-tracestate-propagation.patch,
>                      ccm-repair-test,
>                      test-5483-system_traces-events.txt,
>                      trunk@4620823-5483-v02-0001-Trace-filtering-and-tracestate-propagation.patch,
>                      trunk@4620823-5483-v02-0002-Put-a-few-traces-parallel-to-the-repair-logging.patch,
>                      trunk@8ebeee1-5483-v01-001-trace-filtering-and-tracestate-propagation.txt,
>                      trunk@8ebeee1-5483-v01-002-simple-repair-tracing.txt,
>                      v02p02-5483-v03-0003-Make-repair-tracing-controllable-via-nodetool.patch,
>                      v02p02-5483-v04-0003-This-time-use-an-EnumSet-to-pass-boolean-repair-options.patch,
>                      v02p02-5483-v05-0003-Use-long-instead-of-EnumSet-to-work-with-JMX.patch
>
>
> I think it would be nice to log repair stats and results, the way query tracing stores traces
> to the system keyspace. With it, you don't have to look up each log file to see what the status
> was and how the repair you invoked performed. Instead, you can query the repair log with the session
> ID to see the state and stats of all nodes involved in that repair session.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
