cassandra-commits mailing list archives

From "Jonathan Ellis (JIRA)" <j...@apache.org>
Subject [jira] Commented: (CASSANDRA-981) Dynamic endpoint snitch
Date Fri, 11 Jun 2010 16:54:14 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12877855#action_12877855 ]

Jonathan Ellis commented on CASSANDRA-981:
------------------------------------------

(1)

combine

+        windows.putIfAbsent(host, new AdaptiveLatencyTracker(WINDOW_SIZE));
+        AdaptiveLatencyTracker tracker = windows.get(host);

to

+        AdaptiveLatencyTracker tracker = windows.putIfAbsent(host, new AdaptiveLatencyTracker(WINDOW_SIZE));

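Even better: get, then putIfAbsent only if null. That avoids creating a new ALT object on
every call, and it also handles the fact that putIfAbsent returns the previous mapping, which
is null the first time a host is inserted.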
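
A minimal sketch of the get-then-putIfAbsent idiom from (1), assuming windows is a
ConcurrentMap. The Tracker class, the String host key, and the WINDOW_SIZE value are
hypothetical stand-ins for the patch's own types, not the code in 981.txt.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

class TrackerRegistry
{
    static final int WINDOW_SIZE = 100; // hypothetical value, not taken from the patch

    // Hypothetical stand-in for AdaptiveLatencyTracker.
    static class Tracker
    {
        final int size;

        Tracker(int size)
        {
            this.size = size;
        }
    }

    private final ConcurrentMap<String, Tracker> windows = new ConcurrentHashMap<String, Tracker>();

    Tracker trackerFor(String host)
    {
        // Fast path: no allocation when the host is already known.
        Tracker tracker = windows.get(host);
        if (tracker == null)
        {
            Tracker fresh = new Tracker(WINDOW_SIZE);
            // putIfAbsent returns the existing mapping, or null if ours was inserted.
            Tracker raced = windows.putIfAbsent(host, fresh);
            tracker = (raced == null) ? fresh : raced;
        }
        return tracker;
    }
}
```

If two threads race on the same new host, both end up with the same tracker and the
loser's allocation is simply discarded.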

(2)

receiveTiming isn't threadsafe.  use AtomicInteger?
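
One way the AtomicInteger suggestion in (2) might look, assuming receiveTiming counts
samples against a per-interval maximum. IntervalCounter, tryAccept and reset are
hypothetical names used only for this sketch.

```java
import java.util.concurrent.atomic.AtomicInteger;

class IntervalCounter
{
    private final int maxPerInterval;
    private final AtomicInteger updates = new AtomicInteger(0);

    IntervalCounter(int maxPerInterval)
    {
        this.maxPerInterval = maxPerInterval;
    }

    // Returns true if this sample is within the budget for the current interval.
    // getAndIncrement is atomic, so concurrent callers never see the same count.
    boolean tryAccept()
    {
        return updates.getAndIncrement() < maxPerInterval;
    }

    // Called once per interval to start a fresh budget.
    void reset()
    {
        updates.set(0);
    }
}
```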

(3)

AdaptiveLatencyTracker doesn't look threadsafe either.  definitely LBD isn't.  I think just
using a threadsafe queue like CLQ would work?  (My fault for naming ASD a Deque, when it only
really needs a Queue)
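
A rough sketch of a sample window backed by a thread-safe queue, as (3) suggests, assuming
CLQ means java.util.concurrent.ConcurrentLinkedQueue. LatencyWindow and its methods are
hypothetical. Under contention the size bound is approximate, since the append and the
eviction are two separate steps.

```java
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicInteger;

class LatencyWindow
{
    private final int capacity;
    private final ConcurrentLinkedQueue<Double> samples = new ConcurrentLinkedQueue<Double>();
    // ConcurrentLinkedQueue.size() walks the whole queue, so keep a separate count.
    private final AtomicInteger count = new AtomicInteger(0);

    LatencyWindow(int capacity)
    {
        this.capacity = capacity;
    }

    void add(double latency)
    {
        samples.offer(latency);
        // Evict the oldest sample once the window is over capacity.
        if (count.incrementAndGet() > capacity && samples.poll() != null)
            count.decrementAndGet();
    }

    int size()
    {
        return count.get();
    }

    double mean()
    {
        double sum = 0;
        int n = 0;
        // Iteration is weakly consistent: safe while other threads add samples.
        for (double sample : samples)
        {
            sum += sample;
            n++;
        }
        return n == 0 ? 0.0 : sum / n;
    }
}
```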

(4) 

do we need unregister()?  if not, let's drop that api

(5)

sortByProximity needs to return zero if both scores are null.  even better, take a non-dynamic
snitch and use the static topology when there is no score info yet (this would save us from
sending data requests to another data center after every clear of the stats).  So, rather
than using DES directly in the config, maybe having a boolean for whether to wrap your regular
snitch, with the dynamic one, is the way to go.

what if reset cleared scores, instead of latencies?  this would result in a more gradual aging
out of both slow and fast latencies as new ones were pushed in, which would make it more tolerant
of brief hiccups where a mostly fast node had a couple slow responses.  Feels more how phi
was meant to work, to me.
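
The comparison rule from the first paragraph of (5) could be sketched as below.
ScoreAwareComparator is a hypothetical name, hosts are plain Strings to keep the example
self-contained, and staticOrder stands in for the wrapped non-dynamic snitch. Two points
are assumptions of this sketch rather than anything stated above: a lower score means
"closer", and a host with a score sorts ahead of one without.

```java
import java.util.Comparator;
import java.util.Map;

class ScoreAwareComparator implements Comparator<String>
{
    private final Map<String, Double> scores;
    private final Comparator<String> staticOrder; // ordering from the wrapped static snitch

    ScoreAwareComparator(Map<String, Double> scores, Comparator<String> staticOrder)
    {
        this.scores = scores;
        this.staticOrder = staticOrder;
    }

    @Override
    public int compare(String a, String b)
    {
        Double scoreA = scores.get(a);
        Double scoreB = scores.get(b);
        // No score info for either host yet: defer to the static topology.
        if (scoreA == null && scoreB == null)
            return staticOrder.compare(a, b);
        // Assumption: a scored host sorts ahead of an unscored one.
        if (scoreA == null)
            return 1;
        if (scoreB == null)
            return -1;
        // Assumption: lower score is closer; ties fall back to the static order.
        int byScore = scoreA.compareTo(scoreB);
        return byScore != 0 ? byScore : staticOrder.compare(a, b);
    }
}
```

Right after the stats are cleared every lookup misses, so the ordering is exactly the
static snitch's and reads stay in the local data center.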

(6)

+        if (address != FBUtilities.getLocalAddress()) // we only know about ourself
+            return addresses;

let's change this to an assert

(7)

deque.offer is more idiomatic than try/catch in java
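
To illustrate (7): on a capacity-bounded deque, add() throws IllegalStateException when
full, while offer() just returns false. The sketch assumes a LinkedBlockingDeque as the
bounded deque; BoundedPush and push are hypothetical names.

```java
import java.util.concurrent.LinkedBlockingDeque;

class BoundedPush
{
    // Appends value at the tail, evicting from the head while the deque is full.
    static void push(LinkedBlockingDeque<Double> deque, double value)
    {
        // offer() reports a full deque through its return value,
        // so no exception handling is needed.
        while (!deque.offer(value))
            deque.poll();
    }
}
```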

(8)

let's use a single timer for both update and reset
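
One reading of (8), sketched with java.util.Timer: both periodic jobs are scheduled on the
same Timer instance. SnitchTimers and start are hypothetical names, and the intervals are
left to the caller.

```java
import java.util.Timer;
import java.util.TimerTask;

class SnitchTimers
{
    // Schedules both periodic jobs on one Timer, so they share a single thread.
    static Timer start(final Runnable update, long updateMillis, final Runnable reset, long resetMillis)
    {
        Timer timer = new Timer("snitch-timer", true); // daemon thread
        timer.schedule(new TimerTask()
        {
            public void run()
            {
                update.run();
            }
        }, updateMillis, updateMillis);
        timer.schedule(new TimerTask()
        {
            public void run()
            {
                reset.run();
            }
        }, resetMillis, resetMillis);
        return timer;
    }
}
```

Because a Timer runs all of its tasks on one thread, update and reset never execute
concurrently with each other in this arrangement.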

(9)

it's not completely clear to me that the phi code designed for telling "how long has been
too long to wait for updates that are supposed to arrive at the same interval," applies well
to latency information that arrives in bursts up to our max per interval.  can you add some
tests showing that it does the right thing, given several mixes of latencies?

> Dynamic endpoint snitch
> -----------------------
>
>                 Key: CASSANDRA-981
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-981
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>            Reporter: Jonathan Ellis
>            Assignee: Brandon Williams
>             Fix For: 0.7
>
>         Attachments: 981.txt
>
>
> An endpoint snitch that automatically and dynamically infers "distance" to other machines
> without having to explicitly configure rack and datacenter positions solves two problems:
> The killer feature here is adapting to things like compaction or a failing-but-not-yet-dead
> disk.  This is important, since when we are doing reads we pick the "closest" replica for
> actually reading data from (and only read md5s from other replicas).  This means that if the
> closest replica by network topology is temporarily slow due to compaction (for instance),
> we'll have to block for its reply even if we get the other replies much much faster.
> Not having to manually re-sync your configuration with your network topology when changes
> (adding machines) are made is a nice bonus.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

