cassandra-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Tyler Hobbs (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CASSANDRA-11320) Improve backoff policy for cqlsh COPY FROM
Date Wed, 30 Mar 2016 18:55:25 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-11320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15218590#comment-15218590
] 

Tyler Hobbs commented on CASSANDRA-11320:
-----------------------------------------

The patch looks good so far!  Here are some review comments:

* The max backoff of 2^32 seconds is way too high.  I feel like we should start lower (maybe
10ms?) and set {{maxbackoffattempts}} to 12, for a max sleep of ~40 seconds.
* I'm also not sure if we should expose the three new config options to users.  They're advanced
enough that I don't think anybody will realistically be able to tune them well.  As long as
we pick decent defaults, I think they're okay being internal only.  (Plus, in a worst-case
scenario, you can directly edit the .py files.)
* In {{feed()}} (inside {{init_feeding_thread()}}), should we only be debug printing the exception?
 It seems like we should at least be using printmsg(), but I could be missing a good reason
to keep it at the debug level.  If there is a good reason, add a code comment about it.
* I think {{filter_replica()}} can be renamed to {{replica_is_not_overloaded()}} for more
clarity.
* {{ConnectionWrapper.connections}} can be declared inside of the class as a class-level attribute.
* Should we make {{maxinserterrors}} default to high non-negative value, something like 1k?
 Right now, if your cluster goes down, it can take a _long_ time for COPY to error out the
entire CSV file.


There's also a pretty minor issue: with {{--debug}} and a node that's slow to respond, I sometimes
see a few {{ImportError}}s like this:
{noformat}
Replicas too busy, given up
Traceback (most recent call last):
  File "/home/thobbs/cassandra/bin/cqlsh.py", line 2618, in main
    from tzlocal import get_localzone
ImportError: No module named tzlocal
{noformat}

> Improve backoff policy for cqlsh COPY FROM
> ------------------------------------------
>
>                 Key: CASSANDRA-11320
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-11320
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Tools
>            Reporter: Stefania
>            Assignee: Stefania
>              Labels: doc-impacting
>             Fix For: 3.0.x, 3.x
>
>
> Currently we have an exponential back-off policy in COPY FROM that kicks in when timeouts
are received. However there are two limitations:
> * it does not cover new requests and therefore we may not back-off sufficiently to give
time to an overloaded server to recover
> * the pause is performed in the receiving thread and therefore we may not process server
messages quickly enough
> There is a static throttling mechanism in rows per second from feeder to worker processes
(the INGESTRATE) but the feeder has no idea of the load of each worker process. However it's
easy to keep track of how many chunks a worker process has yet to read by introducing a bounded
semaphore.
> The idea is to move the back-off pauses to the worker processes main thread so as to
include all messages, new and retries, not just the retries that timed out. The worker process
will not read new chunks during the back-off pauses, and the feeder process can then look
at the number of pending chunks before sending new chunks to a worker process.
> [~aholmber], [~aweisberg] what do you think?  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Mime
View raw message