On Wed, Jul 7, 2010 at 11:33 AM, Jonathan Ellis <jbellis@gmail.com> wrote:
Having a few requests time out while the service detects badness is
typical in this kind of system.  I don't think writing a completely
separate StorageProxy + supporting classes to allow avoiding this in
exchange for RF times the network bandwidth is a good idea.

My suggestion of turning off (or tuning) the full-data/md5 optimization assumes that exposing this as a configuration option would be less work and less complicated than dynamically detecting and routing around slow nodes. From your reply, it sounds as if this assumption does not hold. I assumed (without looking at the code) that the existing StorageProxy could expose this as a configuration option without requiring a lot of additional work, and certainly without requiring an entirely separate set of supporting classes.

I'm curious what performance benefit is actually being gained from this optimization. Has this benefit been tested and measured? Since the benefit depends greatly on the size of the data being requested, for smallish data requests the performance improvement would be negligible, correct? Given that the reliability downside is rather severe, this feels like a trade-off a system administrator would like to be able to make.
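
To make the size dependence concrete, here is a rough back-of-the-envelope sketch, not a measurement. It assumes the coordinator contacts every replica for a read, that a digest reply carries only a 16-byte MD5, and that message headers and framing can be ignored. The function name and parameters are hypothetical and are not taken from the Cassandra code.

```python
MD5_DIGEST_BYTES = 16  # an MD5 digest is 128 bits


def read_response_bytes(row_bytes, replicas_contacted, digest_reads=True):
    """Approximate payload bytes returned to the coordinator for one read.

    Assumption: with digest reads, one replica returns the full row and the
    others return only an MD5 digest. Without them, every replica returns
    the full row. Headers and framing overhead are ignored.
    """
    if replicas_contacted < 1:
        raise ValueError("at least one replica must be contacted")
    if digest_reads:
        return row_bytes + (replicas_contacted - 1) * MD5_DIGEST_BYTES
    return row_bytes * replicas_contacted


if __name__ == "__main__":
    replicas = 3  # assumed replication factor, all replicas contacted
    for row_bytes in (100, 10_000, 1_000_000):
        with_digest = read_response_bytes(row_bytes, replicas, True)
        full_data = read_response_bytes(row_bytes, replicas, False)
        print(f"row={row_bytes:>9} B  digest={with_digest:>9} B  "
              f"full={full_data:>9} B  extra={full_data - with_digest:>9} B")
```

Under these assumptions the extra cost of full-data reads grows linearly with the row size, so for rows of a few hundred bytes the absolute difference per read is small, while for large rows it approaches RF times the bandwidth.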

Sorry to keep beating this horse, but we're regularly being alerted to performance issues any time a mini-compaction occurs on any node in our cluster. I'm fishing for a quick and easy way to prevent this from happening.