On Mon, Jun 4, 2012 at 2:34 PM, aaron morton <aaron@thelastpickle.com> wrote:
IIRC index slices work a little differently with consistency: they need CL nodes available for all token ranges. If you drop it to CL ONE, the read is local only for a particular token range.

Yes, this is what we observed. When I reasoned my way through what I knew about how secondary indexes work, I came to the same conclusion about all token ranges having to be available. 

My surprise at the behavior was because I hadn't reasoned my way through it until we had the issue. Somehow I doubt I'm the only user of secondary indexes who was unaware of this ramification of CL choice. It might be a good idea for the documentation to reflect the tradeoffs more clearly.

Thanks for your help!

Jim
 

The problem with index reads is that the partitioner can no longer select the nodes that contain the results.

Cheers

-----------------
Aaron Morton
Freelance Developer
@aaronmorton

On 2/06/2012, at 5:15 AM, Jim Ancona wrote:

Hi,

We have an application with two code paths, one of which uses a secondary index query and one of which doesn't. While testing node-down scenarios in our cluster, we got a result that surprised (and concerned) me, and I wanted to find out whether the behavior we observed is expected.

Background:
  • 6 nodes in the cluster (in order: A, B, C, E, F and G)
  • RF = 3
  • All operations at QUORUM
  • Operation 1: Read by row key followed by write
  • Operation 2: Read by secondary index, followed by write
While running a mixed workload of operations 1 and 2, we got the following results:

Scenario             Result
-------------------  ----------------------------------------------
All nodes up         All operations succeed
One node down        All operations succeed
Nodes A and E down   All operations succeed
Nodes A and B down   Operation 1: ~33% fail; Operation 2: all fail
Nodes A and C down   Operation 1: ~17% fail; Operation 2: all fail

We had expected (perhaps incorrectly) that the secondary index reads would fail in proportion to the portion of the ring that was unable to reach quorum, just as the row key reads did. For both operation types the underlying failure was an UnavailableException.

The same pattern repeated for the other scenarios we tried. The row key operations failed at the expected ratios, given the portion of the ring that was unable to meet quorum because of nodes down, while all the secondary index reads failed as soon as any 2 of 3 adjacent nodes were down.
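
To make the ratios above concrete, here is a rough sketch of the arithmetic. It is only a model, not Cassandra code. It assumes evenly balanced tokens, row keys spread uniformly over the ring, and simple placement where each range lives on its owning node plus the next RF - 1 nodes. The names RING, replicas, unavailable_ranges and summarize are made up for illustration.

```python
# Hypothetical model: each token range is stored on its owning node plus the
# next RF - 1 nodes around the ring (an assumption, not taken from the thread).
RING = ["A", "B", "C", "E", "F", "G"]  # node order from the message
RF = 3
QUORUM = RF // 2 + 1  # 2 of 3 replicas


def replicas(owner_index, ring=RING, rf=RF):
    """Nodes holding the token range owned by ring[owner_index]."""
    return [ring[(owner_index + k) % len(ring)] for k in range(rf)]


def unavailable_ranges(down, required=QUORUM, ring=RING, rf=RF):
    """Owners of ranges with fewer than `required` live replicas."""
    down = set(down)
    return [
        ring[i]
        for i in range(len(ring))
        if sum(node not in down for node in replicas(i, ring, rf)) < required
    ]


def summarize(down):
    """Return (fraction of ranges failing QUORUM, index read possible?)."""
    bad = unavailable_ranges(down)
    row_key_failure = len(bad) / len(RING)  # row key reads hit one range each
    index_read_ok = not bad                 # index reads need every range
    return row_key_failure, index_read_ok


for down in (["A"], ["A", "E"], ["A", "B"], ["A", "C"]):
    fraction, index_ok = summarize(down)
    print(f"down={down}: row-key failures ~{fraction:.0%}, "
          f"index read {'succeeds' if index_ok else 'fails'}")
```

Under those assumptions the model gives 2 of 6 ranges unavailable with A and B down and 1 of 6 with A and C down, which lines up with the ~33% and ~17% we measured, and the index read fails as soon as any single range cannot reach quorum.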

Is this an expected behavior? Is it documented anywhere? I didn't find it with a quick search.

The operation that does the secondary index query is an important one for our app, and we'd really prefer that it degrade gracefully in the face of cluster failures. My plan at this point is to do that query at ConsistencyLevel.ONE (and accept the increased risk of inconsistency). Will that work?
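
As a sanity check on that plan, here is a small availability-only sketch comparing QUORUM and ONE for an index read. It assumes the index read needs the requested number of live replicas in every token range, and that each range lives on its owning node plus the next two nodes in the ring. The function index_read_available is hypothetical, and the model says nothing about the consistency risk of reading at ONE.

```python
RING = ["A", "B", "C", "E", "F", "G"]  # node order from the message
RF = 3


def index_read_available(down, required, ring=RING, rf=RF):
    """True if every token range still has `required` live replicas.

    Assumes each range is stored on its owner and the next rf - 1 nodes.
    """
    down = set(down)
    n = len(ring)
    for i in range(n):
        live = sum(ring[(i + k) % n] not in down for k in range(rf))
        if live < required:
            return False
    return True


for down in (["A", "B"], ["A", "C"], ["A", "B", "C"]):
    at_quorum = index_read_available(down, required=2)  # QUORUM with RF = 3
    at_one = index_read_available(down, required=1)     # ConsistencyLevel.ONE
    print(f"down={down}: QUORUM ok={at_quorum}, ONE ok={at_one}")
```

If the model holds, the index read at ONE should keep working until all three replicas of some range are down, that is, three adjacent nodes.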

Thanks in advance,

Jim