On Tuesday, October 2, 2012 at 18:52, Hiller, Dean wrote:

Because the data for an index is not all together (i.e., you need a multiget to fetch the data). It is not contiguous.

With the prefix in a partition, they keep the data together, so from what I understand all data for a prefix is contiguous.

So you're saying that you can access the primary index with a key range, but to access the secondary index, you first need to get all keys and follow up with a multiget, which would use the secondary index to speed the lookup of the matching rows?
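To make sure we mean the same thing, here is a toy Python sketch of the two access paths. It is only an in-memory model, not Cassandra code: the `rows` dict, the `city_index` dict, and the `range_scan`, `multiget`, and `index_lookup` helpers are all hypothetical names made up for illustration.

```python
from bisect import bisect_left, bisect_right

# Toy model only: a dict of rows plus a sorted key list stands in for
# rows that are stored contiguously in key order.
rows = {
    "user:001": {"city": "Oslo"},
    "user:002": {"city": "Bergen"},
    "user:003": {"city": "Oslo"},
    "user:004": {"city": "Tromso"},
}
sorted_keys = sorted(rows)

def range_scan(start, end):
    """Primary-key range: one contiguous slice of the sorted keys."""
    lo = bisect_left(sorted_keys, start)
    hi = bisect_right(sorted_keys, end)
    return sorted_keys[lo:hi]

# Hypothetical secondary index: column value -> list of row keys.
city_index = {}
for key, cols in rows.items():
    city_index.setdefault(cols["city"], []).append(key)

def multiget(keys):
    """Fetch each row individually; the keys need not be adjacent."""
    return {k: rows[k] for k in keys}

def index_lookup(city):
    """Secondary-index path: find the matching keys, then multiget them."""
    return multiget(city_index.get(city, []))
```

In this model `range_scan("user:002", "user:003")` reads one adjacent run of keys, while `index_lookup("Oslo")` touches `user:001` and `user:003`, which are not next to each other.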

QUESTION: What I don't get in the comment is this: I assume you are referring to CQL, in which case we would need to specify the partition (in addition to the index), which means all that data is on one node, correct? Or did I miss something there?

Maybe my question was just silly - I wasn't referring to CQL.

As for the locality of the data, I was hoping to be able to fire off an MR job to process all matching rows in the CF - I was assuming that this job would get executed on the same node as the data.

But I think the real confusion in my question has to do with the way the ColumnFamilyInputFormat has been implemented, since it would appear that it ingests the entire (non-OPP) CF into Hadoop, such that the predicate needs to be applied in the job rather than up front in the Cassandra query.
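If that reading is right, the pattern looks roughly like the Python sketch below. This is only a model of "scan everything, filter in the job"; it does not use the real ColumnFamilyInputFormat API, and `full_scan`, `map_with_predicate`, and the sample `cf` dict are hypothetical names invented for the illustration.

```python
def full_scan(cf):
    """Stand-in for an input format that hands every row to the job."""
    yield from cf.items()

def map_with_predicate(records, predicate):
    """Map step: the filter runs inside the job, after the rows were read."""
    for key, cols in records:
        if predicate(cols):
            yield key, cols

# Hypothetical column family with three rows, two of which match.
cf = {
    "k1": {"state": "active"},
    "k2": {"state": "idle"},
    "k3": {"state": "active"},
}

rows_read = sum(1 for _ in full_scan(cf))
matches = dict(map_with_predicate(full_scan(cf), lambda cols: cols["state"] == "active"))
```

Here `rows_read` is 3 even though only 2 rows match, which is the cost I am worried about: every row is shipped to the job before the predicate can discard it.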