My high-level understanding of how Cassandra handles a SELECT is that :
      (excuse incorrect terminology)
  1.  client connects to some node N
  2.  node N acts as a kind of coordinator and fires off the thrift or binary-protocol messages
      to all other nodes to fetch rows off the memtables and/or disks
The internode messages are a custom binary protocol, not the thrift / native api messages. These messages are also used on the node to move your request into the appropriate thread pool.

Each node reads the data needed for the request as if it were the only node performing the request. The only time we act differently is when sending the data back to the coordinator.
 
  3.   coordinator merges,  truncates,  etc the sets from the nodes and returns one answer set to client.

The coordinator simply compares the results from the replicas and determines if they match. It does not merge or truncate.

If they do not match we perform the read again, but this time transmit some extra data so we can resolve differences. 
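
A rough sketch of that compare-then-resolve step, in Python rather than Cassandra's actual Java code. Everything here is an assumption made for illustration: the function names are hypothetical, a replica's answer is modelled as a dict of column name -> (value, timestamp), MD5 stands in for whatever digest is really used, and differences are resolved by keeping the value with the highest timestamp.

```python
import hashlib


def digest(columns):
    """Hash one replica's columns (name -> (value, timestamp)) in name order."""
    h = hashlib.md5()
    for name in sorted(columns):
        value, ts = columns[name]
        h.update(f"{name}={value}@{ts};".encode())
    return h.hexdigest()


def reconcile(replica_results):
    """Second pass: with the full data in hand, keep the newest value per column."""
    resolved = {}
    for columns in replica_results:
        for name, (value, ts) in columns.items():
            if name not in resolved or ts > resolved[name][1]:
                resolved[name] = (value, ts)
    return resolved


def coordinator_read(replica_results):
    """Return (result, needed_second_read) for a list of replica answers."""
    digests = {digest(columns) for columns in replica_results}
    if len(digests) == 1:
        # All replicas agree: hand back one answer untouched.
        return replica_results[0], False
    # Mismatch: the real system reads again with extra data; here we
    # already hold the full data, so we just resolve the differences.
    return reconcile(replica_results), True
```

When the replicas agree the answer is passed through unchanged, so nothing here reorders it.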

It is step 3 which has me wondering  -   does it explicitly preserve the on-disk order?
The order from the on-disk read (including reverse order, if the select statement asks for it) is preserved in the serialisation process, after which we never order again.
 
In fact  -  does it simply keep each individual node's answer set separate?   Is that how it works?

I did some recent webinars for PlanetCassandra that may help:

Introduction to Apache Cassandra 1.2
http://thelastpickle.com/speaking/2013/04/25/Community-Webinar.html

Talks about the read / write and cluster process at a high level. 

Cassandra Internals
http://thelastpickle.com/speaking/2013/08/25/Cassandra-Community-Webinar.html 

Goes deep into the code to explain how cassandra works. 

Hope that helps. 




-----------------
Aaron Morton
New Zealand
@aaronmorton

Co-Founder & Principal Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com

On 13/09/2013, at 1:11 AM, John Lumby <johnlumby@hotmail.com> wrote:

Aaron,   thanks for the super-rapid response.    That clarifies a lot for me,
but I think I am still wondering about one point embedded below.

________________________________
From: aaron@thelastpickle.com
Subject: Re: is the select result grouped by the value of the partition key?
Date: Thu, 12 Sep 2013 14:19:06 +1200
To: user@cassandra.apache.org

GROUP BY "feature",
I would not think of it like that, this is about physical order of rows.

since it seems really important yet does not seem to be mentioned in the
CQL reference documentation.
It's baked in, this is how the data is organised in the row.

Yes,   I see,   and I absolutely get the relevance of where columns are stored on disk to,
say,  doing INSERTs.
But what I am wondering about is,  in the context of a SELECT,    we seem to be relying on
the Cassandra client api preserving that on-disk order while returning rows.
My high-level understanding of how Cassandra handles a SELECT is that :
      (excuse incorrect terminology)
  1.  client connects to some node N
  2.  node N acts as a kind of coordinator and fires off the thrift or binary-protocol messages
      to all other nodes to fetch rows off the memtables and/or disks
  3.   coordinator merges,  truncates,  etc the sets from the nodes and returns one answer set to client.

It is step 3 which has me wondering  -   does it explicitly preserve the on-disk order?
In fact  -  does it simply keep each individual node's answer set separate?   Is that how it works?


http://www.datastax.com/dev/blog/thrift-to-cql3
We often say the PRIMARY KEY is the PARTITION KEY and the GROUPING COLUMNS
http://www.datastax.com/documentation/cql/3.0/webhelp/index.html#cql/cql_reference/create_table_r.html

See also http://thelastpickle.com/blog/2013/01/11/primary-keys-in-cql.html

Is it something we can bet the farm and farmer's family on?
Sure.

The kinds of scenarios where I am wondering if it's possible for  
partition-key groups
to get intermingled are :
All instances of the table entity with the same value(s) for the
PARTITION KEY portion of the PRIMARY KEY exist in the same storage
engine row.
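
A toy model of why that keeps partition-key groups from intermingling. This is only a sketch under assumptions: the Table class and token() are hypothetical names, MD5 stands in for the partitioner's token function, and a storage engine row is modelled as a sorted Python list.

```python
import hashlib
from bisect import insort
from collections import defaultdict


def token(partition_key):
    # Stand-in for the partitioner's token function (an assumption).
    return int(hashlib.md5(partition_key.encode()).hexdigest(), 16)


class Table:
    def __init__(self):
        # partition key -> one "storage engine row", kept sorted by clustering value
        self.rows = defaultdict(list)

    def insert(self, partition_key, clustering, value):
        insort(self.rows[partition_key], (clustering, value))

    def select_all(self):
        # Walk one storage row at a time, so entries that share a
        # partition key always come out next to each other.
        for pk in sorted(self.rows, key=token):
            for clustering, value in self.rows[pk]:
                yield pk, clustering, value
```

Whatever order the inserts arrive in, select_all() yields every entry for one partition key before moving on to the next, and within a partition the entries come out in clustering order.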

  .   what if the node containing primary copy of a row is down
There is no primary copy of a row.

  .   what if there is a heavy stream of UPDATE activity from  
applications which
      connect to all nodes,   causing different nodes to have different  
versions of replicas of same row?
That's fine with me.
It's only an issue when the data is read, and at that point the  
Consistency Level determines what we do.
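
As a small illustration of what the Consistency Level controls, here is how many replicas have to answer a read at three common levels. The function name is hypothetical and only ONE, QUORUM and ALL are covered; it is a sketch, not the server's code.

```python
def replicas_required(consistency_level, replication_factor):
    """Number of replicas that must respond before the read can return."""
    level = consistency_level.upper()
    if level == "ONE":
        return 1
    if level == "QUORUM":
        # A majority of the replicas.
        return replication_factor // 2 + 1
    if level == "ALL":
        return replication_factor
    raise ValueError(f"level not covered by this sketch: {consistency_level}")
```

With a replication factor of 3, QUORUM needs 2 replicas, so a read can still succeed with one replica down.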

Hope that helps.



On 12/09/2013, at 7:43 AM, John Lumby <johnlumby@hotmail.com> wrote:

I would like to make quite sure about this implicit GROUP BY "feature",

since it seems really important yet does not seem to be mentioned in the
CQL reference documentation.



Aaron,   you said "yes"  --   is that "yes,  always,   in all scenarios  
no matter what"

or "yes usually"?      Is it something we can bet the farm and farmer's  
family on?



The kinds of scenarios where I am wondering if it's possible for  
partition-key groups
to get intermingled are :



  .   what if the node containing primary copy of a row is down
                and
cassandra fetches this row from a replica on a different node
               (e.g.  with CONSISTENCY ONE)

  .   what if there is a heavy stream of UPDATE activity from  
applications which
      connect to all nodes,   causing different nodes to have different  
versions of replicas of same row?



Can you point me to some place in the cassandra source code where this  
grouping is ensured?



Many thanks,

John Lumby