2012/9/24 Hiller, Dean <Dean.Hiller@nrel.gov>
I am confused. In this email you say you want "get all requests for a user" and in a previous one you said "Select all the users which has new requests, since date D" so let me answer both…
I have both needs. These are the two queries I need to perform on the model.
For the latter, you make ONE query into the latest partition (ONE partition) of the GlobalRequestsCF, which gives you the most recent requests ALONG with the user ids of those requests. If you queried all partitions, you would most likely blow out your JVM memory.
For the former, you make ONE query to the UserRequestsCF with userid = <your user id> to get all the requests for that user.
Now I think I got the main idea! This answered a lot!
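To make the two lookups concrete, here is a minimal in-memory sketch. It is not Cassandra or playOrm code: the class name, the method names, the "userId:requestId" value format and the monthly partition ids such as "2012-09" are all assumptions made for illustration, with plain maps standing in for GlobalRequestsCF and UserRequestsCF.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// In-memory stand-in for the two column families discussed above.
// All names and the monthly partition ids are illustrative assumptions.
class RequestModelSketch {
    // GlobalRequestsCF: partition id (e.g. "2012-09") -> entries of the form "userId:requestId"
    private final TreeMap<String, List<String>> globalRequests = new TreeMap<>();
    // UserRequestsCF: user id -> request ids for that user
    private final Map<String, List<String>> userRequests = new HashMap<>();

    // Every write goes to both views of the data.
    void addRequest(String partitionId, String userId, String requestId) {
        globalRequests.computeIfAbsent(partitionId, k -> new ArrayList<>())
                      .add(userId + ":" + requestId);
        userRequests.computeIfAbsent(userId, k -> new ArrayList<>())
                    .add(requestId);
    }

    // "Users with new requests": read ONLY the latest partition, never all of them.
    List<String> latestPartition() {
        if (globalRequests.isEmpty()) {
            return new ArrayList<>();
        }
        return new ArrayList<>(globalRequests.lastEntry().getValue());
    }

    // "All requests for a user": one lookup by user id.
    List<String> requestsFor(String userId) {
        return new ArrayList<>(userRequests.getOrDefault(userId, new ArrayList<>()));
    }
}
```

The point of the sketch is only that each question is answered by one lookup against the structure built for it, at the cost of writing every request twice.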
Sorry, I was skipping some context. The backing index is often stored as one long row, so in playOrm too many rows in a partition means too many columns in the index row for that partition. I believe the same is true of Cassandra's indexing.
Oh, ok, you were talking about the wide row pattern, right? But playORM is compatible with Aaron's model, isn't it? Can I map exactly this using playORM? The hardest thing for me about using playORM right now is that I don't know Cassandra well yet, and I know playORM even less. Can I ask playOrm questions on this list? I will try to create a POC here!
Only now I am starting to understand what it does ;-) The examples directory is empty for now, I would like to see how to set up the connection with it.
Cassandra spreads all your data out on all nodes with or without partitions. A single partition does have its data co-located though.
Now I see. The main advantage of using partitions is keeping the indexes small enough. It has nothing to do with the nodes. Thanks!
If you are at 100k (and the requests are rather small), you could embed all the requests in the user or go with Aaron's suggestion below of a UserRequestsCF. If your requests are rather large, you probably don't want to embed them in the User. Either way, it's one query or one row key lookup.
I see it now.
Multiget ignores partitions…you feed it a LIST of keys and it gets them. It just so happens that partitionId had to be part of your row key.
Do you mean I need to load all the keys in memory to do a multiget?
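One way to get the key list without reading anything first is to compute it, since the key layout is known. The sketch below is a hedged illustration only: it assumes the userId:partitionId row key mentioned later in this thread and hypothetical monthly partition ids, and the class and method names are made up. The resulting list would be handed to whatever multiget call the client library provides, which is not shown here.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: builds the row keys to feed to a multiget.
// The "<userId>:<partitionId>" layout is an assumption taken from this thread.
class MultigetKeys {
    static List<String> rowKeys(String userId, List<String> partitionIds) {
        List<String> keys = new ArrayList<>();
        for (String partitionId : partitionIds) {
            // One row key per partition of interest for this user.
            keys.add(userId + ":" + partitionId);
        }
        return keys;
    }
}
```

For a user "u42" and the partitions "2012-08" and "2012-09", this yields the keys "u42:2012-08" and "u42:2012-09", so only the keys for the partitions being asked about are ever held in memory.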
I have used Hector and now use Astyanax, I don't worry much about that layer, but I feed astyanax 3 nodes and I believe it discovers some of the other ones. I believe the latter is true but am not 100% sure as I have not looked at that code.
Why did you move? Hector is being considered as the "official" client for Cassandra, isn't it? I looked at the Astyanax API, and it seemed much more high-level, though.
As an analogy on the above, if you happen to have used PlayOrm, you would ONLY need one Requests table, partitioned by user AND by time (two views into the same data, partitioned two different ways), and you can do exactly the same thing as Aaron's example. PlayOrm doesn't embed the partition ids in the key, leaving it free to partition twice as in your case. When the partition id is in the key, a refactor has to map/reduce A LOT more rows, because other rows hold FKs of the form <partitionid><subrowkey>; if the partition id is not in the key, you only map/reduce the partitioned table in a redesign/refactor. That said, we will be adding support for CQL partitioning in addition to PlayOrm partitioning, even though it can be a little less flexible sometimes.
I am not sure I understood this part. If I need to refactor, having the partition id in the key would be a bad thing? What would be the alternative? In my case, as I use userId : partitionId as row key, this might be a problem, right?
Also, CQL locates all the data for a partition on one node. We have found it can "sometimes" be faster, thanks to parallelized disks, when a partition's rows are NOT all on one node, so PlayOrm partitions are virtual only and do not relate to where the rows are stored. An example on our 6 nodes: a join query on a partition with 1,000,000 rows took 60ms (of course I can't compare to CQL here since it doesn't do joins). It really depends on how much data is going to come back in the query, though. There are tradeoffs, of course, between spreading the data across disks on parallel nodes and having it all on one node.
I guess I am still not ready for this level of info. :D
In the playORM readme, we have the following:
@NoSqlQuery(name="findWithJoinQuery", query="PARTITIONS t(:partId) SELECT t FROM TABLE as t "+
"INNER JOIN t.activityTypeInfo as i WHERE i.type = :type and t.numShares < :shares"),
What would happen behind the scenes when I execute this query? You can only use joins with partition keys, right?
In this case, is partId the row id of TABLE CF?
Thanks a lot for the answers