incubator-cassandra-user mailing list archives

From Marcelo Elias Del Valle <mvall...@gmail.com>
Subject Re: Correct model
Date Mon, 24 Sep 2012 17:07:52 GMT
2012/9/24 Hiller, Dean <Dean.Hiller@nrel.gov>

> I am confused.  In this email you say you want "get all requests for a
> user" and in a previous one you said "Select all the users which has new
> requests, since date D" so let me answer both…
>

I have both needs. These are the two queries I need to perform on the model.


> For latter, you make ONE query into the latest partition(ONE partition) of
> the GlobalRequestsCF which gives you the most recent requests ALONG with
> the user ids of those requests.  If you queried all partitions, you would
> most likely blow out your JVM memory.
>
> For the former, you make ONE query to the UserRequestsCF with userid =
> <your user id> to get all the requests for that user
>

Now I think I got the main idea! This answered a lot!
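
To check my understanding, here is a minimal in-memory sketch of the two column families and the two queries. It does not talk to Cassandra at all: plain maps stand in for GlobalRequestsCF and UserRequestsCF, and the class name, method names, and one-day partition size are my own assumptions, not anything from Aaron's model or playORM.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

/** In-memory stand-in for the two column families discussed in this thread. */
class RequestModelSketch {
    // Assumption: one partition per day. The real partition size is a design choice.
    static final long PARTITION_MILLIS = 24L * 60 * 60 * 1000;

    // GlobalRequestsCF: row key = time partition, columns sorted by timestamp -> user id.
    // Simplification: two requests with the same timestamp would overwrite each other here.
    final Map<Long, TreeMap<Long, String>> globalRequests = new HashMap<>();

    // UserRequestsCF: row key = user id, columns sorted by timestamp -> request body.
    final Map<String, TreeMap<Long, String>> userRequests = new HashMap<>();

    static long partitionOf(long timestampMillis) {
        return timestampMillis / PARTITION_MILLIS;
    }

    /** Every write goes to both column families (the data is denormalized). */
    void addRequest(String userId, long timestampMillis, String body) {
        globalRequests.computeIfAbsent(partitionOf(timestampMillis), k -> new TreeMap<>())
                      .put(timestampMillis, userId);
        userRequests.computeIfAbsent(userId, k -> new TreeMap<>())
                    .put(timestampMillis, body);
    }

    /** Query 1: users with requests since a date, reading only the latest partition. */
    Set<String> usersWithRequestsSince(long sinceMillis, long nowMillis) {
        TreeMap<Long, String> row = globalRequests.get(partitionOf(nowMillis));
        if (row == null) {
            return new LinkedHashSet<>();
        }
        return new LinkedHashSet<>(row.tailMap(sinceMillis, true).values());
    }

    /** Query 2: all requests for one user, a single row lookup. */
    List<String> requestsForUser(String userId) {
        TreeMap<Long, String> row = userRequests.get(userId);
        return row == null ? new ArrayList<>() : new ArrayList<>(row.values());
    }
}
```

Each query touches exactly one row: the first reads only the newest time partition, so it never has to load every partition into memory, and the second is a plain lookup by user id.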


> Sorry, I was skipping some context.  A lot of the backing indexing
> sometimes is done as a long row so in playOrm, too many rows in a partition
> means == too many columns in the indexing row for that partition.  I
> believe the same is true in cassandra for their indexing.
>

Oh, OK, you were talking about the wide row pattern, right? But playORM is
compatible with Aaron's model, isn't it? Can I map exactly this using
playORM? The hardest part of using playORM for me right now is that I don't
know Cassandra well yet, and I know playORM even less. Can I ask playORM
questions on this list? I will try to create a POC here!
Only now am I starting to understand what it does ;-) The examples
directory is empty for now, and I would like to see how to set up the
connection with it.


> Cassandra spreads all your data out on all nodes with or without
> partitions.  A single partition does have its data co-located though.
>

Now I see. The main advantage of using partitions is keeping the indexes
small enough. It has nothing to do with the nodes. Thanks!


> If you are at 100k(and the requests are rather small), you could embed all
> the requests in the user or go with Aaron's below suggestion of a
> UserRequestsCF.  If your requests are rather large, you probably don't want
> to embed them in the User.  Either way, it's one query or one row key
> lookup.
>

I see it now.


> Multiget ignores partitions…you feed it a LIST of keys and it gets them.
>  It just so happens that partitionId had to be part of your row key.
>

Do you mean I need to load all the keys in memory to do a multiget?
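
To make the question concrete: with a row key of the form userId:partitionId, the list a multiget needs could be computed instead of read from anywhere. Below is a minimal sketch, assuming numeric partition ids and a hypothetical batch size. The class and method names are made up for illustration, and no real Hector or Astyanax call is used.

```java
import java.util.ArrayList;
import java.util.List;

/** Builds the key list for a multiget without scanning any data. */
class MultigetKeysSketch {
    /** Row keys "userId:partitionId" for partitions first..last, inclusive. */
    static List<String> rowKeys(String userId, long firstPartition, long lastPartition) {
        List<String> keys = new ArrayList<>();
        for (long p = firstPartition; p <= lastPartition; p++) {
            keys.add(userId + ":" + p);
        }
        return keys;
    }

    /** Splits the key list into batches so each multiget call stays small. */
    static List<List<String>> batches(List<String> keys, int batchSize) {
        List<List<String>> out = new ArrayList<>();
        for (int i = 0; i < keys.size(); i += batchSize) {
            int end = Math.min(i + batchSize, keys.size());
            out.add(new ArrayList<>(keys.subList(i, end)));
        }
        return out;
    }
}
```

If that is right, only the keys for the partitions I care about are ever in memory, and each batch would be handed to whatever multiget call the client offers.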


> I have used Hector and now use Astyanax, I don't worry much about that
> layer, but I feed astyanax 3 nodes and I believe it discovers some of the
> other ones.  I believe the latter is true but am not 100% sure as I have
> not looked at that code.
>

Why did you move? Hector is being considered as the "official" client for
Cassandra, isn't it? I looked at the Astyanax API and it seemed much
higher level, though.


> As an analogy on the above, if you happen to have used PlayOrm, you would
> ONLY need one Requests table and you partition by user AND time(two views
> into the same data partitioned two different ways) and you can do exactly
> the same thing as Aaron's example.  PlayOrm doesn't embed the partition ids
> in the key leaving it free to partition twice like in your case….and in a
> refactor, you have to map/reduce A LOT more rows because of rows having the
> FK of <partitionid><subrowkey> whereas if you don't have partition id in
> the key, you only map/reduce the partitioned table in a redesign/refactor.
>  That said, we will be adding support for CQL partitioning in addition to
> PlayOrm partitioning even though it can be a little less flexible sometimes.
>

I am not sure I understood this part. If I need to refactor, would having
the partition id in the key be a bad thing? What would be the alternative?
In my case, since I use userId : partitionId as the row key, this might be
a problem, right?


> Also, CQL locates all the data on one node for a partition.  We have found
> it can be faster "sometimes" with the parallelized disks that the
> partitions are NOT all on one node so PlayOrm partitions are virtual only
> and do not relate to where the rows are stored.  An example on our 6 nodes
> was a join query on a partition with 1,000,000 rows took 60ms (of course I
> can't compare to CQL here since it doesn't do joins).  It really depends
> how much data is going to come back in the query though too?  There are
> tradeoffs between disk parallel nodes and having your data all on one node
> of course.


I guess I am still not ready for this level of info. :D
In the playORM readme, we have the following:

@NoSqlQuery(name="findWithJoinQuery",
    query="PARTITIONS t(:partId) SELECT t FROM TABLE as t "+
          "INNER JOIN t.activityTypeInfo as i "+
          "WHERE i.type = :type and t.numShares < :shares"),

What would happen behind the scenes when I execute this query? You can only
use joins with partition keys, right?
In this case, is partId the row id of the TABLE CF?


Thanks a lot for the answers

-- 
Marcelo Elias Del Valle
http://mvalle.com - @mvallebr
