incubator-cassandra-user mailing list archives

From "Hiller, Dean" <Dean.Hil...@nrel.gov>
Subject Re: Correct model
Date Mon, 24 Sep 2012 17:41:26 GMT
Oh, ok, you were talking about the wide row pattern, right?

yes

But playORM is compatible with Aaron's model, isn't it?

Not yet. PlayOrm supports partitioning one table multiple ways because it indexes the columns
(in your case, the userid FK column and the time column).

Can I map exactly this using playORM?

Not yet, but the plan is to map these typical Cassandra scenarios as well.

 Can I ask playOrm questions in this list?

The best place to ask PlayOrm questions is Stack Overflow, tagged with PlayOrm, though I
monitor both this list and Stack Overflow for questions (there are already a few questions
on Stack Overflow).

The examples directory is empty for now, I would like to see how to set up the connection
with it.

Running build or build.bat is always kept working and all 62 tests pass (or we don't merge
to master), so to see how to make a connection or run an example:

 1.  Run build.bat or build, which generates the parsing code.
 2.  Import into Eclipse (the .classpath and .project files are already there for you).
 3.  In FactorySingleton.java you can change IN_MEMORY to CASSANDRA (or leave it as is) and
run any of the tests in-memory or against localhost. (We also run the test suite against a
6 node cluster and everything passes.)
 4.  FactorySingleton probably has the code you are looking for. You also need a class called
nosql.Persistence or PlayOrm won't scan your jar file (a class file, not an XML file as in JPA).

Do you mean I need to load all the keys in memory to do a multi get?

No, you batch.  I am not sure about CQL, but PlayOrm returns a Cursor rather than the results,
so you can loop through every key while behind the scenes it makes batch requests: it loads
up 100 keys and makes one multi get request for those 100 keys, then loads up the next 100
keys, and so on.  PlayOrm supports this style of batching today.  I need to look more into
the APIs and protocol of CQL to see if it allows this style of batching.  Aaron would know
if CQL does.
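
As a rough illustration of the batching idea only (this is not PlayOrm or Astyanax code; the
class name BatchedMultiGet and the multiGet function are made up for the sketch, and the
function stands in for whatever multi get call the client exposes), an iterator can pull keys
100 at a time and issue one request per batch:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.NoSuchElementException;
import java.util.function.Function;

// Hypothetical sketch, not PlayOrm code: walks a stream of keys and fetches
// the matching rows in fixed-size batches, one multi-get call per batch.
final class BatchedMultiGet<K, V> implements Iterator<V> {
    private final Iterator<K> keys;
    private final int batchSize;
    private final Function<List<K>, Map<K, V>> multiGet; // stand-in for one round trip
    private Iterator<V> currentBatch = Collections.emptyIterator();
    int roundTrips = 0; // exposed only so the demo can count requests

    BatchedMultiGet(Iterator<K> keys, int batchSize, Function<List<K>, Map<K, V>> multiGet) {
        if (batchSize < 1) throw new IllegalArgumentException("batchSize must be >= 1");
        this.keys = keys;
        this.batchSize = batchSize;
        this.multiGet = multiGet;
    }

    @Override
    public boolean hasNext() {
        // Only when the current batch is used up do we read the next keys,
        // so at most batchSize keys are held in memory at a time.
        while (!currentBatch.hasNext() && keys.hasNext()) {
            List<K> batch = new ArrayList<>(batchSize);
            while (batch.size() < batchSize && keys.hasNext()) {
                batch.add(keys.next());
            }
            roundTrips++;
            currentBatch = multiGet.apply(batch).values().iterator();
        }
        return currentBatch.hasNext();
    }

    @Override
    public V next() {
        if (!hasNext()) throw new NoSuchElementException();
        return currentBatch.next();
    }
}

final class BatchedMultiGetDemo {
    public static void main(String[] args) {
        List<Integer> keys = new ArrayList<>();
        for (int i = 0; i < 250; i++) keys.add(i);

        // Fake multi-get that just fabricates a row per key.
        BatchedMultiGet<Integer, String> rows = new BatchedMultiGet<>(keys.iterator(), 100, batch -> {
            Map<Integer, String> result = new LinkedHashMap<>();
            for (Integer k : batch) result.put(k, "row-" + k);
            return result;
        });

        int count = 0;
        while (rows.hasNext()) { rows.next(); count++; }
        System.out.println(count + " rows in " + rows.roundTrips + " round trips");
    }
}
```

With 250 keys and a batch size of 100, the demo makes three requests (100, 100 and 50 keys)
and never holds more than one batch of keys at once.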

Why did you move? Hector is being considered for being the "official" client for Cassandra,
isn't it?

At the time, I wanted the file streaming feature.  Also, Hector seemed a bit cumbersome compared
to Astyanax, at least if you were building a platform and had no use for typing the columns.
Just personal preference, really.

I am not sure I understood this part. If I need to refactor, having the partition id in the
key would be a bad thing? What would be the alternative? In my case, as I use userId : partitionId
as row key, this might be a problem, right?

PlayOrm indexes the columns you choose (i.e. the ones you want to use in the where clause) and
partitions by columns you choose, not by the key.  So in PlayOrm the key is typically a TimeUUID
or something cluster unique, and any tables referencing that TimeUUID never have to change.
With Cassandra partitioning, if you repartition that table a different way or go for some kind
of major change (usually done with map/reduce), all your foreign keys "may" have to change.
It really depends on the situation, though.  Maybe you get the design right and never have
to change.
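
A small sketch of the difference, under made-up names (KeyStyles, embeddedKey and Request
are hypothetical, and java.util.UUID.randomUUID() is only a stand-in for a TimeUUID): when the
partition id is part of the row key, changing the partitioning scheme changes the key and
therefore every foreign key pointing at it, while a surrogate key stays stable.

```java
import java.util.UUID;

// Hypothetical illustration of the two keying styles; not PlayOrm or Cassandra API.
final class KeyStyles {
    // Style 1: partition id embedded in the row key, e.g. "user42:2012-09".
    static String embeddedKey(String userId, String partitionId) {
        return userId + ":" + partitionId;
    }

    // Style 2: surrogate key; the partition is just another column on the row.
    static final class Request {
        final UUID id = UUID.randomUUID(); // stand-in for a TimeUUID
        final String userId;
        String partition;                  // may be rewritten in a repartition

        Request(String userId, String partition) {
            this.userId = userId;
            this.partition = partition;
        }
    }

    public static void main(String[] args) {
        // Repartitioning from monthly to weekly buckets changes the embedded key,
        // so rows that stored the old key as a foreign key must be rewritten too.
        String monthly = embeddedKey("user42", "2012-09");
        String weekly = embeddedKey("user42", "2012-W39");
        System.out.println("embedded key changed: " + !monthly.equals(weekly));

        // With a surrogate key only the partition column changes; the id does not.
        Request r = new Request("user42", "2012-09");
        UUID before = r.id;
        r.partition = "2012-W39";
        System.out.println("surrogate key changed: " + !before.equals(r.id));
    }
}
```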

@NoSqlQuery(name="findWithJoinQuery", query="PARTITIONS t(:partId) SELECT t FROM TABLE as t "+
"INNER JOIN t.activityTypeInfo as i WHERE i.type = :type and t.numShares < :shares"),

What would happen behind the scenes when I execute this query?

In this case, t (TABLE) is a partitioned table, since a partition is defined.  t.activityTypeInfo
refers to the ActivityTypeInfo table, which is not partitioned (AND ActivityTypeInfo won't scale
to billions of rows because there is no partitioning, but maybe you don't need it!!!).  Behind
the scenes, when you call getResult, it returns a cursor that has NOT done anything yet.  When
you start looping through the cursor, it batches requests behind the scenes, asking for the
next 500 matches (configurable), so you never run out of memory.  It is EXACTLY like a database
cursor.  You can even use the cursor to show a user the first set of results and, when the user
clicks next, pick up right where the cursor left off (if you saved it to the HttpSession).
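
To make the cursor behaviour concrete, here is a minimal sketch of a lazy, resumable paging
cursor.  It is not PlayOrm's Cursor class: LazyPageCursor, fetchPage and the fetches counter
are hypothetical names, and fetchPage(offset, limit) stands in for whatever the index lookup
really does.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.function.BiFunction;

// Hypothetical sketch of a lazy cursor: nothing is fetched until next() is
// first called, and results arrive one page (e.g. 500 matches) at a time.
final class LazyPageCursor<T> {
    private final int pageSize;
    private final BiFunction<Integer, Integer, List<T>> fetchPage; // (offset, limit) -> rows
    private List<T> page = Collections.emptyList();
    private int pageStart = 0;    // offset of page.get(0) within the full result
    private int indexInPage = -1; // position of the current row inside the page
    private boolean exhausted = false;
    int fetches = 0;              // exposed only so the demo can count requests

    LazyPageCursor(int pageSize, BiFunction<Integer, Integer, List<T>> fetchPage) {
        if (pageSize < 1) throw new IllegalArgumentException("pageSize must be >= 1");
        this.pageSize = pageSize;
        this.fetchPage = fetchPage;
    }

    // Advances to the next row, fetching another page only when needed.
    boolean next() {
        if (indexInPage + 1 < page.size()) {
            indexInPage++;
            return true;
        }
        if (exhausted) return false;
        pageStart += page.size();
        page = fetchPage.apply(pageStart, pageSize);
        fetches++;
        indexInPage = 0;
        if (page.size() < pageSize) exhausted = true; // a short page means no more rows
        return !page.isEmpty();
    }

    // Only valid after next() has returned true.
    T getCurrent() {
        return page.get(indexInPage);
    }
}

final class LazyPageCursorDemo {
    public static void main(String[] args) {
        List<Integer> matches = new ArrayList<>();
        for (int i = 0; i < 1200; i++) matches.add(i);

        // Fake backend: slices an in-memory list instead of querying an index.
        LazyPageCursor<Integer> cursor = new LazyPageCursor<>(500, (offset, limit) ->
                matches.subList(Math.min(offset, matches.size()),
                                Math.min(offset + limit, matches.size())));

        System.out.println("fetches before looping: " + cursor.fetches);
        int shown = 0;
        while (shown < 500 && cursor.next()) shown++; // "first page" for the user
        // The same cursor object could be stored (e.g. in a session) and resumed later.
        while (cursor.next()) shown++;
        System.out.println(shown + " rows, " + cursor.fetches + " fetches");
    }
}
```

Because the cursor keeps its own position, holding on to the object is enough to continue
from where the previous loop stopped.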

You can only use joins with partition keys, right?

Nope, joins work on anything.  You only need to specify the partitionId when you have a partitioned
table in the list of join tables.  (That is what the PARTITIONS clause is for: to identify
partitionId = what?  It was put BEFORE the SQL instead of within it.  CQL took the opposite
approach, but PlayOrm can also join different partitions together ;) )

In this case, is partId the row id of TABLE CF?

Nope, partId is one of the columns.  There is a test case on this class in PlayOrm (notice
the annotation NoSqlPartitionByThisField on the column/field in the entity):

https://github.com/deanhiller/playorm/blob/master/input/javasrc/com/alvazan/test/db/PartitionedSingleTrade.java

PlayOrm allows partitioned tables AND non-partitioned tables (non-partitioned tables won't scale,
but maybe you will never have that many rows).  You can join any combination of the two
(non-partitioned with partitioned, non-partitioned with non-partitioned, partitioned with
another partitioned).

I only prefer Stack Overflow because I like referencing links/questions by their URLs.  Referencing
this email later on is very hard, as I have to find it, so in general I HATE email lists ;)
but it seems Cassandra prefers them.  Any questions on PlayOrm you can put on Stack Overflow;
I am not sure how many people on this list are interested, so it creates less noise on this
list too.

Later,
Dean


From: Marcelo Elias Del Valle <mvallebr@gmail.com>
Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Date: Monday, September 24, 2012 11:07 AM
To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Subject: Re: Correct model



2012/9/24 Hiller, Dean <Dean.Hiller@nrel.gov>
I am confused.  In this email you say you want "get all requests for a user" and in a previous
one you said "Select all the users which has new requests, since date D" so let me answer
both…

I have both needs. These are the two queries I need to perform on the model.

For the latter, you make ONE query into the latest partition (ONE partition) of the GlobalRequestsCF,
which gives you the most recent requests ALONG with the user ids of those requests.  If you
queried all partitions, you would most likely blow out your JVM memory.

For the former, you make ONE query to the UserRequestsCF with userid = <your user id>
to get all the requests for that user

Now I think I got the main idea! This answered a lot!

Sorry, I was skipping some context.  A lot of the backing indexing is sometimes done as a
long row, so in playOrm too many rows in a partition means too many columns in the indexing
row for that partition.  I believe the same is true in Cassandra for its indexing.

Oh, ok, you were talking about the wide row pattern, right? But playORM is compatible with
Aaron's model, isn't it? Can I map exactly this using playORM? The hardest thing for me in
using playORM now is that I don't know Cassandra well yet, and I know playORM even less. Can
I ask playOrm questions in this list? I will try to create a POC here!
Only now am I starting to understand what it does ;-) The examples directory is empty for
now; I would like to see how to set up the connection with it.

Cassandra spreads all your data out on all nodes with or without partitions.  A single partition
does have its data co-located though.

Now I see. The main advantage of using partitions is keeping the indexes small enough. It
has nothing to do with the nodes. Thanks!

If you are at 100k (and the requests are rather small), you could embed all the requests in
the user or go with Aaron's suggestion below of a UserRequestsCF.  If your requests are rather
large, you probably don't want to embed them in the User.  Either way, it's one query or one
row key lookup.

I see it now.

Multiget ignores partitions…you feed it a LIST of keys and it gets them.  It just so happens
that partitionId had to be part of your row key.

Do you mean I need to load all the keys in memory to do a multiget?

I have used Hector and now use Astyanax.  I don't worry much about that layer, but I feed Astyanax
3 nodes and I believe it discovers some of the other ones.  I am not 100% sure about that
discovery, though, as I have not looked at that code.

Why did you move? Hector is being considered for being the "official" client for Cassandra,
isn't it? I looked at the Astyanax api and it seemed much more high level though

As an analogy on the above, if you happened to use PlayOrm, you would ONLY need one Requests
table, partitioned by user AND by time (two views into the same data, partitioned two different
ways), and you could do exactly the same thing as Aaron's example.  PlayOrm doesn't embed the
partition ids in the key, leaving it free to partition twice as in your case.  With the partition
id in the key, a refactor has to map/reduce A LOT more rows, because other rows hold the FK
of <partitionid><subrowkey>, whereas if you don't have the partition id in the key, you only
map/reduce the partitioned table in a redesign/refactor.  That said, we will be adding support
for CQL partitioning in addition to PlayOrm partitioning, even though it can be a little less
flexible sometimes.

I am not sure I understood this part. If I need to refactor, having the partition id in the
key would be a bad thing? What would be the alternative? In my case, as I use userId : partitionId
as row key, this might be a problem, right?

Also, CQL locates all the data for a partition on one node.  We have found it can "sometimes"
be faster, with the parallelized disks, when the partitions are NOT all on one node, so PlayOrm
partitions are virtual only and do not relate to where the rows are stored.  As an example, on
our 6 nodes a join query on a partition with 1,000,000 rows took 60ms (of course I can't
compare to CQL here since it doesn't do joins).  It also really depends on how much data is
going to come back in the query.  There are tradeoffs between parallel disks across nodes and
having your data all on one node, of course.

I guess I am still not ready for this level of info. :D
In the playORM readme, we have the following:


@NoSqlQuery(name="findWithJoinQuery", query="PARTITIONS t(:partId) SELECT t FROM TABLE as t "+
"INNER JOIN t.activityTypeInfo as i WHERE i.type = :type and t.numShares < :shares"),

What would happen behind the scenes when I execute this query? You can only use joins with
partition keys, right?
In this case, is partId the row id of TABLE CF?


Thanks a lot for the answers

--
Marcelo Elias Del Valle
http://mvalle.com - @mvallebr
