incubator-cassandra-user mailing list archives

From Marcelo Elias Del Valle <mvall...@gmail.com>
Subject Re: Correct model
Date Mon, 24 Sep 2012 19:54:55 GMT
Dean, this sounds like magic :D
I don't know the details of how the index implementations you chose
perform, but it would be worth using in my case: I don't need the best
read performance in the world, but I do need to ensure scalability and
have a simple model to maintain. I like the playOrm concept in this
regard.
I have more questions, but I will ask them on Stack Overflow from now on.

2012/9/24 Hiller, Dean <Dean.Hiller@nrel.gov>

> PlayOrm will automatically create a CF to index my CF?
>
> It creates 3 CF's for all indices, IntegerIndice, DecimalIndice, and
> StringIndice such that the ad-hoc tool that is in development can display
> the indices as it knows the prefix of the composite column name is of
> Integer, Decimal or String and it knows the postfix type as well so it can
> translate back from bytes to the types and properly display in a GUI (i.e.
> On top of SELECT, the ad-hoc tool is adding a way to view the index rows
> so you can check whether they got corrupted or not).
>
> Will it auto-manage it, like Cassandra's secondary indexes?
>
> YES
>
> Further detail…
>
> You annotate fields with @NoSqlIndexed and PlayOrm adds/removes from the
> index as you add/modify/remove the entity. A modify removes the old value
> from the index and inserts the new value into the index.
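The add/modify/remove behaviour described above can be sketched with a
plain in-memory sorted map standing in for an index row. This is only an
illustration of the idea, not PlayOrm's code: the class IndexSketch and
its methods are hypothetical names, and a real index would live in a
column family rather than in a TreeMap.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical illustration only; none of these names come from PlayOrm.
class IndexSketch {
    // indexed value -> row keys holding that value (kept sorted, like
    // column names inside a wide index row)
    private final TreeMap<Integer, Set<String>> index = new TreeMap<>();
    // row key -> currently stored value, so a modify knows what to remove
    private final Map<String, Integer> rows = new HashMap<>();

    // Add or modify: drop the old index entry (if any), then insert the new one.
    void save(String rowKey, int value) {
        Integer old = rows.put(rowKey, value);
        if (old != null) {
            removeFromIndex(rowKey, old);
        }
        index.computeIfAbsent(value, v -> new TreeSet<>()).add(rowKey);
    }

    // Remove: delete the row and its index entry together.
    void remove(String rowKey) {
        Integer old = rows.remove(rowKey);
        if (old != null) {
            removeFromIndex(rowKey, old);
        }
    }

    // Range lookup: row keys whose value is strictly less than the bound.
    Set<String> lessThan(int bound) {
        Set<String> result = new TreeSet<>();
        for (Set<String> keys : index.headMap(bound, false).values()) {
            result.addAll(keys);
        }
        return result;
    }

    private void removeFromIndex(String rowKey, int value) {
        Set<String> keys = index.get(value);
        if (keys != null) {
            keys.remove(rowKey);
            if (keys.isEmpty()) {
                index.remove(value);
            }
        }
    }
}
```

The point of the sketch is that the caller only ever calls save or
remove; keeping the index consistent is the library's job.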
>
> An example: PlayOrm stores all long, int, short, and byte values in a type
> that uses the least amount of space, so IF you have a long OR BigInteger
> between -128 and 127 it only ends up storing 1 byte in Cassandra (SAVING tons
> of space!!!).  Then if you are indexing a type that is one of those,
> PlayOrm creates an IntegerIndice table.
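The space saving can be illustrated with the JDK alone:
BigInteger.toByteArray() returns the minimal two's-complement encoding,
so values from -128 to 127 take a single byte. Whether PlayOrm uses this
exact encoding is an assumption; the class name CompactIntegerSketch is
hypothetical.

```java
import java.math.BigInteger;

// Hypothetical illustration only; not PlayOrm's actual converter.
class CompactIntegerSketch {
    // Minimal two's-complement encoding: only as many bytes as the value needs.
    static byte[] encode(long value) {
        return BigInteger.valueOf(value).toByteArray();
    }

    // Reverse of encode: rebuild the number from its minimal byte form.
    static long decode(byte[] bytes) {
        return new BigInteger(bytes).longValue();
    }
}
```

A long that would normally occupy 8 bytes shrinks to 1 byte for small
values and grows only as the magnitude requires.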
>
> Right now, another guy is working on playorm-server, which is a web GUI to
> allow ad-hoc access to all your data, so you can run ad-hoc queries to
> see data and, instead of showing hex, it shows the real values by
> translating the bytes to Strings for the portions of the schema that it is
> aware of.
>
> Later,
> Dean
>
> From: Marcelo Elias Del Valle <mvallebr@gmail.com>
> Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
> Date: Monday, September 24, 2012 12:09 PM
> To: "user@cassandra.apache.org" <user@cassandra.apache.org>
> Subject: Re: Correct model
>
> Dean,
>
>     There is one last thing I would like to ask about playOrm on this
> list; the next questions will go to Stack Overflow. Because of the
> context, I prefer asking this here:
>      When you say playOrm indexes a table (which would be a CF behind the
> scenes), what do you mean? PlayOrm will automatically create a CF to index
> my CF? Will it auto-manage it, like Cassandra's secondary indexes?
>      In Cassandra, the application is responsible for maintaining the
> index, right? I might be wrong, but unless I am using secondary indexes I
> need to update index values manually, right?
>      I got confused when you said "PlayOrm indexes the columns you
> choose". How do I choose, and what exactly does it mean?
>
> Best regards,
> Marcelo Valle.
>
> 2012/9/24 Hiller, Dean <Dean.Hiller@nrel.gov>
> Oh, ok, you were talking about the wide row pattern, right?
>
> yes
>
> But playORM is compatible with Aaron's model, isn't it?
>
> Not yet, PlayOrm supports partitioning one table multiple ways as it
> indexes the columns(in your case, the userid FK column and the time column)
>
> Can I map exactly this using playORM?
>
> Not yet, but the plan is to map these typical Cassandra scenarios as well.
>
>  Can I ask playOrm questions in this list?
>
> The best place to ask PlayOrm questions is on stack overflow and tag with
> PlayOrm though I monitor this list and stack overflow for questions(there
> are already a few questions on stack overflow).
>
> The examples directory is empty for now, I would like to see how to set up
> the connection with it.
>
> The build (build or build.bat) is always kept working and all 62 tests
> pass (or we don't merge to master), so to see how to make a connection or
> run an example:
>
>  1.  Run build.bat or build, which generates the parsing code
>  2.  Import into eclipse (.classpath and .project are already there for
> you)
>  3.  In FactorySingleton.java you can change IN_MEMORY to CASSANDRA (or
> not) and run any of the tests in-memory or against localhost (we also run
> the test suite against a 6 node cluster and it all passes)
>  4.  FactorySingleton probably has the code you are looking for, plus you
> need a class called nosql.Persistence or it won't scan your jar file (a
> class file, not an xml file like JPA)
>
> Do you mean I need to load all the keys in memory to do a multi get?
>
> No, you batch.  I am not sure about CQL, but PlayOrm returns a Cursor not
> the results so you can loop through every key and behind the scenes it is
> doing batch requests so you can load up 100 keys and make one multi get
> request for those 100 keys and then can load up the next 100 keys, etc.
> etc. etc.  I need to look more into the apis and protocol of CQL to see if
> it allows this style of batching.  PlayOrm does support this style of
> batching today.  Aaron would know if CQL does.
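The batching idea (walk the keys, but only hold one batch at a time and
issue one multi-get per batch) can be sketched as follows. The class
BatchedMultiGetSketch and the multiGet function are hypothetical
stand-ins for whatever client call fetches many rows at once; this is
not the PlayOrm or Astyanax API.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.function.BiConsumer;
import java.util.function.Function;

// Hypothetical illustration only; the real clients have their own APIs.
class BatchedMultiGetSketch {
    // Consumes keys lazily and issues one multi-get per full (or final) batch.
    // Returns the number of multi-get requests that were made.
    static <K, V> int process(Iterator<K> keys, int batchSize,
                              Function<List<K>, Map<K, V>> multiGet,
                              BiConsumer<K, V> handler) {
        int requests = 0;
        List<K> batch = new ArrayList<>(batchSize);
        while (keys.hasNext()) {
            batch.add(keys.next());
            // Flush when the batch is full or there are no more keys.
            if (batch.size() == batchSize || !keys.hasNext()) {
                multiGet.apply(batch).forEach(handler);
                requests++;
                batch = new ArrayList<>(batchSize);
            }
        }
        return requests;
    }
}
```

With a batch size of 100, 250 keys turn into three requests, and at no
point are more than 100 keys held for a pending request.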
>
> Why did you move? Hector is being considered for being the "official"
> client for Cassandra, isn't it?
>
> At the time, I wanted the file streaming feature.  Also, Hector seemed a
> bit cumbersome as well compared to astyanax or at least if you were
> building a platform and had no use for typing the columns.  Just personal
> preference really here.
>
> I am not sure I understood this part. If I need to refactor, having the
> partition id in the key would be a bad thing? What would be the
> alternative? In my case, as I use userId : partitionId as row key, this
> might be a problem, right?
>
> PlayOrm indexes the columns you choose(ie. The ones you want to use in the
> where clause) and partitions by columns you choose not based on the key so
> in PlayOrm, the key is typically a TimeUUID or something cluster
> unique…..any tables referencing that TimeUUID never have to change.  With
> Cassandra partitioning, if you repartition that table a different way or go
> for some kind of major change(usually done with map/reduce), all your
> foreign keys "may" have to change….it really depends on the situation
> though.  Maybe you get the design right and never have to change.
>
> @NoSqlQuery(name="findWithJoinQuery",
>     query="PARTITIONS t(:partId) SELECT t FROM TABLE as t "+
>     "INNER JOIN t.activityTypeInfo as i WHERE i.type = :type and t.numShares < :shares"),
>
> What would happen behind the scenes when I execute this query?
>
> In this case, t or TABLE is a partitioned table since a partition is
> defined.  And t.activityTypeInfo refers to the ActivityTypeInfo table which
> is not partitioned(AND ActivityTypeInfo won't scale to billions of rows
> because there is no partitioning but maybe you don't need it!!!).  Behind
> the scenes when you call getResult, it returns a cursor that has NOT done
> anything yet.  When you start looping through the cursor, behind the scenes
> it is batching requests asking for next 500 matches(configurable) so you
> never run out of memory….it is EXACTLY like a database cursor.  You can
> even use the cursor to show a user the first set of results and when user
> clicks next pick up right where the cursor left off (if you saved it to the
> HttpSession).
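The cursor behaviour described above (nothing is fetched until you
iterate, then one page at a time) can be sketched as a lazy iterator.
PagingCursorSketch and its fetchPage function are hypothetical names for
illustration; PlayOrm's own Cursor type may look quite different.

```java
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.function.BiFunction;

// Hypothetical illustration only; not PlayOrm's Cursor implementation.
class PagingCursorSketch<T> implements Iterator<T> {
    // (offset, limit) -> next page of results; stands in for a backend query
    private final BiFunction<Integer, Integer, List<T>> fetchPage;
    private final int pageSize;
    private int offset = 0;
    private Iterator<T> current = Collections.emptyIterator();
    private boolean exhausted = false;
    int pagesFetched = 0; // exposed so the laziness can be observed

    PagingCursorSketch(BiFunction<Integer, Integer, List<T>> fetchPage, int pageSize) {
        this.fetchPage = fetchPage;
        this.pageSize = pageSize;
    }

    @Override
    public boolean hasNext() {
        // Fetch the next page only when the current one is used up.
        if (!current.hasNext() && !exhausted) {
            List<T> page = fetchPage.apply(offset, pageSize);
            pagesFetched++;
            offset += page.size();
            exhausted = page.size() < pageSize; // a short page means no more data
            current = page.iterator();
        }
        return current.hasNext();
    }

    @Override
    public T next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        return current.next();
    }
}
```

Because the cursor only remembers an offset and the current page, memory
use stays bounded by the page size, and iteration can be paused and
picked up again later from the same object.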
>
> You can only use joins with partition keys, right?
>
> Nope, joins work on anything.  You only need to specify the partitionId
> when you have a partitioned table in the list of join tables. (That is what
> the PARTITIONS clause is for: to identify which partitionId to use. It was
> put BEFORE the SQL instead of within it; CQL took the opposite approach,
> but PlayOrm can also join different partitions together ;) ).
>
> In this case, is partId the row id of TABLE CF?
>
> Nope, partId is one of the columns.  There is a test case on this class in
> PlayOrm …(notice the annotation NoSqlPartitionByThisField on the
> column/field in the entity)…
>
>
> https://github.com/deanhiller/playorm/blob/master/input/javasrc/com/alvazan/test/db/PartitionedSingleTrade.java
>
> PlayOrm allows partitioned tables AND non-partitioned tables (non-partitioned
> tables won't scale, but maybe you will never have that many rows).  You can
> join any two combinations (non-partitioned with partitioned, non-partitioned
> with non-partitioned, partitioned with another partitioned).
>
> I only prefer Stack Overflow because I like referencing links/questions by
> their URLs.  Referencing this email later on is very hard as I have to
> find it, so in general, I HATE email lists ;) but it seems Cassandra prefers
> them. Any questions on PlayOrm you can put on Stack Overflow; I am not sure
> how many people on this list are interested, so it creates less noise on
> this list too.
>
> Later,
> Dean
>
>
> From: Marcelo Elias Del Valle <mvallebr@gmail.com>
> Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
> Date: Monday, September 24, 2012 11:07 AM
> To: "user@cassandra.apache.org" <user@cassandra.apache.org>
> Subject: Re: Correct model
>
>
>
> 2012/9/24 Hiller, Dean <Dean.Hiller@nrel.gov>
> I am confused.  In this email you say you want "get all requests for a
> user" and in a previous one you said "Select all the users which has new
> requests, since date D" so let me answer both…
>
> I have both needs. These are the two queries I need to perform on the
> model.
>
> For latter, you make ONE query into the latest partition(ONE partition) of
> the GlobalRequestsCF which gives you the most recent requests ALONG with
> the user ids of those requests.  If you queried all partitions, you would
> most likely blow out your JVM memory.
>
> For the former, you make ONE query to the UserRequestsCF with userid =
> <your user id> to get all the requests for that user
>
> Now I think I got the main idea! This answered a lot!
>
> Sorry, I was skipping some context.  A lot of the backing indexing is
> sometimes done as a long row, so in playOrm, too many rows in a partition
> means too many columns in the indexing row for that partition.  I
> believe the same is true in Cassandra for its indexing.
>
> Oh, ok, you were talking about the wide row pattern, right? But playORM is
> compatible with Aaron's model, isn't it? Can I map exactly this using
> playORM? The hardest thing for me to use playORM now is I don't know
> Cassandra well yet, and I know playORM even less. Can I ask playOrm
> questions in this list? I will try to create a POC here!
> Only now I am starting to understand what it does ;-) The examples
> directory is empty for now, I would like to see how to set up the
> connection with it.
>
> Cassandra spreads all your data out on all nodes with or without
> partitions.  A single partition does have its data co-located though.
>
> Now I see. The main advantage of using partitions is keeping the indexes
> small enough. It has nothing to do with the nodes. Thanks!
>
> If you are at 100k(and the requests are rather small), you could embed all
> the requests in the user or go with Aaron's below suggestion of a
> UserRequestsCF.  If your requests are rather large, you probably don't want
> to embed them in the User.  Either way, it's one query or one row key
> lookup.
>
> I see it now.
>
> Multiget ignores partitions…you feed it a LIST of keys and it gets them.
>  It just so happens that partitionId had to be part of your row key.
>
> Do you mean I need to load all the keys in memory to do a multiget?
>
> I have used Hector and now use Astyanax, I don't worry much about that
> layer, but I feed astyanax 3 nodes and I believe it discovers some of the
> other ones.  I believe the latter is true but am not 100% sure as I have
> not looked at that code.
>
> Why did you move? Hector is being considered for being the "official"
> client for Cassandra, isn't it? I looked at the Astyanax api and it seemed
> much more high level though
>
> As an analogy on the above, if you happen to have used PlayOrm, you would
> ONLY need one Requests table and you partition by user AND time(two views
> into the same data partitioned two different ways) and you can do exactly
> the same thing as Aaron's example.  PlayOrm doesn't embed the partition ids
> in the key leaving it free to partition twice like in your case….and in a
> refactor, you have to map/reduce A LOT more rows because of rows having the
> FK of <partitionid><subrowkey> whereas if you don't have partition id in
> the key, you only map/reduce the partitioned table in a redesign/refactor.
>  That said, we will be adding support for CQL partitioning in addition to
> PlayOrm partitioning even though it can be a little less flexible sometimes.
>
> I am not sure I understood this part. If I need to refactor, having the
> partition id in the key would be a bad thing? What would be the
> alternative? In my case, as I use userId : partitionId as row key, this
> might be a problem, right?
>
> Also, CQL locates all the data on one node for a partition.  We have found
> it can be faster "sometimes" with the parallelized disks that the
> partitions are NOT all on one node so PlayOrm partitions are virtual only
> and do not relate to where the rows are stored.  An example on our 6 nodes
> was a join query on a partition with 1,000,000 rows took 60ms (of course I
> can't compare to CQL here since it doesn't do joins).  It really depends
> on how much data is going to come back in the query, though.  There are
> tradeoffs between spreading data over the parallel disks of many nodes and
> having your data all on one node, of course.
>
> I guess I am still not ready for this level of info. :D
> In the playORM readme, we have the following:
>
>
> @NoSqlQuery(name="findWithJoinQuery",
>     query="PARTITIONS t(:partId) SELECT t FROM TABLE as t "+
>     "INNER JOIN t.activityTypeInfo as i WHERE i.type = :type and t.numShares < :shares"),
>
> What would happen behind the scenes when I execute this query? You can
> only use joins with partition keys, right?
> In this case, is partId the row id of TABLE CF?
>
>
> Thanks a lot for the answers
>
> --
> Marcelo Elias Del Valle
> http://mvalle.com - @mvallebr



-- 
Marcelo Elias Del Valle
http://mvalle.com - @mvallebr
