incubator-cassandra-user mailing list archives

From aaron morton <aa...@thelastpickle.com>
Subject Re: Correct model
Date Sun, 23 Sep 2012 21:34:37 GMT
Yup.

(Multi get is just a convenience method; it explodes into multiple gets on the server side.)

Cheers

-----------------
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 24/09/2012, at 5:01 AM, "Hiller, Dean" <Dean.Hiller@nrel.gov> wrote:

> But the only advantage in this solution is to split data among partitions?
> 
> You need to split data among partitions or your query won't scale as more and more data
is added to the table.  Having the partition means you are querying far fewer rows.
> 
> What do you mean here by current partition?
> 
> He means determine the ONE partition key and query that partition, i.e., if you want just
the latest user requests, figure out the partition key based on which month you are in and query
it.  If you want the latest independent of user, query the correct single partition of the GlobalRequests
CF.
> 
> If I want all the requests for the user, couldn't I just select all UserRequest records
which start with "userId"?
> 
> He designed it so the user requests table was completely scalable, so he has partitions
there.  If you don't have partitions, you could run into a row that is too long.  You don't
need to design it this way if you know none of your users are going to go into the millions
as far as number of requests.  In his design, then, you need to pick the correct partition
and query into that partition.
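
To fetch all of a user's requests under this design, the client can walk the partitions one at a time, newest first, rather than scanning by key prefix. A minimal sketch, assuming monthly partitions labelled "YYYY-MM" and row keys of the form "user_id:partition" (the function names and the key format are illustrative, not taken from the thread):

```python
from datetime import date


def monthly_partitions_newest_first(first: date, last: date):
    """Yield 'YYYY-MM' labels from the month of `last` back to the month of `first`."""
    year, month = last.year, last.month
    while (year, month) >= (first.year, first.month):
        yield f"{year:04d}-{month:02d}"
        month -= 1
        if month == 0:
            # Roll back from January to December of the previous year.
            year, month = year - 1, 12


def user_request_row_keys(user_id: str, first: date, last: date):
    """Row keys to query, one per monthly partition, most recent partition first."""
    return [f"{user_id}:{p}" for p in monthly_partitions_newest_first(first, last)]
```

Each key addresses exactly one row, so the client can page through the newest partition and only move on to older ones when it needs more data.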
> 
> I really didn't understand why to use partitions.
> 
> If you know your rows will go into the trillions, partitions are a way of breaking them
up so each partition has 100k rows or so, or even 1 million, but most likely maxes out in the
millions.  Without partitions, you hit a limit in the millions.  With partitions, you can keep
scaling past that, as you can have as many partitions as you want.
> 
> A multi-get is a query that finds IN PARALLEL all the rows with the matching keys you
send to Cassandra.  If you do 1000 gets (instead of a multi-get) with 1ms latency, you will
find it takes 1 second + processing time.  If you do ONE multi-get, you only have 1 request
and therefore 1ms latency.  The other solution is you could send 1000 "async" gets, but I
have a feeling that would be slower with all the marshalling/unmarshalling of the envelope... it really
depends on the envelope size; e.g., if we were using HTTP, you would get killed doing 1000 requests
instead of 1 with 1000 keys in it.
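
The round-trip arithmetic above can be written out as a small sketch. It counts only client-to-server round trips and ignores server-side processing time, which, as noted at the top of this message, still involves one get per key. The function names and the `batch_size` parameter are illustrative assumptions:

```python
def sequential_latency_ms(num_gets: int, round_trip_ms: float) -> float:
    """Round-trip cost of issuing gets one after another."""
    return num_gets * round_trip_ms


def batched_latency_ms(num_keys: int, round_trip_ms: float, batch_size: int) -> float:
    """Round-trip cost when keys are grouped into multi-get calls of `batch_size` keys."""
    batches = -(-num_keys // batch_size)  # ceiling division
    return batches * round_trip_ms
```

With 1000 keys and a 1 ms round trip, the sequential case costs 1000 ms of network latency, while a single multi-get carrying all 1000 keys costs 1 ms plus processing time.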
> 
> Later,
> Dean
> 
> From: Marcelo Elias Del Valle <mvallebr@gmail.com>
> Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
> Date: Sunday, September 23, 2012 10:23 AM
> To: "user@cassandra.apache.org" <user@cassandra.apache.org>
> Subject: Re: Correct model
> 
> 
> 2012/9/20 aaron morton <aaron@thelastpickle.com>
> I would consider:
> 
> # User CF
> * row_key: user_id
> * columns: user properties, key=value
> 
> # UserRequests CF
> * row_key: <user_id : partition_start> where partition_start is the start of a
time partition that makes sense in your domain, e.g. partition monthly. You generally want to
avoid rows that grow forever; as a rule of thumb, avoid rows of more than a few tens of MB.
> * columns: two possible approaches:
> 1) If the requests are immutable and you generally want all of the data, store the request
in a single column using JSON or similar, with the column name a timestamp.
> 2) Otherwise use a composite column name of <timestamp : request_property> to store
the request in many columns.
> * In either case consider using Reversed comparators so the most recent columns are first;
 see http://thelastpickle.com/2011/10/03/Reverse-Comparators/
> 
> # GlobalRequests CF
> * row_key: partition_start - time partition as above. It may be easier to use the same
partition scheme.
> * column name: <timestamp : user_id>
> * column value: empty
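
The key layout above can be sketched client side as follows. This is only an illustration under stated assumptions: monthly partitions labelled "YYYY-MM" in UTC, ":" as the row-key separator, and millisecond timestamps in column names. None of these specifics come from the thread, and the function names are hypothetical.

```python
from datetime import datetime, timezone


def partition_start(ts: datetime) -> str:
    """Label of the monthly partition containing `ts` (expects a timezone-aware datetime)."""
    return ts.astimezone(timezone.utc).strftime("%Y-%m")


def user_requests_row_key(user_id: str, ts: datetime) -> str:
    """UserRequests row key: <user_id : partition_start>."""
    return f"{user_id}:{partition_start(ts)}"


def global_requests_row_key(ts: datetime) -> str:
    """GlobalRequests row key: the time partition alone."""
    return partition_start(ts)


def global_requests_column_name(ts: datetime, user_id: str) -> tuple:
    """GlobalRequests composite column name: <timestamp : user_id>."""
    return (int(ts.timestamp() * 1000), user_id)
```

Using the same partition function for both column families keeps the two-step lookup simple: the partition label found for GlobalRequests can be reused to build the UserRequests keys.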
> 
> Ok, I think I understood your suggestion... But the only advantage in this solution is
to split data among partitions? I understood how it would work, but I didn't understand why
it's better than the other solution, without the GlobalRequests CF
> 
> - Select all the requests for a user
> Work out the current partition client side, get the first N columns. Then page.
> 
> What do you mean here by current partition? You mean I would perform a query for each
partition? If I want all the requests for the user, couldn't I just select all UserRequest
records which start with "userId"? I might be missing something here, but in my understanding
if I use Hector to query a column family I can do that and Cassandra servers will automatically
communicate with each other to get the data I need, right? Is it bad? I really didn't understand
why to use partitions.
> 
> 
> - Select all the users who have new requests since date D
> Work out the current partition client side, get the first N columns from GlobalRequests,
make a multi get call to UserRequests
> NOTE: Assuming the size of the global requests space is not huge.
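
The two-step lookup described here can be illustrated with plain dictionaries standing in for the two column families. This is a simulation only: no Cassandra client is involved, the data is made up, and `multiget` is a hypothetical stand-in for a client library's multi-get call. It assumes GlobalRequests columns are stored newest first (reversed comparator).

```python
# In-memory stand-ins for the two column families (hypothetical data).
global_requests = {
    # row key: partition -> columns as (timestamp, user_id), newest first
    "2012-09": [(1348400000, "alice"), (1348300000, "bob"), (1348200000, "alice")],
}
user_requests = {
    # row key: "user_id:partition" -> {timestamp: serialized request}
    "alice:2012-09": {1348400000: '{"path": "/a"}', 1348200000: '{"path": "/b"}'},
    "bob:2012-09": {1348300000: '{"path": "/c"}'},
}


def users_with_requests_since(partition, since_ts, limit=100):
    """Read the first `limit` columns of one GlobalRequests row and collect distinct users."""
    users = []
    for ts, user_id in global_requests.get(partition, [])[:limit]:
        if ts < since_ts:
            break  # columns are newest first, so everything after this is older
        if user_id not in users:
            users.append(user_id)
    return users


def multiget(row_keys):
    """Stand-in for a multi-get: one call carrying many row keys."""
    return {k: user_requests[k] for k in row_keys if k in user_requests}


partition = "2012-09"
users = users_with_requests_since(partition, since_ts=1348250000)
rows = multiget([f"{u}:{partition}" for u in users])
```

In a real client the first step would be a column slice on a single GlobalRequests row and the second a single multi-get over the derived UserRequests keys.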
> Hope that helps.
> For sure it is helping a lot. However, I don't know what a multiget is... I saw the Hector
API reference and found this method, but I'm not sure what Cassandra would do internally
if I do a multiget... Is this expensive in terms of performance and latency?
> 
> --
> Marcelo Elias Del Valle
> http://mvalle.com - @mvallebr

