incubator-cassandra-user mailing list archives

From Marcelo Elias Del Valle <mvall...@gmail.com>
Subject Re: Correct model
Date Mon, 24 Sep 2012 13:45:15 GMT
2012/9/23 Hiller, Dean <Dean.Hiller@nrel.gov>

> You need to split data among partitions or your query won't scale as more
> and more data is added to the table.  Having the partition means you are
> querying far fewer rows.
>
That holds if I can query just one partition. But if I need to
query things in multiple partitions, wouldn't it be slower?


> He means determine the ONE partition key and query that partition.  Ie. If
> you want just latest user requests, figure out the partition key based on
> which month you are in and query it.  If you want the latest independent of
> user, query the correct single partition for GlobalRequests CF.
>

But in that case, I didn't understand Aaron's model. My first query is
to get all requests for a user. If I partition by time, I will need
to query all partitions to get the results, right? His answer said
I would query ONE partition...
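
A minimal sketch of the client-side key computation being discussed, assuming monthly partitions and a row key of the form `<user_id>:<YYYY-MM>`. The function names and the key format are hypothetical and are not taken from Aaron's post or from any client library. It shows that "latest requests" maps to one row key, while "all requests" maps to one row key per month, newest first.

```python
from datetime import date
from typing import List


def partition_start(day: date) -> str:
    """Label of the monthly partition containing `day`, e.g. '2012-09'."""
    return f"{day.year:04d}-{day.month:02d}"


def row_key(user_id: str, day: date) -> str:
    """Hypothetical UserRequests row key: '<user_id>:<partition_start>'."""
    return f"{user_id}:{partition_start(day)}"


def row_keys_between(user_id: str, first: date, last: date) -> List[str]:
    """All monthly row keys for one user from `first` to `last`, newest first."""
    keys = []
    year, month = last.year, last.month
    while (year, month) >= (first.year, first.month):
        keys.append(f"{user_id}:{year:04d}-{month:02d}")
        month -= 1
        if month == 0:  # step back across a year boundary
            year, month = year - 1, 12
    return keys
```

Under these assumptions, the latest requests for a user need only `row_key(user_id, today)`, and the full history needs the list from `row_keys_between`, which could be read page by page or passed to a multi-get.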


> If I want all the requests for the user, couldn't I just select all
> UserRequest records which start with "userId"?
> He designed it so the user requests table was completely scalable so he
> has partitions there.  If you don't have partitions, you could run into a
> row that is toooo long.  You don't need to design it this way if you know
> none of your users are going to go into the millions as far as number of
> requests.  In his design then, you need to pick the correct partition and
> query into that partition.
>
You mean too many rows, not a row too long, right? I am assuming each
request will be a different row, not a new column. Is having billions of
ROWS a performance problem in Cassandra? I know Cassandra allows up to
2 billion columns per row, but I am not aware of a limit on the number of rows...


> I really didn't understand why to use partitions.
> Partitions are a way, if you know your rows will go into the trillions, of
> breaking them up so each partition has 100k rows or so, or even 1 million,
> but most likely maxes out in the millions.  Without partitions, you hit a
> limit in the millions.  With partitions, you can keep scaling past that as
> you can have as many partitions as you want.
>

If I understood it correctly, if I don't specify partitions, Cassandra will
store all my data in a single node? I thought Cassandra would automatically
distribute my data among nodes as I insert rows into a CF. Of course if I
use partitions I understand I could query just one partition (node) to get
the data, if I have the partition field, but to the best of my knowledge,
this is not what happens in my case, right? In the first query I would have
to query all the partitions...
Or are you saying partitions have nothing to do with nodes? If 99.999% of
my users will have fewer than 100k requests, would it make sense to
partition by user?
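
A toy sketch of the distinction asked about here, assuming a hash-based partitioner: each row key is hashed and the hash decides which node owns the row, so time partitions of the same user can land on different nodes. The modulo placement and the name `toy_node_for` are simplifications made up for illustration. A real cluster uses a token ring with replication, not this formula.

```python
import hashlib


def toy_node_for(row_key: str, node_count: int) -> int:
    """Map a row key to a node index by hashing it (toy model only)."""
    digest = hashlib.md5(row_key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % node_count


# Twelve monthly rows for one hypothetical user on a 4-node cluster.
keys = [f"user42:2012-{month:02d}" for month in range(1, 13)]
placement = {key: toy_node_for(key, 4) for key in keys}

for key, node in placement.items():
    print(key, "-> node", node)
```

In this model the "partition" is only part of the row key. Which node stores a row follows from the hash of the whole key, not from the partition scheme chosen in the data model.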


> A multi-get is a query that finds IN PARALLEL all the rows with the
> matching keys you send to cassandra.  If you do 1000 gets (instead of a
> multi-get) with 1ms latency, you will find it takes 1 second + processing
> time.  If you do ONE multi-get, you only have 1 request and therefore 1ms
> latency.  The other solution is you could send 1000 "async" gets but I
> have a feeling that would be slower with all the marshalling/unmarshalling
> of the envelope... it really depends on the envelope size, like if we were using
> http, you would get killed doing 1000 requests instead of 1 with 1000 keys
> in it.
>
That's cool! :D So if I need to query data split in 10 partitions, for
instance, I can perform the query in parallel by using a multiget, right?
Out of curiosity, if each get will occur on a different node, I would need
to connect to each of the nodes? Or would I query 1 node and it would
communicate to others?
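
The latency arithmetic in Dean's example can be written down as a small sketch. It counts network round trips only and assumes a fixed round-trip time per request. Server-side processing time is left out, and the function names are made up for illustration.

```python
def sequential_gets_ms(num_keys: int, round_trip_ms: float) -> float:
    """Network wait for one blocking get per key, issued one after another."""
    return num_keys * round_trip_ms


def single_multiget_ms(num_keys: int, round_trip_ms: float) -> float:
    """Network wait for one multi-get carrying all keys in a single request."""
    return round_trip_ms


print(sequential_gets_ms(1000, 1.0))  # 1000 gets at 1 ms each
print(single_multiget_ms(1000, 1.0))  # one request, one round trip
```

With 1000 keys and 1 ms per round trip, the first function gives 1000 ms (the "1 second" above) and the second gives 1 ms, before any processing time is added.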


>
> Later,
> Dean
>
> From: Marcelo Elias Del Valle <mvallebr@gmail.com>
> Reply-To: user@cassandra.apache.org
> Date: Sunday, September 23, 2012 10:23 AM
> To: user@cassandra.apache.org
> Subject: Re: Correct model
>
>
> 2012/9/20 aaron morton <aaron@thelastpickle.com>
> I would consider:
>
> # User CF
> * row_key: user_id
> * columns: user properties, key=value
>
> # UserRequests CF
> * row_key: <user_id : partition_start> where partition_start is the start
> of a time partition that makes sense in your domain, e.g. partition
> monthly. You generally want to avoid rows that grow forever; as a rule of
> thumb avoid rows larger than a few tens of MB.
> * columns: two possible approaches:
> 1) If the requests are immutable and you generally want all of the data,
> store the request in a single column using JSON or similar, with the column
> name a timestamp.
> 2) Otherwise use a composite column name of <timestamp : request_property>
> to store the request in many columns.
> * In either case consider using Reversed comparators so the most recent
> columns are first; see
> http://thelastpickle.com/2011/10/03/Reverse-Comparators/
>
> # GlobalRequests CF
> * row_key: partition_start - time partition as above. It may be easier to
> use the same partition scheme.
> * column name: <timestamp : user_id>
> * column value: empty
>
> Ok, I think I understood your suggestion... But the only advantage in this
> solution is to split data among partitions? I understood how it would work,
> but I didn't understand why it's better than the other solution, without
> the GlobalRequests CF
>
> - Select all the requests for a user
> Work out the current partition client side, get the first N columns. Then
> page.
>
> What do you mean here by current partition? You mean I would perform a
> query for each partition? If I want all the requests for the user,
> couldn't I just select all UserRequest records which start with "userId"? I
> might be missing something here, but in my understanding if I use hector to
> query a column family I can do that and Cassandra servers will
> automatically communicate with each other to get the data I need, right? Is
> that bad? I really didn't understand why to use partitions.
>
>
> - Select all the users which have new requests, since date D
> Work out the current partition client side, get the first N columns from
> GlobalRequests, make a multi get call to UserRequests
> NOTE: Assuming the size of the global requests space is not huge.
> Hope that helps.
> For sure it is helping a lot. However, I don't know what a multiget is...
> I saw the hector api reference and found this method, but I am not sure
> what Cassandra would do internally if I do a multiget... Is this expensive
> in terms of performance and latency?
>
> --
> Marcelo Elias Del Valle
> http://mvalle.com - @mvallebr
>



-- 
Marcelo Elias Del Valle
http://mvalle.com - @mvallebr
