incubator-cassandra-user mailing list archives

From Aaron Morton <aa...@thelastpickle.com>
Subject Re: Exactly one wide row per node for a given CF?
Date Tue, 10 Dec 2013 07:33:56 GMT
> But this becomes troublesome if I add or remove nodes. What effectively I want is to partition
> on the unique id of the record modulus N (id % N; where N is the number of nodes).
This is exactly the problem consistent hashing (used by Cassandra) is designed to solve. If
you hash the key and take it modulo the number of nodes, adding or removing a node forces
most of the data to move.
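
To make the difference concrete, here is a rough Python sketch comparing the two placement schemes when a cluster grows from 5 to 6 nodes. It is only an illustration: the md5-based hash, the node and token naming, and the 64 tokens per node are all assumptions made for the example, not how Cassandra's partitioners are implemented.

```python
import bisect
import hashlib


def h(key: str) -> int:
    # Stable 64-bit hash. md5 is an arbitrary choice for this sketch,
    # not a statement about Cassandra's partitioner.
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")


def modulo_owner(key: str, n_nodes: int) -> int:
    # The "id % N" scheme from the question.
    return h(key) % n_nodes


def build_ring(n_nodes: int, tokens_per_node: int = 64):
    # Each node gets several hypothetical tokens on the ring.
    return sorted(
        (h(f"node{n}-token{t}"), n)
        for n in range(n_nodes)
        for t in range(tokens_per_node)
    )


def ring_owner(key: str, ring) -> int:
    # A key belongs to the first token at or after its hash, wrapping around.
    tokens = [token for token, _ in ring]
    i = bisect.bisect_left(tokens, h(key)) % len(ring)
    return ring[i][1]


def moved_fraction(keys, before, after) -> float:
    # Fraction of keys whose owning node changes.
    return sum(before(k) != after(k) for k in keys) / len(keys)


keys = [f"record-{i}" for i in range(20000)]
ring5, ring6 = build_ring(5), build_ring(6)

modulo_moved = moved_fraction(
    keys, lambda k: modulo_owner(k, 5), lambda k: modulo_owner(k, 6)
)
ring_moved = moved_fraction(
    keys, lambda k: ring_owner(k, ring5), lambda k: ring_owner(k, ring6)
)

print(f"modulo placement:   {modulo_moved:.0%} of keys change node")
print(f"consistent hashing: {ring_moved:.0%} of keys change node")
```

With modulo placement a key stays put only when its hash gives the same remainder for 5 and for 6, which is about one key in six, so most of the data moves. On the ring, only the keys that fall into the new node's token ranges move, and they all move to the new node.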

> I want to be able to randomly distribute a large set of records but keep them clustered
> in one wide row per node.
Sounds like you should revisit your data modelling; this is a pretty well-known anti-pattern.


When rows get above a few tens of MB things can slow down, when they get above 50 MB they
can be a pain, and when they get above 100 MB it’s a warning sign. And when they get above 1 GB,
well, you don’t want to know what happens then.

It’s a bad idea and you should take another look at the data model. If you have to do it,
you can try the ByteOrderedPartitioner, which uses the row key as the token, giving you total
control of the row placement.
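
As a rough illustration of what "row key as token" means, the Python sketch below places keys by comparing their raw bytes against each node's token. The three node names, their tokens and the row keys are made up for the example, and it models only the ring lookup, not replication or anything else Cassandra does.

```python
import bisect

# Hypothetical initial tokens for a three-node ring, kept in sorted order.
node_tokens = [(b"\x55", "node-a"), (b"\xaa", "node-b"), (b"\xff", "node-c")]
tokens = [token for token, _ in node_tokens]


def owner(row_key: bytes) -> str:
    # With a byte-ordered scheme the token is the row key itself, so a node
    # owns every key in (previous token, its own token]; keys past the last
    # token wrap around to the first node.
    i = bisect.bisect_left(tokens, row_key) % len(node_tokens)
    return node_tokens[i][1]


# Choosing the leading byte of each key decides which node holds the row.
placement = {
    key: owner(key)
    for key in (b"\x10wide-row-1", b"\x60wide-row-2", b"\xb0wide-row-3")
}

for key, node in placement.items():
    print(key, "->", node)
```

The flip side of that control is that nothing spreads the load for you: if the keys are not chosen carefully, some nodes end up holding far more data than others.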

Cheers


-----------------
Aaron Morton
New Zealand
@aaronmorton

Co-Founder & Principal Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com

On 4/12/2013, at 8:32 pm, Vivek Mishra <mishra.vivs@gmail.com> wrote:

> So basically you want to create a cluster of multiple unique keys, but data which belongs
> to one unique key should be colocated. Correct?
> 
> -Vivek
> 
> 
> On Tue, Dec 3, 2013 at 10:39 AM, onlinespending <onlinespending@gmail.com> wrote:
> Subject says it all. I want to be able to randomly distribute a large set of records
> but keep them clustered in one wide row per node.
> 
> As an example, let’s say I’ve got a collection of about 1 million records, each with
> a unique id. If I just go ahead and set the primary key (and therefore the partition key)
> as the unique id, I’ll get very good random distribution across my server cluster. However,
> each record will be its own row. I’d like to have each record belong to one large wide row
> (per server node) so I can have them sorted or clustered on some other column.
> 
> If I have, say, 5 nodes in my cluster, I could randomly assign a value of 1 - 5 at the
> time of creation and have the partition key set to this value. But this becomes troublesome
> if I add or remove nodes. What effectively I want is to partition on the unique id of the
> record modulus N (id % N; where N is the number of nodes).
> 
> I have to imagine there’s a mechanism in Cassandra to simply randomize the partitioning
> without even using a key (and then clustering on some column).
> 
> Thanks for any help.
> 

