incubator-cassandra-user mailing list archives

From Aaron Morton <aa...@thelastpickle.com>
Subject Re: Exactly one wide row per node for a given CF?
Date Tue, 10 Dec 2013 07:36:14 GMT
> Basically this desire all stems from wanting efficient use of memory. 
Do you have any real latency numbers you are trying to tune?

Otherwise this sounds a little like premature optimisation.

Cheers

-----------------
Aaron Morton
New Zealand
@aaronmorton

Co-Founder & Principal Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com

On 5/12/2013, at 6:16 am, onlinespending <onlinespending@gmail.com> wrote:

> Pretty much yes. Although I think it’d be nice if Cassandra handled such a case, I’ve resigned myself to the fact that it cannot at the moment. The workaround will be to partition on the LSB portion of the id (giving 256 rows spread amongst my nodes), which leaves room for scaling, and then cluster each row on geohash or something else.
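
A minimal sketch of that LSB-bucketing workaround, assuming the DataStax Python driver; the keyspace, table, and column names are illustrative, not from the thread:

    from cassandra.cluster import Cluster

    # Connect; assumes a local node and an existing keyspace named "app".
    session = Cluster(["127.0.0.1"]).connect("app")

    # Partition on the id's low byte (256 buckets spread across nodes);
    # cluster within each bucket by geohash so "neighbouring" users are
    # stored contiguously on disk.
    session.execute("""
        CREATE TABLE IF NOT EXISTS users_by_bucket (
            bucket  int,
            geohash text,
            id      bigint,
            data    text,
            PRIMARY KEY ((bucket), geohash, id)
        )
    """)

    def insert_user(user_id, geohash, data):
        session.execute(
            "INSERT INTO users_by_bucket (bucket, geohash, id, data) "
            "VALUES (%s, %s, %s, %s)",
            (user_id & 0xFF, geohash, user_id, data),
        )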
> 
> Basically this desire all stems from wanting efficient use of memory. Frequently accessed keys’ values are kept in RAM through the OS page cache. But the page size is 4KB. This is a problem if you are accessing several small records of data (say 200 bytes each), since each record only occupies a small percentage of a page. This is why it’s important to increase the probability that neighboring data on the disk is relevant. The worst case is reading a full 4KB page into RAM when you only access one record of a couple hundred bytes; all of the other unused data in the page wastefully occupies RAM. Now project this problem onto a collection of millions of small records, all indiscriminately and randomly scattered on the disk, and you can easily see how inefficient your memory usage becomes.
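
To put rough numbers on that waste (a back-of-the-envelope calculation using the 4KB pages and 200-byte records mentioned above):

    # Page-cache utilisation when each 4KB page holds only one useful
    # 200-byte record (records randomly scattered on disk).
    PAGE_SIZE = 4096
    RECORD_SIZE = 200

    print(f"useful fraction per page: {RECORD_SIZE / PAGE_SIZE:.1%}")  # ~4.9%

    # Caching 1 million scattered records this way pins roughly 4GB of
    # page cache, versus roughly 200MB if neighbouring records were
    # also relevant.
    records = 1_000_000
    print(f"scattered: {records * PAGE_SIZE / 2**30:.1f} GiB")    # ~3.8 GiB
    print(f"clustered: {records * RECORD_SIZE / 2**20:.0f} MiB")  # ~191 MiB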
> 
> That’s why it’s best to cluster data in some meaningful way, all in an effort to increase the probability that when one record in a 4KB block is accessed, its neighboring records will also be accessed. This brings me back to the question of this thread. I want to randomly distribute the data amongst the nodes to avoid hot spotting, but within each node I want to cluster the data meaningfully, such that the probability that neighboring data is relevant is increased.
> 
> An example of this would be a huge collection of small records that store basic user information. If you partition on the unique user id, then you’ll get nice random distribution but no ability to cluster (each record would occupy its own row). You could partition on, say, geographical region, but then you’ll end up with hot spotting when one region is more active than another. So ideally you’d like to randomly assign a node to each record to increase parallelism, but then cluster all records on a node by, say, geohash, since it is more likely (however small that likelihood may be) that when one user from a geographical region is accessed, other users from the same region will also need to be accessed. It’s certainly better than having some random user record next to the one you are accessing at the moment.
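
A sketch of what the read side of that layout could look like, reusing the hypothetical users_by_bucket table from above; the prefix-range trick on the geohash clustering column is an assumption, not something from the thread:

    # Read side: users in one geohash prefix range. Region data is
    # spread over all 256 buckets (good for parallelism), but within
    # each bucket the matching rows are contiguous on disk thanks to
    # the clustering order.
    def users_near(session, prefix):
        for bucket in range(256):
            yield from session.execute(
                "SELECT id, geohash, data FROM users_by_bucket "
                "WHERE bucket = %s AND geohash >= %s AND geohash < %s",
                (bucket, prefix, prefix + "\uffff"),
            )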
> 
> 
> 
> 
> On Dec 3, 2013, at 11:32 PM, Vivek Mishra <mishra.vivs@gmail.com> wrote:
> 
>> So basically you want to create a cluster of multiple unique keys, but data which belongs to one unique key should be colocated. Correct?
>> 
>> -Vivek
>> 
>> 
>> On Tue, Dec 3, 2013 at 10:39 AM, onlinespending <onlinespending@gmail.com> wrote:
>> Subject says it all. I want to be able to randomly distribute a large set of records but keep them clustered in one wide row per node.
>> 
>> As an example, let’s say I’ve got a collection of about 1 million records, each with a unique id. If I just go ahead and set the primary key (and therefore the partition key) to the unique id, I’ll get very good random distribution across my server cluster. However, each record will be its own row. I’d like to have each record belong to one large wide row (per server node) so I can have them sorted or clustered on some other column.
>> 
>> If, say, I have 5 nodes in my cluster, I could randomly assign a value from 1 to 5 at creation time and set the partition key to this value. But this becomes troublesome if I add or remove nodes. What I effectively want is to partition on the unique id of the record modulo N (id % N, where N is the number of nodes).
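
The trouble with partitioning on id % N shows up as soon as N changes; a quick illustration (not from the thread) of how many keys would move when going from 5 to 6 nodes:

    # Fraction of keys that change nodes when a mod-N scheme grows
    # from 5 to 6 nodes: almost everything gets reshuffled.
    N_OLD, N_NEW, KEYS = 5, 6, 1_000_000
    moved = sum(k % N_OLD != k % N_NEW for k in range(KEYS))
    print(f"{moved / KEYS:.0%} of keys move")   # ~83%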
>> 
>> I have to imagine there’s a mechanism in Cassandra to simply randomize the partitioning without even using a key (and then clustering on some column).
>> 
>> Thanks for any help.
>> 
> 

