incubator-cassandra-user mailing list archives

From "Jack Krupansky" <j...@basetechnology.com>
Subject Re: Primary key question
Date Tue, 01 Jul 2014 22:36:47 GMT
The extra parentheses indicate that the three columns together constitute the “partition
key” – without them, only the first column of the primary key would be the partition key
and the remaining columns would be clustering columns. The partition key determines which
rows are stored contiguously on a single node of the cluster. As written, each of your rows
has a distinct partition key, so the rows might or might not end up on different nodes. With
Jens’ approach, all rows with the same message_source_id would be part of the same partition
(with the same partition key) and stored contiguously on the same node. Since you only have
30,000 rows, it probably doesn’t matter which way you go – organize your data based on how
it is logically structured and how you wish to access it.
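
For illustration, the two layouts discussed in this thread can be put side by side. This is
only a sketch: the second table name, integration_time_by_source, is made up here so that
both definitions can exist in the same keyspace.

```cql
-- Composite partition key: all three columns together form the partition key
-- and there are no clustering columns, so every row is its own partition.
CREATE TABLE integration_time (
    message_source_id uuid,
    traffic_data_type varchar,
    integration_period varchar,
    integration_time timestamp,
    PRIMARY KEY ((message_source_id, traffic_data_type, integration_period))
);

-- Single-column partition key: message_source_id alone is the partition key;
-- traffic_data_type and integration_period are clustering columns, so all rows
-- for one message source share a partition.
-- (integration_time_by_source is a hypothetical name used only for this sketch.)
CREATE TABLE integration_time_by_source (
    message_source_id uuid,
    traffic_data_type varchar,
    integration_period varchar,
    integration_time timestamp,
    PRIMARY KEY (message_source_id, traffic_data_type, integration_period)
);
```

Writing PRIMARY KEY ((message_source_id), traffic_data_type, integration_period) is equivalent
to the second form; the inner parentheses just make the partition key explicit. In both tables
the combination of the three columns identifies exactly one row.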

-- Jack Krupansky

From: Wim Deblauwe 
Sent: Tuesday, July 1, 2014 8:24 AM
To: user@cassandra.apache.org 
Subject: Re: Primary key question

Hi, 

Thanks for the tip, but I never need to query the traffic_data_types and integration_periods
for a single message_source, so I will keep the double-bracket notation for now.

Thanks,

Wim



2014-07-01 12:03 GMT+02:00 Jens Rantil <jens.rantil@tink.se>:

  Hi again, 

  As a follow-up: if you have many `message_source_id`s, you could also do:

  CREATE TABLE integration_time (
      message_source_id uuid,
      traffic_data_type varchar,
      integration_period varchar,
      integration_time timestamp,
      PRIMARY KEY (message_source_id, traffic_data_type, integration_period)
  );

  This might make it easier to query all traffic_data_types and integration_periods
  for a single message_source_id without having to run a heavy query across your whole cluster.
  You'll have the same uniqueness property, but this might, depending on your application, make
  things easier to debug. The flip side is that your cluster could be slightly more unbalanced
  if the number of `integration_time`s varies a lot from one message_source_id to another.
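
  As a rough sketch of the kind of query this layout allows (the UUID and the 'speed' value
  are placeholders, not real data):

  ```cql
  -- All rows for one message source: a single-partition read.
  SELECT traffic_data_type, integration_period, integration_time
  FROM integration_time
  WHERE message_source_id = 123e4567-e89b-12d3-a456-426614174000;

  -- Narrow the result within that partition by the first clustering column.
  SELECT integration_period, integration_time
  FROM integration_time
  WHERE message_source_id = 123e4567-e89b-12d3-a456-426614174000
    AND traffic_data_type = 'speed';
  ```

  With the double-bracket key, neither query can be answered from a single partition, because
  the partition is only determined once all three columns are supplied.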

  Just an idea,
  Jens



  On Tue, Jul 1, 2014 at 8:37 AM, Wim Deblauwe <wim.deblauwe@gmail.com> wrote:

    Hi, 

    I have the following table:

    CREATE TABLE integration_time (
        message_source_id uuid,
        traffic_data_type varchar,
        integration_period varchar,
        integration_time timestamp,
        PRIMARY KEY ((message_source_id, traffic_data_type, integration_period))
    );

    I want the combination of (message_source_id, traffic_data_type, integration_period) to
    be unique. Is this the correct way to do it (with the double brackets)?

    This table will be relatively small; it just stores the last time something was done in
    the application for that unique combination of those 3 parameters. Worst case, there will be
    30,000 rows in that table, and they will always be fetched by querying on the 3 parameters at
    the same time.
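
    A sketch of the intended usage, with placeholder values (the UUID, 'speed' and '5min'
    are made up for illustration):

    ```cql
    -- Writing the same key twice leaves a single row: the second write
    -- overwrites the first, which is what keeps the combination unique.
    INSERT INTO integration_time
        (message_source_id, traffic_data_type, integration_period, integration_time)
    VALUES
        (123e4567-e89b-12d3-a456-426614174000, 'speed', '5min', '2014-07-01 08:00:00+0000');

    INSERT INTO integration_time
        (message_source_id, traffic_data_type, integration_period, integration_time)
    VALUES
        (123e4567-e89b-12d3-a456-426614174000, 'speed', '5min', '2014-07-01 08:05:00+0000');

    -- Fetch by all 3 parameters at the same time.
    SELECT integration_time
    FROM integration_time
    WHERE message_source_id = 123e4567-e89b-12d3-a456-426614174000
      AND traffic_data_type = 'speed'
      AND integration_period = '5min';
    ```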

    regards,

    Wim

