cassandra-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Jérôme Mainaud <jer...@mainaud.com>
Subject Re: Example Data Modelling
Date Sat, 11 Jul 2015 21:56:47 GMT
Hi Carlos,

Columns in primary key like EmpID can't be static.
But remind that EmpID as in the partition key is not duplicated.

-- 
Jérôme Mainaud
jerome@mainaud.com

2015-07-07 16:02 GMT+02:00 Carlos Alonso <info@mrcalonso.com>:

> Hi Jerome,
>
> Good point!! Really a nice usage of static columns! BTW, wouldn't the
> EmpID be static as well?
>
> Cheers
>
> Carlos Alonso | Software Engineer | @calonso <https://twitter.com/calonso>
>
> On 7 July 2015 at 14:42, Jérôme Mainaud <jerome@mainaud.com> wrote:
>
>> Hello,
>>
>> You can slightly adapt Carlos answer to reduce repliation of data that
>> don't change for month to month.
>> Static columns are great for this.
>>
>> The table become:
>>
>> CREATE TABLE salaries (
>>   EmpID varchar,
>>   FN varchar *static*,
>>   LN varchar *static*,
>>   Phone varchar *static*,
>>   Address varchar *static*,
>>   month integer,
>>   basic integer,
>>   flexible_allowance float,
>>   PRIMARY KEY(EmpID, month)
>> )
>>
>> There is only one copy of static column per partition the value is shared
>> between all rows of the partition.
>> When Employee data change you can update it with the partition key in the
>> where clause.
>> When you insert a new month entry you just fill non static columns.
>> The table can be queried the same way as the original one.
>>
>> Cheers
>>
>>
>>
>> --
>> Jérôme Mainaud
>> jerome@mainaud.com
>>
>> 2015-07-07 11:51 GMT+02:00 Rory Bramwell, DevOp Services <
>> rory.bramwell@devopservices.com>:
>>
>>> Hi,
>>>
>>> I've been following this thread and my thoughts are inline with Carlos'
>>> latest response... Model your data to suite your queries. That is one of
>>> the data model / design considerations in Cassandra that differs from the
>>> RDBMS world. Embrace demoralization and data duplication. Disk space is
>>> cheapest, so exploit how your data is laid out in order to optimize for
>>> faster reads (which are more costly than writes).
>>>
>>> Regards,
>>>
>>> Rory Bramwell
>>> Founder and CEO
>>> DevOp Services
>>>
>>> Skype: devopservices
>>> Email: rory.bramwell@devopservices.com
>>> Web: www.devopservices.com
>>> On Jul 7, 2015 4:02 AM, "Carlos Alonso" <info@mrcalonso.com> wrote:
>>>
>>>> I guess you're right, using my proposal, getting last employee's record
>>>> is straightforward and quick, but also, as Peter pointed, getting all slips
>>>> for a particular month requires you to know all the employee IDs and,
>>>> ideally, run a query for each employee. This would work depending on how
>>>> many employees you're managing.
>>>>
>>>> At this moment I'm beginning to feel that maybe using both approaches
>>>> is the best way to go. And I think this is one of Cassandra's
>>>> recommendations: Write your data in several formats if required to fit your
>>>> reads. Therefore I'd use my suggestion for getting a salary by employee ID
>>>> and I'd also have Peter's one to run the end of the month query.
>>>> Does it make sense?
>>>>
>>>> Cheers!
>>>>
>>>> Carlos Alonso | Software Engineer | @calonso
>>>> <https://twitter.com/calonso>
>>>>
>>>> On 7 July 2015 at 09:07, Srinivasa T N <seenutn@gmail.com> wrote:
>>>>
>>>>> Thanks for the inputs.
>>>>>
>>>>> Now my question is how should the app populate the duplicate data,
>>>>> i.e., if I have an employee record (along with his FN, LN,..) for the
month
>>>>> of Apr and later I am populating the same record for the month of may
(with
>>>>> salary changed), should my application first read/fetch the corresponding
>>>>> data for apr and re-insert with modification for month of may?
>>>>>
>>>>> Regards,
>>>>> Seenu.
>>>>>
>>>>> On Tue, Jul 7, 2015 at 11:32 AM, Peer, Oded <Oded.Peer@rsa.com>
wrote:
>>>>>
>>>>>>  The data model suggested isn’t optimal for the “end of month”
query
>>>>>> you want to run since you are not querying by partition key.
>>>>>>
>>>>>> The query would look like “select EmpID, FN, LN, basic from salaries
>>>>>> where month = 1” which requires filtering and has unpredictable
performance.
>>>>>>
>>>>>>
>>>>>>
>>>>>> For this type of query to be fast you can use the “month” column
as
>>>>>> the partition key and the “EmpID” and the clustering column.
>>>>>>
>>>>>> This approach also has drawbacks:
>>>>>>
>>>>>> 1. This data model creates a wide row. Depending on the number of
>>>>>> employees this partition might be very large. You should limit partition
>>>>>> sizes to 25MB
>>>>>>
>>>>>> 2. Distributing data according to month means that only a small
>>>>>> number of nodes will hold all of the salary data for a specific month
which
>>>>>> might cause hotspots on those nodes.
>>>>>>
>>>>>>
>>>>>>
>>>>>> Choose the approach that works best for you.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> *From:* Carlos Alonso [mailto:info@mrcalonso.com]
>>>>>> *Sent:* Monday, July 06, 2015 7:04 PM
>>>>>> *To:* user@cassandra.apache.org
>>>>>> *Subject:* Re: Example Data Modelling
>>>>>>
>>>>>>
>>>>>>
>>>>>> Hi Srinivasa,
>>>>>>
>>>>>>
>>>>>>
>>>>>> I think you're right, In Cassandra you should favor denormalisation
>>>>>> when in RDBMS you find a relationship like this.
>>>>>>
>>>>>>
>>>>>>
>>>>>> I'd suggest a cf like this
>>>>>>
>>>>>> CREATE TABLE salaries (
>>>>>>
>>>>>>   EmpID varchar,
>>>>>>
>>>>>>   FN varchar,
>>>>>>
>>>>>>   LN varchar,
>>>>>>
>>>>>>   Phone varchar,
>>>>>>
>>>>>>   Address varchar,
>>>>>>
>>>>>>   month integer,
>>>>>>
>>>>>>   basic integer,
>>>>>>
>>>>>>   flexible_allowance float,
>>>>>>
>>>>>>   PRIMARY KEY(EmpID, month)
>>>>>>
>>>>>> )
>>>>>>
>>>>>>
>>>>>>
>>>>>> That way the salaries will be partitioned by EmpID and clustered
by
>>>>>> month, which I guess is the natural sorting you want.
>>>>>>
>>>>>>
>>>>>>
>>>>>> Hope it helps,
>>>>>>
>>>>>> Cheers!
>>>>>>
>>>>>>
>>>>>>   Carlos Alonso | Software Engineer | @calonso
>>>>>> <https://twitter.com/calonso>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 6 July 2015 at 13:01, Srinivasa T N <seenutn@gmail.com>
wrote:
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>>    I have basic doubt: I have an RDBMS with the following two tables:
>>>>>>
>>>>>>    Emp - EmpID, FN, LN, Phone, Address
>>>>>>    Sal - Month, Empid, Basic, Flexible Allowance
>>>>>>
>>>>>>    My use case is to print the Salary slip at the end of each month
>>>>>> and the slip contains emp name and his other details.
>>>>>>
>>>>>>    Now, if I want to have the same in cassandra, I will have a single
>>>>>> cf with emp personal details and his salary details.  Is this the
right
>>>>>> approach?  Should we have the employee personal details duplicated
each
>>>>>> month?
>>>>>>
>>>>>> Regards,
>>>>>> Seenu.
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>
>

Mime
View raw message