cassandra-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Rory Bramwell, DevOp Services" <rory.bramw...@devopservices.com>
Subject Re: Example Data Modelling
Date Tue, 07 Jul 2015 09:51:48 GMT
Hi,

I've been following this thread and my thoughts are inline with Carlos'
latest response... Model your data to suite your queries. That is one of
the data model / design considerations in Cassandra that differs from the
RDBMS world. Embrace demoralization and data duplication. Disk space is
cheapest, so exploit how your data is laid out in order to optimize for
faster reads (which are more costly than writes).

Regards,

Rory Bramwell
Founder and CEO
DevOp Services

Skype: devopservices
Email: rory.bramwell@devopservices.com
Web: www.devopservices.com
On Jul 7, 2015 4:02 AM, "Carlos Alonso" <info@mrcalonso.com> wrote:

> I guess you're right, using my proposal, getting last employee's record is
> straightforward and quick, but also, as Peter pointed, getting all slips
> for a particular month requires you to know all the employee IDs and,
> ideally, run a query for each employee. This would work depending on how
> many employees you're managing.
>
> At this moment I'm beginning to feel that maybe using both approaches is
> the best way to go. And I think this is one of Cassandra's recommendations:
> Write your data in several formats if required to fit your reads. Therefore
> I'd use my suggestion for getting a salary by employee ID and I'd also have
> Peter's one to run the end of the month query.
> Does it make sense?
>
> Cheers!
>
> Carlos Alonso | Software Engineer | @calonso <https://twitter.com/calonso>
>
> On 7 July 2015 at 09:07, Srinivasa T N <seenutn@gmail.com> wrote:
>
>> Thanks for the inputs.
>>
>> Now my question is how should the app populate the duplicate data, i.e.,
>> if I have an employee record (along with his FN, LN,..) for the month of
>> Apr and later I am populating the same record for the month of may (with
>> salary changed), should my application first read/fetch the corresponding
>> data for apr and re-insert with modification for month of may?
>>
>> Regards,
>> Seenu.
>>
>> On Tue, Jul 7, 2015 at 11:32 AM, Peer, Oded <Oded.Peer@rsa.com> wrote:
>>
>>>  The data model suggested isn’t optimal for the “end of month” query
>>> you want to run since you are not querying by partition key.
>>>
>>> The query would look like “select EmpID, FN, LN, basic from salaries
>>> where month = 1” which requires filtering and has unpredictable performance.
>>>
>>>
>>>
>>> For this type of query to be fast you can use the “month” column as the
>>> partition key and the “EmpID” and the clustering column.
>>>
>>> This approach also has drawbacks:
>>>
>>> 1. This data model creates a wide row. Depending on the number of
>>> employees this partition might be very large. You should limit partition
>>> sizes to 25MB
>>>
>>> 2. Distributing data according to month means that only a small number
>>> of nodes will hold all of the salary data for a specific month which might
>>> cause hotspots on those nodes.
>>>
>>>
>>>
>>> Choose the approach that works best for you.
>>>
>>>
>>>
>>>
>>>
>>> *From:* Carlos Alonso [mailto:info@mrcalonso.com]
>>> *Sent:* Monday, July 06, 2015 7:04 PM
>>> *To:* user@cassandra.apache.org
>>> *Subject:* Re: Example Data Modelling
>>>
>>>
>>>
>>> Hi Srinivasa,
>>>
>>>
>>>
>>> I think you're right, In Cassandra you should favor denormalisation when
>>> in RDBMS you find a relationship like this.
>>>
>>>
>>>
>>> I'd suggest a cf like this
>>>
>>> CREATE TABLE salaries (
>>>
>>>   EmpID varchar,
>>>
>>>   FN varchar,
>>>
>>>   LN varchar,
>>>
>>>   Phone varchar,
>>>
>>>   Address varchar,
>>>
>>>   month integer,
>>>
>>>   basic integer,
>>>
>>>   flexible_allowance float,
>>>
>>>   PRIMARY KEY(EmpID, month)
>>>
>>> )
>>>
>>>
>>>
>>> That way the salaries will be partitioned by EmpID and clustered by
>>> month, which I guess is the natural sorting you want.
>>>
>>>
>>>
>>> Hope it helps,
>>>
>>> Cheers!
>>>
>>>
>>>   Carlos Alonso | Software Engineer | @calonso
>>> <https://twitter.com/calonso>
>>>
>>>
>>>
>>> On 6 July 2015 at 13:01, Srinivasa T N <seenutn@gmail.com> wrote:
>>>
>>> Hi,
>>>
>>>    I have basic doubt: I have an RDBMS with the following two tables:
>>>
>>>    Emp - EmpID, FN, LN, Phone, Address
>>>    Sal - Month, Empid, Basic, Flexible Allowance
>>>
>>>    My use case is to print the Salary slip at the end of each month and
>>> the slip contains emp name and his other details.
>>>
>>>    Now, if I want to have the same in cassandra, I will have a single cf
>>> with emp personal details and his salary details.  Is this the right
>>> approach?  Should we have the employee personal details duplicated each
>>> month?
>>>
>>> Regards,
>>> Seenu.
>>>
>>>
>>>
>>
>>
>

Mime
View raw message