cassandra-user mailing list archives

From Shahab Yunus <shahab.yu...@gmail.com>
Subject Re: How to store denormalized data
Date Wed, 03 Jun 2015 14:55:13 GMT
A suggestion, or rather food for thought:

Do you expect to read or analyze the written data right away, or will the
analysis be a batch process kicked off later? If the read/analysis part is
a) a batch process and b) kicked off later, then #3 looks like a fine
solution; I don't see the harm in it. You could also change it slightly (if
applicable): instead of populating the user data in a separate batch process,
make it part of your analysis job, as a kind of pre-processing/prep step.
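
To make the prep-step idea concrete, here is a minimal sketch in plain Python
rather than Spark. The table contents, the field names (user_id, action,
user_type) and the helper functions are all hypothetical, made up for
illustration; in a real job the users snapshot would come from your RDBMS and
the actions from Cassandra.

```python
from collections import Counter

# Hypothetical sample rows: actions carry only the user id, as in option #3.
actions = [
    {"user_id": 1, "action": "generate_report"},
    {"user_id": 2, "action": "generate_report"},
    {"user_id": 1, "action": "generate_report"},
    {"user_id": 3, "action": "login"},
]

# Hypothetical snapshot of the RDBMS users table, loaded once per job run.
users = {
    1: {"user_type": "trader"},
    2: {"user_type": "admin"},
}


def enrich(actions, users):
    """Prep step: attach user_type to each action (None if the user is unknown)."""
    for row in actions:
        user = users.get(row["user_id"])
        yield {**row, "user_type": user["user_type"] if user else None}


def reports_by_user_type(enriched):
    """Analysis step: count generated reports per user type."""
    return Counter(
        row["user_type"]
        for row in enriched
        if row["action"] == "generate_report"
    )


counts = reports_by_user_type(enrich(actions, users))
```

The point is only that the lookup happens once per job run against a snapshot
of the user data, not once per write, so the write path stays as simple as in
#3.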

Regards,
Shahab

On Wed, Jun 3, 2015 at 10:48 AM, Matthew Johnson <matt.johnson@algomi.com>
wrote:

> Hi all,
>
>
>
> I am trying to store some data (user actions in our application) for
> future analysis (probably using Spark). I understand best practice is to
> store it in denormalized form, and this will definitely make some of our
> future queries much easier. But I have a problem with denormalizing the
> data.
>
>
>
> For example, let’s say one of my queries is “the number of reports
> generated by user type”. In the part of the application that the user
> connects to to generate reports, we only have access to the user id. In a
> traditional RDBMS, this is fine, because at query time you join the user id
> onto the users table and get all the user data associated with that user.
> But how do I populate extra fields like user type on the fly?
>
>
>
> My ideas so far:
>
> 1. I try and maintain an in-memory cache of data such as “user”,
> and do a lookup to this cache for every user action and store the user data
> with it. #PROS: fast #CONS: not scalable, will run out of memory if data
> sets grow
>
> 2. For each user action, I do a call to RDBMS and look up the data
> for the user in question, then store the user action plus the user data as
> a single row. #PROS: easy to scale #CONS: slow
>
> 3. I write only the user id and the action straight away, and have
> a separate batch process that periodically goes through my table looking
> for rows without user data, and looks up the user data from RDBMS and
> populates it.
>
>
>
>
>
> None of these solutions seem ideal to me. Does Cassandra have something
> like ‘triggers’, where I can set up a table to automatically populate some
> rows based on a lookup from another table? Or perhaps Spark or some other
> library has built-in functionality that solves exactly this problem?
>
>
>
> Any suggestions much appreciated.
>
>
>
> Thanks,
>
> Matthew
>
>
>
