cassandra-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Tamar Fraenkel <>
Subject Data Model Question
Date Fri, 20 Jan 2012 09:11:54 GMT
I am a newbie to Cassandra and seeking some advice regarding the data model I
should use to best address my needs.
For simplicity, what I want to accomplish is:
I have a system that has users (potentially ~10,000 per day) and they perform
actions in the system (total of ~50,000 a day).
Each User’s action is taking place in a certain point in time, and is also
classified into categories (1 to 5) and tagged by 1-30 tags. Each action’s
Categories and Tags has a score associated with it, the score is between 0 to 1
(let’s assume precision of 0.0001).
I want to be able to identify similar actions in the system (performed usually
by more than one user). Similarity of actions is calculated based on their
common Categories and Tags taking scores into account.
I need the system to store:

The list of my users with attributes like name, age etc
For each action – the categories and tags associated with it and their score,
the time of the action, and the user who performed it.
Groups of similar actions (ActionGroups) – the id’s of actions in the group, the
categories and tags describing the group, with their scores. Those are
calculated using an algorithm that takes into account the categories and tags of
the actions in the group.

When a user performs a new action in the system, I want to add it to a fitting
ActionGroups (with similar categories and tags).
For this I need to be able to perform the following:
Find all the recent ActionGroups (those who were updated with actions performed
during the last T minutes), who has at list one of the new action’s categories
AND at list one of the new action’s tags.
I thought of two ways to address the issue and I would appreciate your insights.
First one using secondary indexes
Column Family:Users
Key: userId
Compare with Bytes Type
Columns: name: <>, age: <> etc…
Column Family:Actions
Key: actionId
Compare with Bytes Type
Columns:  Category1 : <Score> ….
          CategoriN: <Score>,
          Tag1 : <Score>, ….
          Time: timestamp
          user: userId
Column Family:ActionGroups
Key: actionGroupId
Compare with Bytes Type
Columns: Category1 : <Score> ….
         CategoriN: <Score>,
         Tag1 : <Score> ….
         lastUpdateTime: timestamp
         actionId1: null, … ,
         actionIdM: null
I will then define secondary index on each tag columns, category columns, and
the update time column.
Let’s assume the new action I want to add to ActionGroup has NewActionCategory1
- NewActionCategoryK, and has NewActionTag1 – NewActionTagN. I will perform the
following query:
Select  * From ActionGroups where
   (NewActionCategory1 > 0  … or NewActionCategoryK > 0) and
   (NewActionTag1 > 0  … or NewActionTagN > 0) and
   lastUpdateTime > T;
Second solution
Have the same CF as in the first solutionwithout the secondaryindex, and have
two additional CF-ies:
Column Family:CategoriesToActionGroupId
Key: categoryId
Compare with ByteType
Columns: {Timestamp, ActionGroupsId1 } : null
         {Timestamp, ActionGroupsId2} : null
*timestamp is the update time for the ActionGroup
A similar CF will be defined for tags.
I will then be able to run several queries on CategoriesToActionGroupId (one for
each of the new story Categories), with column slice for the right update time
of the ActionGroup.
I will do the same for the TagsToActionGroupId.
I will then use my client code to remove duplicates (ActionGroups who are
associated with more than one Tag or Category).
My questions are:

Are the two solutions viable? If yes, which is better
Is there any better way of doing this?
Can I use jdbc and CQL with both method, or do I have to use Hector (I am using

View raw message