cassandra-user mailing list archives

From aaron morton <aa...@thelastpickle.com>
Subject Re: Data Model Question
Date Sun, 22 Jan 2012 19:53:05 GMT
In general, if you are collecting data over time you should consider partitioning the rows
to avoid creating very large rows. Also, if you have a common request you want to support, consider
modeling it directly rather than using secondary indexes.

Assuming my understanding of the problem is in some way correct, I would consider this
for the Action Groups…

Pick a time partition; this is the minimum time resolution you are interested in (your T minutes).
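
As a rough illustration of picking the partition, the sketch below floors a timestamp to the start of its T-minute window and formats it like the 2011-01-23T08:30 example key. The function name time_partition and the choice to align windows to midnight are assumptions made for illustration; they are not part of Cassandra.

```python
from datetime import datetime, timedelta


def time_partition(ts, minutes):
    """Return the start of the `minutes`-wide window containing `ts`.

    Windows are aligned to midnight of the same day (an assumption), so
    `minutes` should divide 1440 evenly to keep every window the same size.
    """
    midnight = ts.replace(hour=0, minute=0, second=0, microsecond=0)
    # Whole minutes elapsed since midnight.
    elapsed = (ts - midnight) // timedelta(minutes=1)
    start = midnight + timedelta(minutes=(elapsed // minutes) * minutes)
    return start.strftime("%Y-%m-%dT%H:%M")


# An action at 08:47:12 with 30 minute partitions falls in the 08:30 partition.
key = time_partition(datetime(2011, 1, 23, 8, 47, 12), 30)
```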


CF: DimensionUpdates
Stores which dimensions (tags, categories) were updated in the time partition. 

key: <time_partition>, the start of the partition, e.g. 2011-01-23T08:30
col_names: <dimension_name:dimension_value> where <dimension_name> is "tag" or
"category" and <dimension_value> is a value from that domain. e.g. <tag:foo>
col_value: empty


CF: DimensionFacts
Stores the facts that included the dimension in a time partition. 

key: <time_partition:dimension_name:dimension_value> definitions as above.
col_names: ActionGroupID. 
col_values: empty
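
To make the two layouts concrete, here is a minimal in-memory sketch of the write path, with plain Python dicts standing in for the two CFs. The record_update helper and the dict names are hypothetical; a real client would issue the equivalent column inserts through Hector or another driver.

```python
from collections import defaultdict

# Stand-ins for the two CFs: row key -> set of column names (values are empty).
dimension_updates = defaultdict(set)
dimension_facts = defaultdict(set)


def record_update(partition, action_group_id, dimensions):
    """Record that an ActionGroup was updated in `partition`.

    `dimensions` is an iterable of (dimension_name, dimension_value) pairs,
    e.g. ("tag", "foo") or ("category", "news").
    """
    for name, value in dimensions:
        column = f"{name}:{value}"
        # DimensionUpdates: which dimensions were touched in this partition.
        dimension_updates[partition].add(column)
        # DimensionFacts: which ActionGroups carried that dimension.
        dimension_facts[f"{partition}:{column}"].add(action_group_id)


record_update("2011-01-23T08:30", "ag1", [("tag", "foo"), ("category", "news")])
```

Writing the same column twice is harmless here, which mirrors how repeated updates inside one partition simply overwrite the same empty column.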

So to…

> Find all the recent ActionGroups (those that were updated with actions performed during
the last T minutes) that have at least one of the new action’s categories AND at least one
of the new action’s tags. 

1) Query the DimensionUpdates CF with the current time partition as the key, asking for the
columns that match the tags and categories the new action has. 
2) For each column returned from (1) query the rows in DimensionFacts to get the ActionGroups.
3) Filter the unique set of ActionGroups client side. 
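
The three steps above can be sketched against in-memory stand-ins for the two CFs. This is only an illustration: find_recent_groups and the sample data are hypothetical, and intersecting the tag matches with the category matches is my reading of the "at least one category AND at least one tag" requirement.

```python
def find_recent_groups(updates, facts, partition, action_dims):
    """Return ActionGroup ids sharing a tag AND a category with the action.

    `updates` and `facts` map row keys to sets of column names, standing in
    for the DimensionUpdates and DimensionFacts CFs.
    """
    wanted = {f"{name}:{value}" for name, value in action_dims}
    # Step 1: which of the action's dimensions were updated in this partition?
    present = updates.get(partition, set()) & wanted
    # Step 2: read one DimensionFacts row per column returned by step 1.
    by_tag, by_category = set(), set()
    for column in present:
        groups = facts.get(f"{partition}:{column}", set())
        if column.startswith("tag:"):
            by_tag |= groups
        elif column.startswith("category:"):
            by_category |= groups
    # Step 3: client side filtering; sets already remove duplicates.
    return by_tag & by_category


p = "2011-01-23T08:30"
updates = {p: {"tag:foo", "tag:bar", "category:news"}}
facts = {
    f"{p}:tag:foo": {"ag1", "ag2"},
    f"{p}:tag:bar": {"ag3"},
    f"{p}:category:news": {"ag1", "ag3"},
}
action = [("tag", "foo"), ("tag", "zzz"), ("category", "news")]
matches = find_recent_groups(updates, facts, p, action)
```

If "the last T minutes" straddles two partitions, the same function could be called once per partition and the results merged.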


Some notes:
1) Row size in all cases is bounded by the time partition size. This will make your life easier
when it comes to repair and compaction. By default, rows larger than 64MB take a slower two-pass
compaction approach that will cost you IO. 

2) All queries are bounded. Query 1 will only request 1 to 35 columns from a row that contains
0 to 35 columns. Query 2 can be done as either a multi get (a select with lots of KEY clauses)
or a series of multi gets, and can be further bounded by limiting the number of columns in each
request. Queries that ask for a lot of rows at once can harm overall query throughput.
  

3) Overwrites (writing to the same row) are bounded by the time partition. Depending on
load this *may* mean that rows are only physically written to one SSTable. 

4) You will also want to partition the list of actions in an actiongroup.
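
For note 2, one way to keep Query 2 bounded is to split the DimensionFacts row keys into small batches and cap the columns returned per request. The sketch below assumes a hypothetical multiget callable that takes a list of row keys and a column limit and returns a dict of row key to column names; it is not a real client API, and the default batch size and column limit are arbitrary.

```python
def chunked(keys, size):
    """Yield successive slices of at most `size` keys."""
    for start in range(0, len(keys), size):
        yield keys[start:start + size]


def fetch_in_batches(multiget, keys, batch_size=10, column_limit=100):
    """Run one bounded multi get per batch and merge the results.

    `multiget(batch, column_limit)` is a hypothetical callable returning
    {row_key: [column_name, ...]} for the keys in `batch`.
    """
    rows = {}
    for batch in chunked(keys, batch_size):
        rows.update(multiget(batch, column_limit))
    return rows


# Usage with a fake multiget that records the batches it was asked for.
calls = []


def fake_multiget(batch, column_limit):
    calls.append(list(batch))
    return {key: ["ag1"] for key in batch}


result = fetch_in_batches(fake_multiget, ["a", "b", "c", "d", "e"], batch_size=2)
```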

Hope that helps. 


-----------------
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 20/01/2012, at 10:11 PM, Tamar Fraenkel wrote:

> Hi!
> I am a newbie to Cassandra and seeking some advice regarding the data model I should
use to best address my needs.
> 
> For simplicity, what I want to accomplish is:
> 
> I have a system that has users (potentially ~10,000 per day) and they perform actions
in the system (total of ~50,000 a day).
> 
> Each user’s action takes place at a certain point in time, and is also classified
into categories (1 to 5) and tagged with 1-30 tags. Each of an action’s Categories and Tags has
a score associated with it; the score is between 0 and 1 (let’s assume a precision of 0.0001).
> 
> I want to be able to identify similar actions in the system (performed usually by more
than one user). Similarity of actions is calculated based on their common Categories and Tags
taking scores into account.
> 
> I need the system to store:
> 
> The list of my users with attributes like name, age etc
> For each action – the categories and tags associated with it and their score, the time
of the action, and the user who performed it.
> Groups of similar actions (ActionGroups) – the id’s of actions in the group, the
categories and tags describing the group, with their scores. Those are calculated using an
algorithm that takes into account the categories and tags of the actions in the group.
> When a user performs a new action in the system, I want to add it to a fitting ActionGroups
(with similar categories and tags).
> 
> For this I need to be able to perform the following:
> 
> Find all the recent ActionGroups (those that were updated with actions performed during
the last T minutes) that have at least one of the new action’s categories AND at least one
of the new action’s tags.
> 
>  
> I thought of two ways to address the issue and I would appreciate your insights.
> 
>  
> First one using secondary indexes
> 
> Column Family: Users
> 
> Key: userId
> 
> Compare with Bytes Type
> 
> Columns: name: <>, age: <> etc…
> 
>  
> Column Family: Actions
> 
> Key: actionId
> 
> Compare with Bytes Type
> 
> Columns:  Category1 : <Score> ….
> 
>           CategoryN: <Score>,
> 
>           Tag1 : <Score>, ….
> 
>           TagK:<Score>
> 
>           Time: timestamp
> 
>           user: userId
> 
>  
> Column Family: ActionGroups
> 
> Key: actionGroupId
> 
> Compare with Bytes Type
> 
> Columns: Category1 : <Score> ….
> 
>          CategoryN: <Score>,
> 
>          Tag1 : <Score> ….
> 
>          TagK:<Score>
> 
>          lastUpdateTime: timestamp
> 
>          actionId1: null, … ,
> 
>          actionIdM: null
> 
>  
> I will then define a secondary index on each tag column, each category column, and the update
time column.
> 
> Let’s assume the new action I want to add to ActionGroup has NewActionCategory1 - NewActionCategoryK,
and has NewActionTag1 – NewActionTagN. I will perform the following query:
> 
> Select  * From ActionGroups where
> 
>    (NewActionCategory1 > 0  … or NewActionCategoryK > 0) and
> 
>    (NewActionTag1 > 0  … or NewActionTagN > 0) and
> 
>    lastUpdateTime > T;
> 
>  
> Second solution
> 
> Have the same CFs as in the first solution without the secondary indexes, and have two
additional CFs:
> 
> Column Family: CategoriesToActionGroupId
> 
> Key: categoryId
> 
> Compare with BytesType
> 
> Columns: {Timestamp, ActionGroupsId1 } : null
> 
>          {Timestamp, ActionGroupsId2} : null
> 
>          ...
> 
> *timestamp is the update time for the ActionGroup
> 
>  
> A similar CF will be defined for tags.
> 
>  
> I will then be able to run several queries on CategoriesToActionGroupId (one for each
of the new action’s Categories), with a column slice for the right update time of the ActionGroup.
> 
> I will do the same for the TagsToActionGroupId.
> 
> I will then use my client code to remove duplicates (ActionGroups who are associated
with more than one Tag or Category).
> 
>  
> My questions are:
> 
> Are the two solutions viable? If yes, which is better?
> Is there any better way of doing this?
> Can I use jdbc and CQL with both methods, or do I have to use Hector (I am using Java)?
> Thanks
> 
> Tamar
> 
>  
>  

