incubator-cassandra-user mailing list archives

From "Dan Hendry" <dan.hendry.j...@gmail.com>
Subject RE: Data Model Design for Login Servie
Date Thu, 17 Nov 2011 22:45:46 GMT
Your first approach, skinny rows, will almost certainly be the better solution, although it never
hurts to experiment for yourself. Even on low-end hardware (for the sake of argument, EC2 m1.smalls),
a few million rows is basically nothing (again though, I encourage you to verify for yourself).
For read-heavy workloads, skinny rows allow more effective use of the key cache and possibly the
row cache. I advise caution when using the row cache, however: I have never found it useful
(in 0.7 and 0.8 at least), as it introduces too much memory pressure for generally random read
workloads. Benchmark against your specific case.

 

Dan

 

From: Maciej Miklas [mailto:mac.miklas@googlemail.com] 
Sent: November-17-11 16:08
To: user@cassandra.apache.org
Subject: Data Model Design for Login Servie

 

Hello all,

I need your help designing the data structure for a simple login service. It contains about 100,000,000
customers, and each one can have about 10 different logins, which results in 1,000,000,000 different
logins.
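
To make the scale concrete, here is a small back-of-the-envelope sketch. The customer and login counts come from the message above; the sample record is one of the example customers further down, written as a semicolon-separated string. The payload figure is only an assumption-laden lower bound: it ignores row keys, column names, timestamps, replication factor and all storage overhead.

```python
# Figures taken from the message above.
customers = 100_000_000
logins_per_customer = 10

# One row (or column) per login name.
total_logins = customers * logins_per_customer

# One example customer record from this thread, as a semicolon-separated string.
sample_record = "1122;MALE;1987.11.09;Alfred Tester;e72c504dc16c8fcd2fe8c74bb492affa"
payload_bytes = len(sample_record.encode("utf-8"))

# Raw payload if every login carries its own copy of the customer data.
# Keys, column names, timestamps and storage overhead are NOT included.
raw_payload_gb = total_logins * payload_bytes / 1e9

print(f"{total_logins:,} logins, roughly {raw_payload_gb:.0f} GB of raw payload")
```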
    
Each customer has the following data:
- one to many login names as string, max 20 UTF-8 characters long
- ID as long - one customer has only one ID
- gender
- birth date
- name
- password as MD5

The login process needs to find a user by login name.
Customer data is replicated in Cassandra: this is necessary to obtain all required login data in a
single call. We also expect low write traffic and heavy read traffic, so round trips for reading
data should be avoided.
Below I've described two possible Cassandra data models based on an example: we have two users;
the first user has three logins and the second user has two logins.
   
A) Skinny rows
 - the row key contains the login name; this is the main search criterion
 - login data is replicated: each possible login is stored as a single row which contains all
user data, so 10 logins for a single customer create 10 rows, where each row has a different key
and the same content

    // the first 3 rows have different keys and the same replicated data
        alfred.tester@xyz.de {
          id: 1122
          gender: MALE
          birthdate: 1987.11.09
          name: Alfred Tester
          pwd: e72c504dc16c8fcd2fe8c74bb492affa  
        },
        alfred@aad.de {
          id: 1122
          gender: MALE
          birthdate: 1987.11.09
          name: Alfred Tester
          pwd: e72c504dc16c8fcd2fe8c74bb492affa  
        },
        alf@dd.de {
          id: 1122
          gender: MALE
          birthdate: 1987.11.09
          name: Alfred Tester
          pwd: e72c504dc16c8fcd2fe8c74bb492affa  
        },
    
    // the following two rows again hold the same data, for the second customer
        manfred@xyz.de {
          id: 1133
          gender: MALE
          birthdate: 1997.02.01
          name: Manfredus Maximus
          pwd: e44c504ff16c8fcd2fe8c74bb492adda  
        },
        roberrto@xyz.de {
          id: 1133
          gender: MALE
          birthdate: 1997.02.01
          name: Manfredus Maximus
          pwd: e44c504ff16c8fcd2fe8c74bb492adda  
        }
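
As a rough illustration of model A, the sketch below uses a plain Python dict as a stand-in for the column family: dict keys play the role of row keys (login names) and each value holds the columns. The function names `store_customer` and `find_by_login` are hypothetical helpers for this example, not part of any Cassandra client API.

```python
# Plain-Python stand-in for model A. A dict plays the role of the column
# family; nothing here talks to a real Cassandra cluster.

def store_customer(rows, logins, customer):
    """Write one full copy of the customer data under every login name."""
    for login in logins:
        rows[login] = dict(customer)  # separate copy per row, as in model A


def find_by_login(rows, login):
    """Single key lookup; returns None when the login is unknown."""
    return rows.get(login)


rows = {}

store_customer(
    rows,
    ["alfred.tester@xyz.de", "alfred@aad.de", "alf@dd.de"],
    {"id": 1122, "gender": "MALE", "birthdate": "1987.11.09",
     "name": "Alfred Tester", "pwd": "e72c504dc16c8fcd2fe8c74bb492affa"},
)
store_customer(
    rows,
    ["manfred@xyz.de", "roberrto@xyz.de"],
    {"id": 1133, "gender": "MALE", "birthdate": "1997.02.01",
     "name": "Manfredus Maximus", "pwd": "e44c504ff16c8fcd2fe8c74bb492adda"},
)

# Login needs exactly one lookup, whichever login name the user types.
user = find_by_login(rows, "alfred@aad.de")
print(user["id"], user["name"])
```

The price of the single-lookup read is on the write side: any change to a customer (a new password, for instance) has to be written to every one of that customer's login rows.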
    
B) Rows grouped by alphabetical prefix
- The number of rows is limited, for example to the first letter of the login name
- Each row contains all logins which begin with the row key: the row with key 'a' contains all logins
which begin with 'a'
- Data might be unbalanced, but we avoid skinny rows; this might have a positive performance
impact (??)
- To avoid super columns, each row directly contains columns, where the column name is the user
login and the column value is the corresponding data in a kind of serialized form (I would like to have
it human readable)

    a {
        alfred.tester@xyz.de:"1122;MALE;1987.11.09;
                                 Alfred Tester;e72c504dc16c8fcd2fe8c74bb492affa",
        
        alfred@aad.de:"1122;MALE;1987.11.09;
                                 Alfred Tester;e72c504dc16c8fcd2fe8c74bb492affa",
            
        alf@dd.de:"1122;MALE;1987.11.09;
                                 Alfred Tester;e72c504dc16c8fcd2fe8c74bb492affa"
      },
            
    m {
        manfred@xyz.de:"1133;MALE;1997.02.01;
                  Manfredus Maximus;e44c504ff16c8fcd2fe8c74bb492adda"    
      },
            
    r {
        roberrto@xyz.de:"1133;MALE;1997.02.01;
                  Manfredus Maximus;e44c504ff16c8fcd2fe8c74bb492adda"    
            
      }
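
Model B can be sketched the same way, again with plain Python dicts standing in for rows and columns. The helper names (`serialize`, `deserialize`, `row_key`, `store_customer`, `find_by_login`) and the field order are assumptions made for this example; the column value is kept on a single line here rather than wrapped as in the listing above.

```python
from collections import defaultdict

# Field order of the semicolon-separated column value (assumed from the example).
FIELDS = ("id", "gender", "birthdate", "name", "pwd")


def serialize(customer):
    """Join the fields into one human-readable string.

    Caveat: a ';' inside any field (e.g. the name) would break parsing.
    """
    return ";".join(str(customer[f]) for f in FIELDS)


def deserialize(value):
    """Split a column value back into a dict, restoring the numeric id."""
    parts = value.split(";")
    if len(parts) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(parts)}")
    record = dict(zip(FIELDS, parts))
    record["id"] = int(record["id"])
    return record


def row_key(login):
    """Row key is the first letter of the login name (login must be non-empty)."""
    return login[0].lower()


def store_customer(rows, logins, customer):
    value = serialize(customer)
    for login in logins:
        rows[row_key(login)][login] = value  # column name = login


def find_by_login(rows, login):
    value = rows.get(row_key(login), {}).get(login)
    return None if value is None else deserialize(value)


rows = defaultdict(dict)
store_customer(rows, ["alfred.tester@xyz.de", "alfred@aad.de", "alf@dd.de"],
               {"id": 1122, "gender": "MALE", "birthdate": "1987.11.09",
                "name": "Alfred Tester", "pwd": "e72c504dc16c8fcd2fe8c74bb492affa"})
store_customer(rows, ["manfred@xyz.de", "roberrto@xyz.de"],
               {"id": 1133, "gender": "MALE", "birthdate": "1997.02.01",
                "name": "Manfredus Maximus", "pwd": "e44c504ff16c8fcd2fe8c74bb492adda"})

# Columns per row: already uneven with five logins.
row_sizes = {key: len(columns) for key, columns in rows.items()}
print(row_sizes)
print(find_by_login(rows, "manfred@xyz.de"))
```

With at most one row per first character, a billion logins end up in a handful of very wide rows, and how evenly they spread depends entirely on which letters login names happen to start with.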

Which solution is better, especially for read performance? Do you have a better idea?

Thanks,
Maciej
