incubator-cassandra-user mailing list archives

From Mohit Anchlia <mohitanch...@gmail.com>
Subject Re: Data Model Design for Login Servie
Date Fri, 18 Nov 2011 15:29:37 GMT
Secondary indexes in Cassandra are not a good fit for high-cardinality values.
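
To illustrate why, here is a toy model in Python. It assumes, as I understand the built-in indexes, that each node indexes only the rows it stores, so a lookup by indexed value cannot be routed to a single node the way a row-key lookup can. The Node and Cluster classes, the crc32 placement, and the node count are all made up for this sketch; none of this is a Cassandra API, and a real cluster may stop early or query nodes in parallel. The alias columns follow Maxim's example quoted further down.

```python
# Toy model only: these classes are hypothetical and are not Cassandra APIs.
from dataclasses import dataclass, field
import zlib


@dataclass
class Node:
    rows: dict = field(default_factory=dict)         # row key -> columns
    alias_index: dict = field(default_factory=dict)  # alias -> row keys on THIS node

    def put(self, key, columns):
        self.rows[key] = columns
        # Index every alias column locally, next to the data it points at.
        for name, value in columns.items():
            if name.startswith("alias"):
                self.alias_index.setdefault(value, set()).add(key)


class Cluster:
    def __init__(self, node_count):
        self.nodes = [Node() for _ in range(node_count)]

    def _owner(self, key):
        # Stand-in for the partitioner: hash the row key to pick one node.
        digest = zlib.crc32(str(key).encode("utf-8"))
        return self.nodes[digest % len(self.nodes)]

    def put(self, key, columns):
        self._owner(key).put(key, columns)

    def get_by_key(self, key):
        """Row-key read: exactly one node is contacted."""
        return self._owner(key).rows.get(key), 1

    def get_by_alias(self, alias):
        """Index read: the alias does not say which node holds the row,
        so this toy consults every node's local index."""
        hits, contacted = [], 0
        for node in self.nodes:
            contacted += 1
            for key in node.alias_index.get(alias, ()):
                hits.append(node.rows[key])
        return hits, contacted
```

With 8 nodes in this toy, get_by_key(1122) contacts 1 node, while get_by_alias("alfred@aad.de") contacts all 8 to find a single row. With about a billion unique logins, almost every index entry points at exactly one row, so the fan-out buys very little per query.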

On Fri, Nov 18, 2011 at 7:14 AM, Dan Hendry <dan.hendry.junk@gmail.com> wrote:
> I think they are not limited to repeating values, but the DataStax docs[1] on
> secondary indexes certainly seem to indicate they would be a poor fit for
> this case (high read load, many unique values).
>
>
>
> [1] http://www.datastax.com/docs/1.0/ddl/indexes
>
>
>
> Dan
>
>
>
> From: Maciej Miklas [mailto:mac.miklas@googlemail.com]
> Sent: November-18-11 1:39
> To: user@cassandra.apache.org
> Subject: Re: Data Model Design for Login Servie
>
>
>
> But a secondary index is limited to repeating values like enums. In my
> case I would have a performance issue, right?
>
> On 18.11.2011, at 02:08, Maxim Potekhin <potekhin@bnl.gov> wrote:
>
> 1122: {
>           gender: MALE
>           birthdate: 1987.11.09
>           name: Alfred Tester
>           pwd: e72c504dc16c8fcd2fe8c74bb492affa
>           alias1: alfred.tester@xyz.de
>           alias2: alfred@aad.de
>           alias3: alf@dd.de
>          }
>
> ...and you can use secondary indexes to query on anything.
>
> Maxim
>
>
> On 11/17/2011 4:08 PM, Maciej Miklas wrote:
>
> Hello all,
>
> I need your help designing a data structure for a simple login service. It holds
> about 100,000,000 customers, and each one can have about 10 different logins,
> which results in about 1,000,000,000 different logins.
>
> Each customer record contains the following data:
> - one to many login names as string, max 20 UTF-8 characters long
> - ID as long - one customer has only one ID
> - gender
> - birth date
> - name
> - password as MD5
>
> The login process needs to find a user by login name.
> Data in Cassandra is replicated; this is necessary to obtain all required
> login data in a single call. We also expect low write traffic and
> heavy read traffic, so round trips for reading data should be avoided.
> Below I've described two possible Cassandra data models based on an example: we
> have two users; the first user has three logins and the second user has two logins.
>
> A) Skinny rows
>  - the row key contains the login name - this is the main search criterion
>  - login data is replicated - each possible login is stored as a single row
> which contains all user data - 10 logins for a single customer create 10 rows,
> where each row has a different key and the same content
>
>     // the first 3 rows have different keys and the same replicated data
>         alfred.tester@xyz.de {
>           id: 1122
>           gender: MALE
>           birthdate: 1987.11.09
>           name: Alfred Tester
>           pwd: e72c504dc16c8fcd2fe8c74bb492affa
>         },
>         alfred@aad.de {
>           id: 1122
>           gender: MALE
>           birthdate: 1987.11.09
>           name: Alfred Tester
>           pwd: e72c504dc16c8fcd2fe8c74bb492affa
>         },
>         alf@dd.de {
>           id: 1122
>           gender: MALE
>           birthdate: 1987.11.09
>           name: Alfred Tester
>           pwd: e72c504dc16c8fcd2fe8c74bb492affa
>         },
>
>     // the following two rows again hold the same data, for the second customer
>         manfred@xyz.de {
>           id: 1133
>           gender: MALE
>           birthdate: 1997.02.01
>           name: Manfredus Maximus
>           pwd: e44c504ff16c8fcd2fe8c74bb492adda
>         },
>         roberrto@xyz.de {
>           id: 1133
>           gender: MALE
>           birthdate: 1997.02.01
>           name: Manfredus Maximus
>           pwd: e44c504ff16c8fcd2fe8c74bb492adda
>         }
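
A minimal sketch of model A in Python, using a plain dict as a stand-in for the column family (this is not a Cassandra client, and the function names are made up for illustration). It shows the trade-off described above: a login is found with one lookup by row key, but every write has to touch all rows of that customer.

```python
# Hypothetical in-memory stand-in for a column family; not a Cassandra client.
def store_customer(table, logins, customer):
    """Write one full copy of the customer data under every login name."""
    for login in logins:
        table[login] = dict(customer)  # separate copy per row, as in model A


def find_by_login(table, login):
    """Single lookup by row key; returns None for an unknown login."""
    return table.get(login)


def change_password(table, logins, new_pwd):
    """An update has to touch every replicated row of the customer."""
    for login in logins:
        table[login]["pwd"] = new_pwd


table = {}
alfred = {
    "id": 1122,
    "gender": "MALE",
    "birthdate": "1987.11.09",
    "name": "Alfred Tester",
    "pwd": "e72c504dc16c8fcd2fe8c74bb492affa",
}
alfred_logins = ["alfred.tester@xyz.de", "alfred@aad.de", "alf@dd.de"]
store_customer(table, alfred_logins, alfred)

print(len(table))                                  # 3 rows for one customer
print(find_by_login(table, "alf@dd.de")["name"])   # Alfred Tester
```

Given the expected low write traffic, the extra writes are probably acceptable; the cost is mostly storage, since the row count equals the number of logins rather than the number of customers.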
>
> B) Rows grouped by alphabetical prefix
> - The number of rows is limited - for example, the row key is the first letter of the login name
> - Each row contains all logins which begin with the row key - the row with key 'a'
> contains all logins which begin with 'a'
> - Data might be unbalanced, but we avoid skinny rows - this might have a
> positive performance impact (??)
> - to avoid super columns, each row contains columns directly, where the column
> name is the user login and the column value is the corresponding data in a kind of
> serialized form (I would like it to be human readable)
>
>     a {
>         alfred.tester@xyz.de:"1122;MALE;1987.11.09;Alfred Tester;e72c504dc16c8fcd2fe8c74bb492affa",
>
>         alfred@aad.de:"1122;MALE;1987.11.09;Alfred Tester;e72c504dc16c8fcd2fe8c74bb492affa",
>
>         alf@dd.de:"1122;MALE;1987.11.09;Alfred Tester;e72c504dc16c8fcd2fe8c74bb492affa"
>       },
>
>     m {
>         manfred@xyz.de:"1133;MALE;1997.02.01;Manfredus Maximus;e44c504ff16c8fcd2fe8c74bb492adda"
>       },
>
>     r {
>         roberrto@xyz.de:"1133;MALE;1997.02.01;Manfredus Maximus;e44c504ff16c8fcd2fe8c74bb492adda"
>       }
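
A minimal sketch of model B in Python, again with nested dicts standing in for rows and columns (not a Cassandra client; all names here are hypothetical). The row key is the first character of the login, the column name is the login, and the column value is the semicolon-separated string from the example. It assumes logins are non-empty and that no field contains a ';'.

```python
# Field order of the serialized column value, taken from the example above.
FIELDS = ("id", "gender", "birthdate", "name", "pwd")


def encode(customer):
    """Serialize a customer dict into the human-readable ';' form."""
    return ";".join(str(customer[name]) for name in FIELDS)


def decode(value):
    """Parse the ';' form back into a dict; the id is restored as an int."""
    parts = value.split(";")
    if len(parts) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(parts)}")
    record = dict(zip(FIELDS, parts))
    record["id"] = int(record["id"])
    return record


def row_key(login):
    """Row key is the first character of the login name."""
    return login[0]


def store_login(table, login, customer):
    """Add one column (login -> serialized data) to the prefix row."""
    table.setdefault(row_key(login), {})[login] = encode(customer)


def find_by_login(table, login):
    """Pick the prefix row, then the column named after the login."""
    value = table.get(row_key(login), {}).get(login)
    return None if value is None else decode(value)
```

Note that a lookup still names both the row key and the column, so it remains a single read. The difference from model A is that a row such as 'a' would collect a very large share of a billion columns, and the value has to be parsed on the client.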
>
> Which solution is better, especially for read performance? Do you
> have a better idea?
>
> Thanks,
> Maciej
>
>
>
