Mailing-List: contact user-help@cassandra.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@cassandra.apache.org
Received-SPF: pass (nike.apache.org: domain of mohitanchlia@gmail.com
 designates 209.85.210.46 as permitted sender)
MIME-Version: 1.0
In-Reply-To: <4ec6764d.67b4ec0a.51b4.57a5@mx.google.com>
References: 
 <CALk=J59KBbjR+OAa0dTc-3sTx3SpV4daJ4AkzsDLTNciXWqqmQ@mail.gmail.com>
	<14337969.34049.1321598081899.JavaMail.mobile-sync@ynfp3>
	<2135879745477714995@unknownmsgid>
	<4ec6764d.67b4ec0a.51b4.57a5@mx.google.com>
Date: Fri, 18 Nov 2011 07:29:37 -0800
Message-ID: 
 <CAOT3TWorXTFQmcHd2v6-9SrXS_2QULLjA1E2i3SZebTe97ufRg@mail.gmail.com>
Subject: Re: Data Model Design for Login Servie
From: Mohit Anchlia <mohitanchlia@gmail.com>
To: user@cassandra.apache.org
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable

Secondary indexes in Cassandra are not a good fit for high-cardinality values.
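
A rough way to picture it, as a toy Python model rather than real Cassandra client code: the built-in indexes are local to each node, so a query by indexed value has to ask nodes around the ring, while a read by row key goes straight to the node that owns the key. Everything below (ToyCluster, NODE_COUNT, the md5-based partitioner) is made up for illustration, replication is ignored, and the "ask every node" loop is the worst case for a value that occurs only once.

```python
import hashlib

NODE_COUNT = 6  # hypothetical cluster size


def owner(row_key: str) -> int:
    """Toy partitioner: pick the node that stores a row by hashing its key."""
    digest = hashlib.md5(row_key.encode("utf-8")).hexdigest()
    return int(digest, 16) % NODE_COUNT


class ToyCluster:
    def __init__(self):
        # Per node: row_key -> columns, plus a node-local index login -> row_key.
        self.rows = [{} for _ in range(NODE_COUNT)]
        self.local_index = [{} for _ in range(NODE_COUNT)]

    def insert(self, row_key, columns, indexed_logins):
        node = owner(row_key)
        self.rows[node][row_key] = columns
        for login in indexed_logins:
            self.local_index[node][login] = row_key

    def get_by_key(self, row_key):
        """Row-key read: exactly one node is contacted."""
        node = owner(row_key)
        return self.rows[node].get(row_key), 1

    def get_by_index(self, login):
        """Indexed read: the coordinator cannot tell which node holds the
        value, so in this toy model it asks every node's local index."""
        contacted = 0
        found = None
        for node in range(NODE_COUNT):
            contacted += 1
            row_key = self.local_index[node].get(login)
            if row_key is not None:
                found = self.rows[node][row_key]
        return found, contacted


cluster = ToyCluster()
cluster.insert("1122", {"name": "Alfred Tester"},
               ["alfred.tester@xyz.de", "alfred@aad.de", "alf@dd.de"])
by_index, nodes_for_index = cluster.get_by_index("alf@dd.de")
by_key, nodes_for_key = cluster.get_by_key("1122")
```

With roughly one row per login value, every login pays that fan-out for a single hit, which is why a model keyed directly by login name looks more attractive for a read-heavy service.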

On Fri, Nov 18, 2011 at 7:14 AM, Dan Hendry <dan.hendry.junk@gmail.com> wrote:
> I think they are not limited to repeating values, but the DataStax docs[1] on
> secondary indexes certainly seem to indicate they would be a poor fit for
> this case (high read load, many unique values).
>
>
>
> [1] http://www.datastax.com/docs/1.0/ddl/indexes
>
>
>
> Dan
>
>
>
> From: Maciej Miklas [mailto:mac.miklas@googlemail.com]
> Sent: November-18-11 1:39
> To: user@cassandra.apache.org
> Subject: Re: Data Model Design for Login Servie
>
>
>
> But secondary indexes are limited to repeating values like enums. In my
> case I would have a performance issue, right?
>
> On 18.11.2011, at 02:08, Maxim Potekhin <potekhin@bnl.gov> wrote:
>
> 1122: {
>     gender: MALE
>     birthdate: 1987.11.09
>     name: Alfred Tester
>     pwd: e72c504dc16c8fcd2fe8c74bb492affa
>     alias1: alfred.tester@xyz.de
>     alias2: alfred@aad.de
>     alias3: alf@dd.de
> }
>
> ...and you can use secondary indexes to query on anything.
>
> Maxim
>
>
> On 11/17/2011 4:08 PM, Maciej Miklas wrote:
>
> Hello all,
>
> I need your help designing the data structure for a simple login service. It
> contains about 100,000,000 customers, and each one can have about 10 different
> logins - this results in 1,000,000,000 different logins.
>
> Each customer has the following data:
> - one to many login names as string, max 20 UTF-8 characters long
> - ID as long - one customer has only one ID
> - gender
> - birth date
> - name
> - password as MD5
>
> The login process needs to find a user by login name.
> Data in Cassandra is replicated - this is necessary to obtain all required
> login data in a single call. Also, we usually expect low write traffic and
> heavy read traffic - round trips for reading data should be avoided.
> Below I've described two possible Cassandra data models based on an example:
> we have two users; the first user has three logins and the second user has
> two logins.
>
> A) Skinny rows
>  - the row key contains the login name - this is the main search criterion
>  - login data is replicated - each possible login is stored as a single row
> which contains all user data - 10 logins for a single customer create 10 rows,
> where each row has a different key and the same content
>
>     // the first 3 rows have different keys and the same replicated data
>         alfred.tester@xyz.de {
>           id: 1122
>           gender: MALE
>           birthdate: 1987.11.09
>           name: Alfred Tester
>           pwd: e72c504dc16c8fcd2fe8c74bb492affa
>         },
>         alfred@aad.de {
>           id: 1122
>           gender: MALE
>           birthdate: 1987.11.09
>           name: Alfred Tester
>           pwd: e72c504dc16c8fcd2fe8c74bb492affa
>         },
>         alf@dd.de {
>           id: 1122
>           gender: MALE
>           birthdate: 1987.11.09
>           name: Alfred Tester
>           pwd: e72c504dc16c8fcd2fe8c74bb492affa
>         },
>
>     // the two following rows again have the same data, for the second customer
>         manfred@xyz.de {
>           id: 1133
>           gender: MALE
>           birthdate: 1997.02.01
>           name: Manfredus Maximus
>           pwd: e44c504ff16c8fcd2fe8c74bb492adda
>         },
>         roberrto@xyz.de {
>           id: 1133
>           gender: MALE
>           birthdate: 1997.02.01
>           name: Manfredus Maximus
>           pwd: e44c504ff16c8fcd2fe8c74bb492adda
>         }
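
Inline, on option A: here is a minimal Python sketch of how I read this model, with a plain dict standing in for the column family. The Customer class, logins_cf and the function names are hypothetical and only there to show the access pattern: one identical row written per login, and a login lookup that is a single read by row key.

```python
from dataclasses import dataclass, asdict


@dataclass
class Customer:
    id: int
    gender: str
    birthdate: str
    name: str
    pwd: str  # MD5 hex digest, as in the example rows
    logins: list


# Stand-in for a column family: row key (login name) -> columns.
logins_cf = {}


def save_customer(customer: Customer) -> int:
    """Write one identical row per login; return the number of rows written."""
    columns = asdict(customer)
    del columns["logins"]  # the login is the row key, not a column
    for login in customer.logins:
        logins_cf[login] = dict(columns)
    return len(customer.logins)


def find_by_login(login: str):
    """Login lookup is a single read by row key (None if unknown)."""
    return logins_cf.get(login)


alfred = Customer(1122, "MALE", "1987.11.09", "Alfred Tester",
                  "e72c504dc16c8fcd2fe8c74bb492affa",
                  ["alfred.tester@xyz.de", "alfred@aad.de", "alf@dd.de"])
rows_written = save_customer(alfred)
user = find_by_login("alf@dd.de")
```

The cost is on the write side: any change to shared data, a new password for example, has to rewrite every row of that customer, so the application needs to know all of a customer's logins. With low write traffic that seems like an acceptable trade.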
>
> B) Rows grouped by alphabetical prefix
> - The number of rows is limited - for example, the row key is the first letter
> of the login name
> - Each row contains all logins which begin with the row key - the row with
> key 'a' contains all logins which begin with 'a'
> - Data might be unbalanced, but we avoid skinny rows - this might have a
> positive performance impact (??)
> - To avoid super columns, each row directly contains columns, where the column
> name is the user login and the column value is the corresponding data in a
> kind of serialized form (I would like to have it human readable)
>
>     a {
>         alfred.tester@xyz.de:"1122;MALE;1987.11.09;Alfred Tester;e72c504dc16c8fcd2fe8c74bb492affa",
>
>         alfred@aad.de:"1122;MALE;1987.11.09;Alfred Tester;e72c504dc16c8fcd2fe8c74bb492affa",
>
>         alf@dd.de:"1122;MALE;1987.11.09;Alfred Tester;e72c504dc16c8fcd2fe8c74bb492affa"
>     },
>
>     m {
>         manfred@xyz.de:"1133;MALE;1997.02.01;Manfredus Maximus;e44c504ff16c8fcd2fe8c74bb492adda"
>     },
>
>     r {
>         roberrto@xyz.de:"1133;MALE;1997.02.01;Manfredus Maximus;e44c504ff16c8fcd2fe8c74bb492adda"
>     }
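
Inline, on option B: a small Python sketch of the serialized column value and of the row sizes this layout implies. The helper names are made up, a dict stands in for the column family, the ';' format assumes no field ever contains a ';', and the 26-row figure assumes one row per letter a-z, which the original mail does not actually state.

```python
FIELDS = ("id", "gender", "birthdate", "name", "pwd")


def row_key_for(login: str) -> str:
    """Option B row key: the first character of the login name."""
    return login[0]


def serialize(user: dict) -> str:
    """Join the fields with ';' (assumes no field contains ';')."""
    return ";".join(str(user[field]) for field in FIELDS)


def parse(value: str) -> dict:
    """Inverse of serialize; only the id is converted back to an int."""
    user = dict(zip(FIELDS, value.split(";")))
    user["id"] = int(user["id"])
    return user


wide_rows = {}  # stand-in for a column family: row key -> {column name: value}


def save_login(login: str, user: dict) -> None:
    wide_rows.setdefault(row_key_for(login), {})[login] = serialize(user)


def find_by_login(login: str):
    value = wide_rows.get(row_key_for(login), {}).get(login)
    return parse(value) if value is not None else None


alfred = {"id": 1122, "gender": "MALE", "birthdate": "1987.11.09",
          "name": "Alfred Tester", "pwd": "e72c504dc16c8fcd2fe8c74bb492affa"}
save_login("alf@dd.de", alfred)

# Rough sizing with the numbers from the original mail.
total_logins = 1_000_000_000
row_count = 26  # assumption: one row per letter a-z
avg_columns_per_row = total_logins // row_count
```

That averages out to about 38 million columns per row, and more for popular first letters. Since a row is placed by its key, all reads for a given letter land on the same replicas, so I would expect hot spots with this layout rather than a performance gain.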
>
> Which solution is better, especially for read performance? Do you
> have a better idea?
>
> Thanks,
> Maciej
>
>
>