Date: Fri, 18 Nov 2011 07:29:37 -0800
From: Mohit Anchlia
To: user@cassandra.apache.org
Subject: Re: Data Model Design for Login Servie

Secondary indexes in Cassandra are not a good fit for high-cardinality values.

On Fri, Nov 18, 2011 at 7:14 AM, Dan Hendry wrote:
> I think they are not limited to repeating values, but the Datastax docs [1] on
> secondary indexes certainly seem to indicate they would be a poor fit for
> this case (high read load, many unique values).
>
> [1] http://www.datastax.com/docs/1.0/ddl/indexes
>
> Dan
>
> From: Maciej Miklas [mailto:mac.miklas@googlemail.com]
> Sent: November-18-11 1:39
> To: user@cassandra.apache.org
> Subject: Re: Data Model Design for Login Servie
>
> But a secondary index is limited to repeating values such as enums. In my
> case I would have a performance issue, right?
>
> On 18.11.2011, at 02:08, Maxim Potekhin wrote:
>
> 1122: {
>     gender: MALE
>     birthdate: 1987.11.09
>     name: Alfred Tester
>     pwd: e72c504dc16c8fcd2fe8c74bb492affa
>     alias1: alfred.tester@xyz.de
>     alias2: alfred@aad.de
>     alias3: alf@dd.de
> }
>
> ...and you can use secondary indexes to query on anything.
>
> Maxim
>
> On 11/17/2011 4:08 PM, Maciej Miklas wrote:
>
> Hello all,
>
> I need your help to design a structure for a simple login service.
> It contains about 100.000.000 customers, and each one can have about 10
> different logins - this results in 1.000.000.000 different logins.
>
> Each customer contains the following data:
> - one to many login names as string, max 20 UTF-8 characters long
> - ID as long - one customer has only one ID
> - gender
> - birth date
> - name
> - password as MD5
>
> The login process needs to find a user by login name.
> Data in Cassandra is replicated - this is necessary to obtain all required
> login data in a single call. We also expect low write traffic and heavy
> read traffic, so round trips for reading data should be avoided.
> Below I've described two possible Cassandra data models based on an example:
> we have two users, the first user has three logins and the second user has
> two logins.
>
> A) Skinny rows
> - row key contains the login name - this is the main search criterion
> - login data is replicated - each possible login is stored as a single row
>   which contains all user data - 10 logins for a single customer create 10
>   rows, where each row has a different key and the same content
>
>     // first 3 rows have different keys and the same replicated data
>         alfred.tester@xyz.de {
>             id: 1122
>             gender: MALE
>             birthdate: 1987.11.09
>             name: Alfred Tester
>             pwd: e72c504dc16c8fcd2fe8c74bb492affa
>         },
>         alfred@aad.de {
>             id: 1122
>             gender: MALE
>             birthdate: 1987.11.09
>             name: Alfred Tester
>             pwd: e72c504dc16c8fcd2fe8c74bb492affa
>         },
>         alf@dd.de {
>             id: 1122
>             gender: MALE
>             birthdate: 1987.11.09
>             name: Alfred Tester
>             pwd: e72c504dc16c8fcd2fe8c74bb492affa
>         },
>
>     // the two following rows again have the same data, for the second customer
>         manfred@xyz.de {
>             id: 1133
>             gender: MALE
>             birthdate: 1997.02.01
>             name: Manfredus Maximus
>             pwd: e44c504ff16c8fcd2fe8c74bb492adda
>         },
>         roberrto@xyz.de {
>             id: 1133
>             gender: MALE
>             birthdate: 1997.02.01
>             name: Manfredus Maximus
>             pwd: e44c504ff16c8fcd2fe8c74bb492adda
>         }
>
> B) Rows grouped by alphabetical prefix
> - Number of rows is limited - for example, the first letter of the login name
> - Each row contains all logins which begin with the row key - the row with
>   key 'a' contains all logins which begin with 'a'
> - Data might be unbalanced, but we avoid skinny rows - this might have a
>   positive performance impact (??)
> - to avoid super columns, each row directly contains columns, where the
>   column name is the user login and the column value is the corresponding
>   data in a kind of serialized form (I would like it to be human readable)
>
>     a {
>         alfred.tester@xyz.de:"1122;MALE;1987.11.09;
>             Alfred Tester;e72c504dc16c8fcd2fe8c74bb492affa",
>
>         alfred@aad.de:"1122;MALE;1987.11.09;
>             Alfred Tester;e72c504dc16c8fcd2fe8c74bb492affa",
>
>         alf@dd.de:"1122;MALE;1987.11.09;
>             Alfred Tester;e72c504dc16c8fcd2fe8c74bb492affa"
>     },
>
>     m {
>         manfred@xyz.de:"1133;MALE;1997.02.01;
>             Manfredus Maximus;e44c504ff16c8fcd2fe8c74bb492adda"
>     },
>
>     r {
>         roberrto@xyz.de:"1133;MALE;1997.02.01;
>             Manfredus Maximus;e44c504ff16c8fcd2fe8c74bb492adda"
>     }
>
> Which solution is better, especially for read performance? Do you have a
> better idea?
>
> Thanks,
> Maciej
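
To make the two layouts from the quoted message easier to compare, here is a minimal in-memory sketch. It is not Cassandra client code: plain Python dicts stand in for rows and columns, and the helper names (build_skinny_rows, build_prefix_rows, find_by_login_skinny, find_by_login_prefix) are hypothetical, chosen only for illustration. The sample data is taken from the example in the thread.

```python
# Illustrative model only: dicts stand in for Cassandra rows and columns.
# All function names here are hypothetical and not part of any Cassandra API.

USERS = {
    1122: {
        "gender": "MALE",
        "birthdate": "1987.11.09",
        "name": "Alfred Tester",
        "pwd": "e72c504dc16c8fcd2fe8c74bb492affa",
        "logins": ["alfred.tester@xyz.de", "alfred@aad.de", "alf@dd.de"],
    },
    1133: {
        "gender": "MALE",
        "birthdate": "1997.02.01",
        "name": "Manfredus Maximus",
        "pwd": "e44c504ff16c8fcd2fe8c74bb492adda",
        "logins": ["manfred@xyz.de", "roberrto@xyz.de"],
    },
}

FIELDS = ("id", "gender", "birthdate", "name", "pwd")


def build_skinny_rows(users):
    """Option A: one row per login name, each holding a full copy of the user data."""
    rows = {}
    for user_id, user in users.items():
        record = {"id": user_id}
        record.update({field: user[field] for field in FIELDS[1:]})
        for login in user["logins"]:
            rows[login] = dict(record)  # duplicated on purpose (denormalized)
    return rows


def build_prefix_rows(users):
    """Option B: one row per first letter; column name = login, value = serialized data."""
    rows = {}
    for user_id, user in users.items():
        # Assumes no field contains ';' (the separator used in the thread's example).
        value = ";".join([str(user_id)] + [user[field] for field in FIELDS[1:]])
        for login in user["logins"]:
            rows.setdefault(login[0], {})[login] = value
    return rows


def find_by_login_skinny(rows, login):
    """Option A lookup: a single read by row key."""
    return rows.get(login)


def find_by_login_prefix(rows, login):
    """Option B lookup: read one column from the prefix row, then deserialize it."""
    if not login:
        return None
    value = rows.get(login[0], {}).get(login)
    if value is None:
        return None
    parsed = dict(zip(FIELDS, value.split(";")))
    parsed["id"] = int(parsed["id"])
    return parsed


skinny = build_skinny_rows(USERS)
prefix = build_prefix_rows(USERS)
```

In this toy model both lookups return the same record for a given login. The difference is in the shape of the data: option A yields one small row per login (five rows here), while option B yields three rows ('a', 'm', 'r') whose sizes depend on how logins are distributed across first letters.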