From: Maxim Potekhin <potekhin@bnl.gov>
Date: Thu, 17 Nov 2011 20:08:02 -0500
To: user@cassandra.apache.org
Subject: Re: Data Model Design for Login Service

1122: {
          gender: MALE
          birthdate: 1987.11.09
          name: Alfred Tester
          pwd: e72c504dc16c8fcd2fe8c74bb492affa
          alias1: alfred.tester@xyz.de
          alias2: alfred@aad.de
          alias3: alf@dd.de
         }

...and you can use secondary indexes to query on anything.
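
A minimal sketch of that layout in CQL 3 syntax (the table, column and index names are illustrative assumptions):

    -- one row per customer, keyed by the customer ID
    CREATE TABLE users (
        id        bigint PRIMARY KEY,
        gender    text,
        birthdate text,
        name      text,
        pwd       text,   -- MD5 hash of the password
        alias1    text,
        alias2    text,
        alias3    text
    );

    -- secondary indexes make the alias columns queryable
    CREATE INDEX users_alias1_idx ON users (alias1);
    CREATE INDEX users_alias2_idx ON users (alias2);
    CREATE INDEX users_alias3_idx ON users (alias3);

    -- a login lookup then queries an indexed alias column
    SELECT * FROM users WHERE alias1 = 'alfred.tester@xyz.de';

Since a given login could sit in any of the fixed alias columns, a lookup may have to try each indexed column in turn.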

Maxim


On 11/17/2011 4:08 PM, Maciej Miklas wrote:
Hello all,

I need your help designing the structure for a simple login service. It holds about 100,000,000 customers, and each one can have about 10 different logins - this results in about 1,000,000,000 distinct logins.
   
Each customer has the following data:
- one to many login names, each a string of at most 20 UTF-8 characters
- an ID as a long - each customer has exactly one ID
- gender
- birth date
- name
- password as MD5

The login process needs to find the user by login name.
Data in Cassandra is duplicated (denormalized) - this is necessary to obtain all required login data in a single call. We also expect low write traffic and heavy read traffic, so round trips for reading data should be avoided.
Below I've described two possible Cassandra data models based on an example: we have two users, the first user has three logins and the second user has two logins.
  
A) Skinny rows
 - the row key is the login name - this is the main search criterion
 - login data is duplicated - each login is stored as a single row which contains all the user data - 10 logins for a single customer create 10 rows, where each row has a different key and the same content (a CQL sketch follows the example below)

    // the first 3 rows have different keys and the same duplicated data
        alfred.tester@xyz.de {
          id: 1122
          gender: MALE
          birthdate: 1987.11.09
          name: Alfred Tester
          pwd: e72c504dc16c8fcd2fe8c74bb492affa 
        },
        alfred@aad.de {
          id: 1122
          gender: MALE
          birthdate: 1987.11.09
          name: Alfred Tester
          pwd: e72c504dc16c8fcd2fe8c74bb492affa 
        },
        alf@dd.de {
          id: 1122
          gender: MALE
          birthdate: 1987.11.09
          name: Alfred Tester
          pwd: e72c504dc16c8fcd2fe8c74bb492affa 
        },
   
    // the two following rows again have the same data for the second customer
        manfred@xyz.de {
          id: 1133
          gender: MALE
          birthdate: 1997.02.01
          name: Manfredus Maximus
          pwd: e44c504ff16c8fcd2fe8c74bb492adda 
        },
        roberrto@xyz.de {
          id: 1133
          gender: MALE
          birthdate: 1997.02.01
          name: Manfredus Maximus
          pwd: e44c504ff16c8fcd2fe8c74bb492adda 
        }
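
A minimal CQL 3 sketch of variant A, assuming one table keyed directly by the login name (the names are illustrative):

    -- one row per login; the customer data is duplicated into every row
    CREATE TABLE login_by_name (
        login     text PRIMARY KEY,   -- e.g. 'alfred.tester@xyz.de'
        id        bigint,
        gender    text,
        birthdate text,
        name      text,
        pwd       text                -- MD5 hash of the password
    );

    -- the login lookup is a single read by row key
    SELECT id, gender, birthdate, name, pwd
    FROM login_by_name
    WHERE login = 'alfred.tester@xyz.de';

Reads hit exactly one row; the price is that any change to customer data must be written to every row belonging to that customer.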
   
B) Rows grouped by alphabetical prefix
- The number of rows is limited - for example, the row key is the first letter of the login name
- Each row contains all logins which begin with the row key - the row with key 'a' contains all logins which begin with 'a'
- Data might be unbalanced, but we avoid skinny rows - this might have a positive performance impact (??)
- To avoid super columns, each row contains columns directly: the column name is the user login and the column value is the corresponding data in a serialized form (I would like it to be human readable) - a CQL sketch follows the example below

    a {
        alfred.tester@xyz.de: "1122;MALE;1987.11.09;Alfred Tester;e72c504dc16c8fcd2fe8c74bb492affa",
        alfred@aad.de:        "1122;MALE;1987.11.09;Alfred Tester;e72c504dc16c8fcd2fe8c74bb492affa",
        alf@dd.de:            "1122;MALE;1987.11.09;Alfred Tester;e72c504dc16c8fcd2fe8c74bb492affa"
      },

    m {
        manfred@xyz.de:  "1133;MALE;1997.02.01;Manfredus Maximus;e44c504ff16c8fcd2fe8c74bb492adda"
      },

    r {
        roberrto@xyz.de: "1133;MALE;1997.02.01;Manfredus Maximus;e44c504ff16c8fcd2fe8c74bb492adda"
      }
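
A minimal CQL 3 sketch of variant B, assuming one wide row per prefix with one column per login, modelled as a clustering column (the names are illustrative):

    -- one partition per prefix (here the first letter of the login),
    -- one column per login within that partition
    CREATE TABLE login_by_prefix (
        prefix text,    -- e.g. 'a'
        login  text,    -- full login name, e.g. 'alfred@aad.de'
        data   text,    -- the serialized, human-readable value
        PRIMARY KEY (prefix, login)
    );

    -- the lookup still reads a single column: partition key plus clustering column
    SELECT data
    FROM login_by_prefix
    WHERE prefix = 'a' AND login = 'alfred@aad.de';

Note that popular first letters concentrate many logins in one partition, which is the unbalanced-data concern mentioned above.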

Which solution is better, especially for read performance? Do you have a better idea?

Thanks,
Maciej
