Return-Path: Delivered-To: apmail-cassandra-user-archive@www.apache.org Received: (qmail 49287 invoked from network); 3 Mar 2011 14:09:38 -0000 Received: from hermes.apache.org (HELO mail.apache.org) (140.211.11.3) by minotaur.apache.org with SMTP; 3 Mar 2011 14:09:38 -0000 Received: (qmail 77835 invoked by uid 500); 3 Mar 2011 14:09:36 -0000 Delivered-To: apmail-cassandra-user-archive@cassandra.apache.org Received: (qmail 77735 invoked by uid 500); 3 Mar 2011 14:09:35 -0000 Mailing-List: contact user-help@cassandra.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: user@cassandra.apache.org Delivered-To: mailing list user@cassandra.apache.org Received: (qmail 77720 invoked by uid 99); 3 Mar 2011 14:09:35 -0000 Received: from nike.apache.org (HELO nike.apache.org) (192.87.106.230) by apache.org (qpsmtpd/0.29) with ESMTP; Thu, 03 Mar 2011 14:09:35 +0000 X-ASF-Spam-Status: No, hits=1.5 required=5.0 tests=FREEMAIL_FROM,HTML_MESSAGE,RCVD_IN_DNSWL_LOW,SPF_PASS,T_TO_NO_BRKTS_FREEMAIL X-Spam-Check-By: apache.org Received-SPF: pass (nike.apache.org: domain of vodnok@gmail.com designates 209.85.214.172 as permitted sender) Received: from [209.85.214.172] (HELO mail-iw0-f172.google.com) (209.85.214.172) by apache.org (qpsmtpd/0.29) with ESMTP; Thu, 03 Mar 2011 14:09:28 +0000 Received: by iwl42 with SMTP id 42so1131842iwl.31 for ; Thu, 03 Mar 2011 06:09:07 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=domainkey-signature:mime-version:in-reply-to:references:date :message-id:subject:from:to:content-type; bh=PE7ZLqbNi2gM6tf0zcSBLoQ8hArfFcOoTQSh885HirQ=; b=wNS0xSlyFpVJ8ecEjJHZxHNWcXDHvRI1r6uEt45ttwahHWqYlhC2YkjUzkK8k8SQ8W S1HHX75kRxHQA9LPpnYrqOFXCstT/IDOQj2fjl+OlzamW1my3fxE2/j2Ej1/gJidMjFq q05xhUrqpMaKSydk9WR0UshPkjj1xBcSbUkHw= DomainKey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=gamma; h=mime-version:in-reply-to:references:date:message-id:subject:from:to :content-type; b=YDl9pHDZg0WUbzI2b4k0IYTpp71mztcNbo2ltCeWJ7HSOrAsgpgMcy9i8w/1OqGwhW ZLxpnzsuzZjCfF7VG7yu8PI+R/h+IoctTgZKGrp2YN7q+5s3p6VJ43Cg0k1jdpgJgfqg iAqjm/YCHokNiie+/Qn/L5oTkZxExzG1Qj3q4= MIME-Version: 1.0 Received: by 10.42.219.133 with SMTP id hu5mr1634010icb.154.1299161346604; Thu, 03 Mar 2011 06:09:06 -0800 (PST) Received: by 10.42.213.130 with HTTP; Thu, 3 Mar 2011 06:09:06 -0800 (PST) In-Reply-To: References: Date: Thu, 3 Mar 2011 15:09:06 +0100 Message-ID: Subject: Re: Advice on a design From: Vodnok To: user@cassandra.apache.org Content-Type: multipart/alternative; boundary=20cf3043477e9224ec049d9492ab X-Virus-Checked: Checked by ClamAV on apache.org --20cf3043477e9224ec049d9492ab Content-Type: text/plain; charset=windows-1252 Content-Transfer-Encoding: quoted-printable Ok seems that i'll use Solr (with dedicated Cassandra) for search I've readed this article : http://ria101.wordpress.com/2010/02/22/cassandra-randompartitioner-vs-order= preservingpartitioner/ on RP vs OPP... Here is my case docs_shared{ //docs shared by users ordered by time 'time:id_user:id_doc' { 'time':'123456' //index on it 'id_user':'123' //index on it 'c_type':'BOT' //index on it 'id_doc':'123' //index on it } } So i can list all doc shared by id_user =3D 123 and type =3D'BOT' ordered b= y time.... Well i wanted because i discovered the RP vs OPP issue. I'm default so RP and so row id are not ordered !!! And as it's recommanded, i would like to stay RP So other possibility is addind a dimension with super column as column are ordered in RP index{ docs_shared{ //docs shared by users ordered by time 'time:id_user:id_doc' { 'time':'123456' //index on it 'id_user':'123' //index on it 'c_type':'BOT' //index on it 'id_doc':'123' } } } BUT.... sexondary index is not possible on SC -> C So next possibility is index{ docs_shared_time_c_type_id_user{ //docs shared by users ordered by time:c_type:id_user 'time:c_type:id_user:id_doc' : 'id_doc' } docs_shared_c_type_time_id_user{ //docs shared by users ordered by time:id_user:c_type 'c_type:time:id_user:id_do' : 'id_doc' } ... (there is 6 combinations of time c_type id_user) } Like that i can list with keystart and keyend and filters Example : No filter : index -> time:c_type:id_user Filter on c_type : index -> c_type:time:id_user Filter on id_user : index -> id_user:time:c_type Filter on c_type and id_user : index -> id_user:c_type:time Fortunately cassandra likes writing !!! (Ironic inside) So i have a question : i've readed that secondary index on SC->C will maybe arrive in next releases... Is this information true ? And is it already planned ? Thank you, S=E9bastien, 2011/3/2 Burc Sade > You can use PHP Solr Extension. It is a fully featured and light-weight > client. > > http://www.php.net/manual/en/book.solr.php > > Without the secondary indexes on columns in CFs within SCFs, the best > approach is to create query-specific CFs at the moment. In the end all co= mes > down to how simple you can make your queries to have a minimum CF count. > > Regards, > Burc > > On Wed, Mar 2, 2011 at 9:06 AM, Vodnok wrote: > >> I think too via Solr it'll be easier. Just need to google it. (if you ha= ve >> links about Solr in php...) >> >> I realize that i have to remove some dimension to my CF... >> >> I thought it was possible to have SCF -> CF -> SC -> C:value having >> secondary index on C but has i understood, secondary index on C on super= is >> not possible for now (but will be maybe in next version) >> As i understand it's better to have more less complex CF then less more >> complex CF >> >> Thank you for your reply, >> >> >> >> 2011/3/2 Burc Sade >> >> Hi Vodnok, >>> >>> For tag searches I would use a search engine like Solr (Lucene), as I >>> think it would be more flexible to query. You can update the index as n= ew >>> data comes in and query it for queries #1, #2 and #4. >>> >>> For "All doc of type=3D'BOT' and c_bot_code=3D'ABC'" query, I would cre= ate >>> the CF below. >>> >>> doc_types >>> { >>> 'BOT:ABC': >>> { >>> : >>> } >>> } >>> >>> You can assign a value you are going to need when after querying to the >>> docid. The problem with this schema is that if there are not many >>> type:c_bot_code combinations, there will be many columns under each key= in >>> this CF. If a combination has much much more columns than others, hot s= pot >>> problem may arise. >>> >>> >>> >>> On Tue, Mar 1, 2011 at 11:39 PM, Vodnok wrote: >>> >>>> Hi, >>>> >>>> Totaly newbie on Cassandra (with phpcassa) with big background on >>>> relationned database, i'm would like to use Cassandra for a trivial ca= se. So >>>> i'm on it since 3 days. Sorry for my stupid question. I'm pretty sure = i'm >>>> wrong but i want to learn so i'm here >>>> >>>> >>>> I would like your advise on a design for cassandra. >>>> >>>> >>>> Case: >>>> >>>> - Users created Docs and can share docs with friends >>>> - Users can read and share docs of their friends with other friends >>>> - Docs can be of different type [text;picture;video;etc] >>>> - Docs can be taggued >>>> >>>> >>>> >>>> Typical queries : >>>> >>>> >>>> - Doc relative to tag >>>> - Doc relative to mutiple tags >>>> - Doc readed by user x >>>> - Doc relative to tag and ratio readed_shared greater than x (see >>>> design) >>>> - All doc of type=3D'IMG' favorized by my friend >>>> - All doc of type=3D'BOT' and c_bot_code=3D'ABC' >>>> - All doc of type=3D'BOT' favorized by my friend relative (tag) with >>>> 'fire' and 'belgium' ? >>>> >>>> >>>> >>>> Design : >>>> >>>> >>>> docs // all docs >>>> { >>>> =91123456=92: //id_docs >>>> { >>>> =91t_info=92: >>>> { >>>> 'c_type':'BOT' >>>> 'b_del':'y' >>>> 'b_publish':'y' >>>> } >>>> 't_info_type': >>>> { >>>> 'l_title':'Hello World!' >>>> 'c_bot_code':'ABC' >>>> } >>>> 't_read_user' : //read by user x >>>> { >>>> //time + id_user >>>> '123456789_123':'123' >>>> '123456789_155':'155' >>>> } >>>> 't_shared_user' : //share by user x >>>> { >>>> //time + id_user >>>> '123456789_123':'123' >>>> '123456789_155':'155' >>>> } >>>> 't_tags' >>>> { >>>> 'fire':'fire' >>>> 'belgium':'belgium' >>>> } >>>> 't_stats' >>>> { >>>> 'n_readed':'60' >>>> 'n_shared':'6' >>>> 'n_ratio_readed_shared':'0.1' >>>> } >>>> } >>>> } >>>> >>>> >>>> tags_docs // all tag linked to docs >>>> { >>>> 'fire'://tag >>>> { >>>> //creation_time + id_docs >>>> '456789_123456': >>>> { >>>> 'id_doc':'123456' >>>> 'time':'456789' >>>> } >>>> '456789_223456':'223456': >>>> { >>>> 'id_doc':'123456' >>>> 'time':'456789' >>>> } >>>> '456789_323456':'223456': >>>> { >>>> 'id_doc':'123456' >>>> 'time':'456789' >>>> } >>>> } >>>> 'belgium': >>>> { >>>> ... >>>> } >>>> } >>>> >>>> >>>> users // all users >>>> { >>>> =91123=92: //id_user >>>> { >>>> =91t_info=92: >>>> { >>>> l_name:'Boris' >>>> c_lang=3D'fr' >>>> >>>> } >>>> 't_readed_docs': >>>> { >>>> //time + id_doc >>>> '123456789_123456':'123456' >>>> '123458789_136456':'136456' >>>> } >>>> 't_shared_docs': >>>> { >>>> //time + id_doc >>>> '123456789_123456':'123456' >>>> '123458789_136456':'136456' >>>> } >>>> } >>>> } >>>> >>>> >>>> users_docs // all action by users on docs >>>> { >>>> =91123_123456=92: // id_user + id_doc >>>> { >>>> 'id_doc':'123456' >>>> 'id_user':'123' >>>> 'd_readed':'20110301' >>>> 'd_shared':'20110301' >>>> } >>>> } >>>> >>>> >>>> user_friends_act // all activity of user friends >>>> { >>>> =91123=92:// id_user >>>> { >>>> 't_readed_docs': //all docs readed by my friends >>>> { >>>> '223456_224_123456': // time + id_friend + id_docs >>>> { >>>> 'id_friend':'224' >>>> 'id_docs':'123456' >>>> 'time':'223456' >>>> 'c_type=3D'BOT' >>>> } >>>> } >>>> 't_shared_docs': //all docs shared by my friends >>>> { >>>> '223456_224_123456': // time + id_friend + id_docs >>>> { >>>> 'id_friend':'224' >>>> 'id_docs':'123456' >>>> 'time':'223456' >>>> 'c_type=3D'BOT' >>>> } >>>> } >>>> } >>>> } >>>> >>>> >>>> >>>> I know that certain queries are not possible for now like : - All doc = of >>>> type=3D'BOT' favorized by my friend relative (tag) with 'fire' and 'be= lgium' ? >>>> >>>> >>>> >>>> What do you think ? >>>> >>>> >>>> Thank you, >>>> >>>> >>>> Vodnok, >>>> >>>> >>>> (Please remember i'm on cassandra since 3 days) >>>> >>> >>> >> > --20cf3043477e9224ec049d9492ab Content-Type: text/html; charset=windows-1252 Content-Transfer-Encoding: quoted-printable
Ok seems that i'll us= e Solr (with dedicated Cassandra) for search



Here is my case


docs_shared{ //docs shared by users ordered by time
=A0=A0 =A0'time:id_user:id_doc'=A0
=A0=A0 =A0{
=A0=A0 =A0 =A0 =A0'time':'123456' //index on it
=A0=A0 =A0 =A0 =A0'id_user':'123' //index on it
<= div>=A0=A0 =A0 =A0 =A0'c_type':'BOT' //index on it=A0=A0 =A0 =A0 =A0'id_doc':'123' //index on it =A0 =A0 = =A0
=A0=A0 =A0}
}=A0

So i can list all doc shared by id_user =3D 123 and typ= e =3D'BOT' ordered by time....

Well i want= ed because i discovered the RP vs OPP issue. I'm default so RP and so r= ow id are not ordered !!! And as it's recommanded, i would like to stay= RP

So other possibility is addind a dimension with super c= olumn as column are ordered in RP

index{
docs_shared{ //docs shared by users ordered by time
=A0=A0 =A0&#= 39;time:id_user:id_doc'=A0
=A0=A0 =A0{
=A0=A0 =A0 =A0 =A0'time':'123456'= ; //index on it
=A0=A0 =A0 =A0 =A0'id_user':'123'= //index on it
=A0=A0 =A0 =A0 =A0'c_type':'BOT' /= /index on it
=A0=A0 =A0 =A0 =A0'id_doc':'123'=A0<= /div>
=A0=A0 =A0}
}=A0
}

BUT...= . sexondary index is not possible on SC -> C

So next possibility is

index{
docs_shared_time_c_type_id_user{ //docs shared by users ordered by time:= c_type:id_user
=A0=A0 =A0'time:c_type:id_user:id_doc' : 'id_doc'
}=A0
docs_shared_c_type_time_id_user{ //docs shared by user= s ordered by time:id_user:c_type
=A0=A0 =A0'c_type:time:id_us= er:id_do' : 'id_doc'
}=A0
... (there is 6 combinations of time c_type id_user)
}

Like that i can list with keystart and k= eyend and filters

Example :

No filter : index -> time:c_type:id_user
Filter on c_type= : =A0index -> c_type:time:id_user
Filter on id_user : =A0inde= x -> id_user:time:c_type
Filter on c_type and id_user : index = -> id_user:c_type:time

Fortunately cassandra likes writing !!! (Ironic inside)=


So i have a question : i've re= aded that secondary index on SC->C will maybe arrive in next releases...= Is this information true ? And is it already planned ?


Thank you,

S=E9= bastien,

2011/3/2 Burc Sade <burcsade@gmail.com<= /a>>
You can use PHP Solr Extension. It is a ful= ly featured and light-weight client.


Without the secondary indexes on columns in CFs within = SCFs, the best approach is to create query-specific CFs at the moment. In t= he end all comes down to how simple you can make your queries to have a min= imum CF count.

Regards,
Burc

On Wed, Mar 2, 2011 at 9:06 AM,= Vodnok <vodnok@gmail.com> wrote:
I think too via Solr it'll be easier. Just need to google it. (if = you have links about Solr in php...)

I realize tha= t i have to remove some dimension to my CF...

I=A0thought it was possible to have SCF -> CF -> SC -&g= t; C:value having secondary index on C but has i understood, secondary inde= x on C on super is not possible for now (but will be maybe in next version)=
As i understand it's= better to have more less complex CF then less more complex CF
=

Thank you for your reply,

<= /span>


2011/3/2 Burc Sade <burcsade@gmail.com>

Hi Vodnok,

For tag search= es I would use a search engine like Solr (Lucene), as I think it would be m= ore flexible to query. You can update the index as new data comes in and qu= ery it for queries #1, #2 and #4.

For "All doc of type=3D'BOT' and c_bot_cod= e=3D'ABC'" query, I would create the CF below.

<= /div>
doc_types
{
=A0=A0 'BOT:ABC':
=A0=A0{
=A0=A0 =A0<docid>: <creation_date?>=A0
=A0=A0}= =A0
}

You can assign a value you are goi= ng to need when after querying to the docid. The problem with this schema i= s that if there are not many type:c_bot_code combinations, there will be ma= ny columns under each key in this CF. If a combination has much much more c= olumns than others, hot spot problem may arise.



On Tue, Mar 1, 2011 at 11:39 PM, Vodnok <vodnok@gmail.com> = wrote:
Hi,

Totaly newbi= e on Cassandra (with phpcassa) with big background on relationned database,= i'm would like to use Cassandra for a trivial case. So i'm on it s= ince 3 days. Sorry for my stupid question. I'm pretty sure i'm wron= g but i want to learn so i'm here


I would like your advise on a design for= cassandra.


Case:
<= div>
- Users created Docs and can share docs with friends
- Users can read and share docs of their friends with other friends
- Docs can be of different type [text;picture;video;etc]
= - Docs can be taggued



Typical queries :


- Doc relati= ve to tag
- Doc relative to mutiple tags
- Doc readed b= y user x
- Doc relative to tag and ratio readed_shared greater th= an x (see design)
- All doc of type=3D'IMG' favorized by my friend
- A= ll doc of type=3D'BOT' and c_bot_code=3D'ABC'
- A= ll doc of type=3D'BOT' favorized by my friend relative (tag) with &= #39;fire' and 'belgium' ?



Design :


docs // all docs
{
=A0=A0 =A0= =91123456=92: //id_docs
=A0=A0 =A0{
=A0=A0 =A0 =A0 =A0= =91t_info=92:=A0
{<= /div>
'c_type':'B= OT'
'b_del= ':'y'
= 'b_publish':'y'
}
't_info_type':
{
'l_title':'= Hello World!'
= 'c_bot_code':'ABC'
}
't_read_user' : //read by user x=
{
//time + id_user
<= div> '123456789_123':= '123'
'= ;123456789_155':'155'
}
't_shared_user' : //share by use= r x
{
//time + id_user
<= div> '123456789_123':= '123'
'= ;123456789_155':'155'
}
't_tags'
{
'fire':'fir= e'
'belgiu= m':'belgium'
<= /span>}
't_stats'
<= div> {
'n_readed':'60'
'n_shared':'= ;6'
'n_rat= io_readed_shared':'0.1' =
}
}
}

tags_docs // all tag linked to docs
{
'fire'= ;://tag
{
= //creation_time + id_docs
'456789_123456':=
{
'id_doc':'123456'<= /div>
'time':'456= 789'
}
'456789_223456':= 9;223456':
{
'id_doc':'123456'
=
'time':'456= 789'
}
'456789_323456':'223456'= :
{
'id_doc':'1= 23456'
'ti= me':'456789'
<= /span>}
}
'belgium':
{
...
} <= /span>
}


users // all users
{
=A0=A0 =A0=91123=92: //id_user
=A0=A0 =A0{
=A0=A0 =A0 = =A0 =A0=91t_info=92:=A0
{
l_name:'Boris'<= /div>
c_lang=3D'fr&#= 39;

= }
't_readed_docs':=
{
//time + id_doc
'123456789_123456&#= 39;:'123456'
'123458789_136456':'136456'
}
't_shared_docs':
{
//time + id_doc
'123456789_123456'= ;:'123456'
'123458789_136456':'136456'
}
}
}


users_docs // all action by = users on docs
{
=A0=A0 =A0=91123_123456=92: // id_user = + id_doc
=A0=A0 =A0{
'id_doc':'123456'
'id_user':'1= 23'
'd_read= ed':'20110301'
= 'd_shared':'20110301'
}
}
<= br>

user_friends_act // all activity of user frien= ds
{
=A0=A0 =A0=91123=92:// id_user
=A0=A0 =A0{
't_= readed_docs': //all docs readed by my friends
{
'223456_224_123456': // time + id_friend + id_docs
{
'id_friend':'224'
'id_docs':= 9;123456'
'time':'22= 3456'
'c_= type=3D'BOT'
}
}
't_shared_docs': //all docs shared by my friends<= /div>
{
'223456_224_123456': // time + = id_friend + id_docs
{
'id_friend':&#= 39;224'
'= id_docs':'123456'
'time':'22= 3456'
'c_= type=3D'BOT'
}
}
}
}



I know that c= ertain queries are not possible for now like : - All doc of type=3D'BOT= ' favorized by my friend relative (tag) with 'fire' and 'be= lgium' ?



What do you think ?
=


Thank you,


=
Vodnok,


(Please remember= i'm on cassandra since 3 days)




--20cf3043477e9224ec049d9492ab--