Mailing-List: contact user-help@cassandra.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@cassandra.apache.org
Received-SPF: neutral (nike.apache.org: local policy)
MIME-Version: 1.0
In-Reply-To: <659AF4B5-9C92-4FBA-B20C-00DCDDB82E21@grnoc.iu.edu>
References: <649A15D5-25BF-47A0-B6D0-007EA1C93947@grnoc.iu.edu>
	<AANLkTi=-8xN1tn3zvOLpO6KxyuOm9k3xuWYsrqHxLw-O@mail.gmail.com>
	<F8B9AA28-B484-4275-95F6-3493179484BF@grnoc.iu.edu>
	<AANLkTimXmnKM_J9ak2QLGcvSVAOxqRArM06ZnLAD1Y0Y@mail.gmail.com>
	<659AF4B5-9C92-4FBA-B20C-00DCDDB82E21@grnoc.iu.edu>
Date: Wed, 6 Oct 2010 15:00:35 -0500
Message-ID: <AANLkTi=Hz+N2=wAHObKWyWLHoHQJ-+Toey4FqWe3P19J@mail.gmail.com>
Subject: Re: get keys based on values??
From: Matthew Dennis <mdennis@riptano.com>
To: user@cassandra.apache.org
Content-Type: multipart/alternative; boundary=0015174c1c121340620491f83b36

--0015174c1c121340620491f83b36
Content-Type: text/plain; charset=ISO-8859-1

As jbellis mentioned, the secondary indexes with > will work for this, but
in the meantime you can still index this manually in .6 (which will continue
to work in .7 if need be).

There are several ways to attack this now.  If you don't have too many users
you can have a single row with "age" as the row key, where each column name
is the age of a user and the column value is the row key for that user.
Column names are unique within a row, so if several users can share an age,
append the user id to the age in the column name.  C* will order the columns
in that row by the column name, so you can slice on them to get all the ids
for the users in question.  Keep in mind that this creates one row with an
entry for every user, so if you have lots of users that could be a big row
with all the associated problems.
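
A rough sketch of that single-row index, using a plain in-memory Python
model rather than any Cassandra client API.  The SortedRow class, the
sample users, and the (age, user_id) column name are all made up for
illustration; the composite name is only there to keep column names unique.

```python
import bisect


class SortedRow:
    """In-memory stand-in for one wide row whose columns are sorted by name."""

    def __init__(self):
        self._names = []   # column names, kept in sorted order
        self._values = {}  # column name -> column value

    def insert(self, name, value):
        if name not in self._values:
            bisect.insort(self._names, name)
        self._values[name] = value

    def slice(self, start, count):
        """Return up to `count` (name, value) pairs with name >= start."""
        i = bisect.bisect_left(self._names, start)
        return [(n, self._values[n]) for n in self._names[i:i + count]]


# The single "age" index row: column name = (age, user id), value = user row key.
age_row = SortedRow()
for user_id, age in [("alice", 41), ("bob", 29), ("carol", 34), ("dave", 33)]:
    age_row.insert((age, user_id), user_id)

# Users with age > 33: start the slice at the smallest possible name for age 34.
older = [value for _, value in age_row.slice((34, ""), 1000)]
print(older)
```

With more than 1000 matches you would repeat the slice, starting from the
last column name returned.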

If you have too much data for the above, you can create one row per age,
with the age as the row key and the row ids of the users with that age as
the column names.  Then when you want to query for all the users with
ages > 33 you would start at 34, issue slice calls to each of those rows
with a count of, say, 1000 to get the user ids, then multiget with those ids
to get the users.
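
Roughly, the row-per-age lookup could go like this.  Again this is an
in-memory Python model, not the Cassandra API; slice_row, multiget, the
sample data and the upper bound of 120 on ages are assumptions for the
sake of the example.

```python
from collections import defaultdict

# Stand-in for the users CF: row key -> columns.
users = {
    "alice": {"age": 41},
    "bob": {"age": 29},
    "carol": {"age": 34},
    "dave": {"age": 33},
}

# Stand-in for the index CF: one row per age, user ids as column names.
age_index = defaultdict(set)
for user_id, user in users.items():
    age_index[user["age"]].add(user_id)


def slice_row(row_key, start="", count=1000):
    """Return up to `count` column names after `start`, in sorted order."""
    names = sorted(n for n in age_index.get(row_key, ()) if n > start)
    return names[:count]


def user_ids_older_than(age, max_age=120, count=1000):
    """Walk the index rows age+1 .. max_age, paging through each row."""
    ids = []
    for row_key in range(age + 1, max_age + 1):
        start = ""
        while True:
            page = slice_row(row_key, start, count)
            ids.extend(page)
            if len(page) < count:
                break
            start = page[-1]  # next page begins after the last name seen
    return ids


def multiget(ids):
    """Fetch the user rows for the given row keys."""
    return {i: users[i] for i in ids}


result = multiget(user_ids_older_than(33))
print(result)
```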

If you have too much data for that and/or want better distribution of your
indexes across the cluster, you can add a second level of indexes.  You have
one row with a row key of 33 that contains UUID column names representing
the row keys of other rows, each of which contains ids of users that have
age 33.  Then when you want to look up users with age 33, you query that
top-level row, get all the UUIDs for the other rows, query those to get the
ids of users with that age, and then retrieve them.  When you add a new
user, query that first row to get all the row keys for the rows containing
the user's age and pick one randomly to write to.  This has the obvious
problem of reading before writing, so keep that in mind.
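
A sketch of the two-level layout, again as an in-memory Python model and
not the Cassandra API.  The names (top_level, buckets, create_buckets) and
the choice of four second-level rows for age 33 are made up for
illustration.

```python
import random
import uuid
from collections import defaultdict

top_level = defaultdict(list)  # age -> row keys (UUIDs) of second-level rows
buckets = defaultdict(set)     # second-level row key -> user ids


def create_buckets(age, n):
    """Register n second-level rows for this age in the top-level row."""
    for _ in range(n):
        top_level[age].append(uuid.uuid4())


def add_user(user_id, age, rng=random):
    # Read before write: fetch the second-level row keys for this age first.
    bucket_keys = top_level[age]
    if not bucket_keys:
        bucket_keys.append(uuid.uuid4())
    # Pick one second-level row at random and write the user id there.
    buckets[rng.choice(bucket_keys)].add(user_id)


def users_with_age(age):
    """Query the top-level row, then every second-level row it points to."""
    ids = set()
    for key in top_level.get(age, []):
        ids |= buckets.get(key, set())
    return ids


create_buckets(33, 4)
add_user("dave", 33)
add_user("erin", 33)
add_user("carol", 34)
print(sorted(users_with_age(33)))
```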

As an alternative to a second level of indexing that still splits your index
across multiple rows, you could use multiple hash functions (as many hash
functions as rows you want to split the index into).  Then when given 33,
you pass it through all your different hash functions to get the keys of the
rows that contain ids for users with age 33.
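
Something along these lines, as an in-memory Python model only.  The salted
SHA-1 standing in for "multiple hash functions", the split count of 4, and
writing to one randomly chosen row are my assumptions for the example, not
anything Cassandra provides.

```python
import hashlib
import random
from collections import defaultdict

NUM_SPLITS = 4  # assumed number of index rows per age

index_rows = defaultdict(set)  # index row key -> user ids


def index_row_keys(age, splits=NUM_SPLITS):
    """One 'hash function' per split, modeled as a SHA-1 of the age with a salt."""
    return [
        hashlib.sha1(f"{i}:{age}".encode()).hexdigest()
        for i in range(splits)
    ]


def add_user(user_id, age, rng=random):
    # No read needed: the candidate row keys are computed, not looked up.
    index_rows[rng.choice(index_row_keys(age))].add(user_id)


def users_with_age(age):
    """Query every row the hash functions map this age to."""
    ids = set()
    for key in index_row_keys(age):
        ids |= index_rows.get(key, set())
    return ids


add_user("dave", 33)
add_user("erin", 33)
add_user("carol", 34)
print(sorted(users_with_age(33)))
```

Unlike the two-level approach, adding a user needs no read first, at the
cost of a fixed number of rows per age.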

On Wed, Oct 6, 2010 at 1:49 PM, Brayton Thompson <thompsbp@grnoc.iu.edu> wrote:

> Ok, let me tweak the scenario a tiny bit. What if I wanted something
> extremely arbitrary, for instance... simple comparisons like a WHERE clause
> in SQL....
>
> get Users.someuser['uuid'] where Users.someuser['age']  >  33
>
> From what I've read this functionality defeats the point of Cassandra
> because instead of indexing directly to a value C* would have to go to a
> value and run a check for every entry.  Am I correct here?
>
> So would my best bet be to simply get ALL of my users' uuids and ages, then
> throw away all of those that do not meet the required test?
>
> Thank you.
>
> On Oct 6, 2010, at 2:09 PM, Matthew Dennis wrote:
>
> As Norman said, secondary indexes are only in .7 but you can create
> standard indexes in both .6 and .7
>
> Basically have an email_domain_idx CF where the row key is the domain and
> the column names are the row ids of the users (the column value is unused
> in this scenario).  This sounds basically like what you described in your
> original post.  That's a very common way to do it in Cassandra (C*).  This
> is not all that different from what MySQL, PostgreSQL, etc. do for you
> automatically, just in C* you have to do it manually and remember to write
> to that index column family whenever you write to the users CF.
>
> On Wed, Oct 6, 2010 at 12:56 PM, Brayton Thompson <thompsbp@grnoc.iu.edu> wrote:
>
>> Are secondary indexes available in .6.5? Or are they only in .7?
>>
>> On Oct 6, 2010, at 1:15 PM, Tyler Hobbs wrote:
>>
>> If you're interested in only checking part of a column's value, you can
>> generally just store that part of the value in a different column.  So,
>> have an "email_addr" column and an "email_domain" column, which stores
>> "aol.com", for example.
>>
>> Then you can just use a secondary index on the "email_domain" column.
>>
>> - Tyler
>>
>> On Wed, Oct 6, 2010 at 10:33 AM, Brayton Thompson <thompsbp@grnoc.iu.edu> wrote:
>>
>>> -----BEGIN PGP SIGNED MESSAGE-----
>>> Hash: SHA1
>>>
>>> Ok, I am VERY new to Cassandra and trying to get my head around its core
>>> ideas.
>>>
>>> So let's say I have a CF of Users that contains all the info I would ever
>>> want to know about them. One day I decide (for some reason) that I want to
>>> send a mass email to only the users with AOL email addresses. Is there a
>>> mechanism for getting only keys whose email attribute contains the string
>>> @aol.com? Or is this frowned upon? I could also envision separate CFs
>>> for each email type that store values to use as keys into my Users CF. Say
>>> the AOL CF contains the usernames of everyone that has an AOL account. So I
>>> would pull all of the keys from that CF and then use them to index into the
>>> Users CF to pull their email addresses.  It seems to me that this is
>>> redundant. So I would like your thoughts on my example.
>>>
>>> Thank you,
>>> Brayton Thompson
>>> thompsbp@grnoc.iu.edu
>>> Global Research Network Operation Center
>>> Indiana University
>>> -----BEGIN PGP SIGNATURE-----
>>> Version: GnuPG/MacGPG2 v2.0.14 (Darwin)
>>>
>>> iQIcBAEBAgAGBQJMrJa1AAoJENisXTckM+p9ffcP/1UmNDyWxDnOu41ZRcVwmJiE
>>> +47QxqNc57WmdXX86FUvcauhPFFNZfbrbGwA61sof1sktSOL83osOXQuOfGr5GvT
>>> tulU3+rQ1B+ea0x+aBESbKZwXHxckLGdst2Hro1eCVXEna+VvqkxNJ2rvYzE3hNM
>>> FTNBWDIv3JbOChTYBnycBqg1iG5yMDkc2xEHlaiw9S/VsOPU18pPYrf42eoSqgnk
>>> /rZDCxxiThznuaLI70QnU3O7ZTiyXpavN8BUW6KoeDZNAypgg1AayhEL2d67zZWu
>>> qtnGEpoIeieinjccWMpkUrv2f14CZQ5gbJSLwPdoNLItYLnFvGHg0Ca/hXhrkIDr
>>> BqnA0R5w2YHB+5p84gvj1NTRE0O2kXcUHkLDDBvnlLKUOUkoDyqr5tGAIwHhIwA7
>>> hpko76CyGN84bS8Kma+1D6e8wg9zqfiS9mvvErJCUOwyU5e+XeoiCdyhwgDHJKlW
>>> T5UjMXdAHwyZly48J5l6jEJastHsL1wKAHeV/NlQ1gEx2CmnnJ0lBPDPqlT5Lxdb
>>> uQFzS/YhFzxWL2gApHKF8EdCz4jFbPUggYYPsVgfYkNNBISgcIiQaEIIPkri96vb
>>> V/xhnxLrFCO20NnGQ5PCTzCnZptyc3V+9WI542fnRGcS8SbF+N5BdLzoJBjtidrI
>>> a/Nps/KUhJ5kVzJ0o8H3
>>> =oBhH
>>> -----END PGP SIGNATURE-----
>>>
>>
>>
>>
>
>
> --
> Riptano
> Software and Support for Apache Cassandra
> http://www.riptano.com/
> mdennis@riptano.com
> m: 512.587.0900 f: 866.583.2068
>
>
>


-- 
Riptano
Software and Support for Apache Cassandra
http://www.riptano.com/
mdennis@riptano.com
m: 512.587.0900 f: 866.583.2068

--0015174c1c121340620491f83b36--