Mailing-List: contact user-help@cassandra.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@cassandra.apache.org
Received-SPF: neutral (nike.apache.org: local policy)
MIME-Version: 1.0
In-Reply-To: 
 <CAFFf+k94XWpaq24WAq4XNWv-nN3Jz3oYmEsYUxi_-yU3wi24-w@mail.gmail.com>
References: 
 <CAP5HrRbDGQfEB8L=NWfDEcY_61XfQPHb+y5oT3jntvvTmCsAvw@mail.gmail.com>
	<CALxZ3YwS+CyBnxdGo2-COTp8dMhL-YVgDR_AcbJxAY2RsyYTug@mail.gmail.com>
	<CAP5HrRYTHRuNxBAs4wo6yPZn_8M-Uz85KYssXy_ad5k1ustdkw@mail.gmail.com>
	<CAJciDs3zGAu1CptTVbrbDLx4gSX0R69JVOS8aa+u2hz6oy-Dsw@mail.gmail.com>
	<CAFFf+k94XWpaq24WAq4XNWv-nN3Jz3oYmEsYUxi_-yU3wi24-w@mail.gmail.com>
Date: Mon, 19 Mar 2012 01:23:26 -0700
Message-ID: 
 <CA+2nF5a5QcQPbdCvZM4LD6wGZQRkb6Eg0-xet-6KcVwt9VAfHA@mail.gmail.com>
Subject: Re: design that mimics twitter tweet search
From: Chris Goffinet <cg@chrisgoffinet.com>
To: user@cassandra.apache.org
Content-Type: multipart/alternative; boundary=20cf302d4bfcc5cb2d04bb944584

--20cf302d4bfcc5cb2d04bb944584
Content-Type: text/plain; charset=ISO-8859-1

We do not use Cassandra for search. We made modifications to Lucene.

Here is a blog post on our engineering section that talks about what we did:

http://engineering.twitter.com/2011/04/twitter-search-is-now-3x-faster_1656.html


On Sun, Mar 18, 2012 at 11:22 PM, Tharindu Mathew <mccloud35@gmail.com> wrote:

> Sasha,
>
> It depends on how you implement it, I guess... Maybe twitter uses
> Solandra, which is very good at indexing these in different ways but has the
> power of Cassandra underneath...
>
> If you're doing your own implementation of indexing, be mindful that you can
> either break the sentence into four words and index each one, or index the
> whole sentence. The two approaches would produce different results, as they
> can mean completely different things depending on the context.
>
>
> On Mon, Mar 19, 2012 at 7:35 AM, Andrey V. Panov <panov.andy@gmail.com> wrote:
>
>> Why do you suppose they did search on Cassandra?
>>
>>
>> On 19 March 2012 00:16, Sasha Dolgy <sdolgy@gmail.com> wrote:
>>
>>> Yes, but given I have two keywords and want to find all tweets that
>>> have "cassandra" and "bestest", that means retrieving all columns + values
>>> in each row, iterating through both to see whether tweet ids in one exist in
>>> the other, and finishing up with a consolidated list of tweet ids that
>>> exist in both. It just seems clunky to me ... ?
>>>
>>>
>>> On Sun, Mar 18, 2012 at 4:12 PM, Benoit Perroud <benoit@noisette.ch> wrote:
>>>
>>>> The simplest model you could have uses the keyword as the row key, a
>>>> timestamp/time UUID as the column name and the tweetid as the value
>>>>
>>>> -> cf['keyword']['timestamp'] = tweetid
>>>>
>>>> then you do a range query to get all tweetids sorted by time (you may
>>>> want them in reverse order) and you can limit to the number of tweets
>>>> displayed on the page.
>>>>
>>>> As some rows can become large, you could partition the keys by
>>>> concatenating, for instance, the keyword with the month and year.
>>>>
>>>>
>>>> 2012/3/18 Sasha Dolgy <sdolgy@gmail.com>:
>>>> > Hi All,
>>>> >
>>>> > With twitter, when I search for words like: "cassandra is the bestest", 4
>>>> > tweets will appear, including one I just did. My understanding is that the
>>>> > internals of twitter work in that each word in a tweet is allocated,
>>>> > irrespective of the presence of a # hash tag, and the tweet id is assigned
>>>> > to a row for that word. What is puzzling to me, and I am hopeful that some
>>>> > smart people on here can shed some light on it, is how this would work with
>>>> > Cassandra.
>>>> >
>>>> > row [ cassandra ]: key -> tweetid / timestamp
>>>> > row [ bestest ]: key -> tweetid / timestamp
>>>> >
>>>> > I had thought that I could simply pull a list of all column names from each
>>>> > row (representing each word) and flag all occurrences (tweet ids) that
>>>> > exist in each row ... however, these rows would get quite long over time.
>>>> >
>>>> > Am I missing an easier way to get a list of all tweet ids that exist in
>>>> > multiple rows?
>>>> >
>>>> > --
>>>> > Sasha Dolgy
>>>> > sasha.dolgy@gmail.com
>>>>
>>>>
>>>>
>>>> --
>>>> sent from my Nokia 3210
>>>>
>>>
>>>
>>>
>>> --
>>> Sasha Dolgy
>>> sasha.dolgy@gmail.com
>>>
>>
>>
>
>
> --
> Regards,
>
> Tharindu
>
> blog: http://mackiemathew.com/
>
>

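For reference, the keyword-row model and the two-keyword lookup discussed in the quoted thread above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how Twitter search works (as noted, that is Lucene-based) and not a Cassandra client: `cf` is a plain in-memory dict standing in for a column family, and `row_key`, `tokenize`, `index_tweet`, `latest` and `search_all` are hypothetical helper names. Column names are `(timestamp, tweetid)` pairs, a rough stand-in for the uniqueness a time UUID would give.

```python
from collections import defaultdict
from datetime import datetime, timezone

# In-memory stand-in for a column family: row key -> {column name: value}.
# NOT a Cassandra client; it only mirrors cf[key][column] = value.
cf = defaultdict(dict)


def row_key(keyword, ts):
    """Bucket rows by keyword plus year and month so one row does not grow forever."""
    return f"{keyword}:{ts:%Y%m}"


def tokenize(text):
    """Naive whitespace tokenizer (an assumption; real engines do far more)."""
    return set(text.lower().split())


def index_tweet(tweet_id, text, ts):
    """Write one column per distinct word: name = (timestamp, id), value = id."""
    for word in tokenize(text):
        cf[row_key(word, ts)][(ts, tweet_id)] = tweet_id


def latest(keyword, bucket, limit=20):
    """Read one row in reverse time order, capped at `limit` tweet ids."""
    row = cf.get(f"{keyword}:{bucket}", {})
    return [row[col] for col in sorted(row, reverse=True)[:limit]]


def search_all(keywords, bucket, limit=20):
    """Tweets containing every keyword: intersect the rows on the client side."""
    rows = [cf.get(f"{kw}:{bucket}", {}) for kw in keywords]
    if not rows:
        return []
    rows.sort(key=len)  # start from the shortest row to keep the working set small
    common = set(rows[0])
    for row in rows[1:]:
        common.intersection_update(row)  # iterating a dict yields its column names
    return [tweet_id for _, tweet_id in sorted(common, reverse=True)[:limit]]


ts1 = datetime(2012, 3, 18, 16, 0, tzinfo=timezone.utc)
ts2 = datetime(2012, 3, 18, 17, 0, tzinfo=timezone.utc)
index_tweet(1, "cassandra is the bestest", ts1)
index_tweet(2, "cassandra is fast", ts2)
print(search_all(["cassandra", "bestest"], "201203"))  # [1]
print(latest("cassandra", "201203"))  # [2, 1]
```

The intersection still happens in the client, which is the "clunky" part raised in the thread. Starting from the shortest row and bucketing by month only limits how much has to be read; it does not remove that step.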
--20cf302d4bfcc5cb2d04bb944584--