From: Chris Goffinet
To: user@cassandra.apache.org
Reply-To: user@cassandra.apache.org
Date: Mon, 19 Mar 2012 01:23:26 -0700
Subject: Re: design that mimics twitter tweet search
Mailing-List: contact user-help@cassandra.apache.org; run by ezmlm
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary=20cf302d4bfcc5cb2d04bb944584

--20cf302d4bfcc5cb2d04bb944584
Content-Type: text/plain; charset=ISO-8859-1

We do not use Cassandra for search. We made modifications to Lucene. Here
is a blog post on our engineering section that talks about what we did:

http://engineering.twitter.com/2011/04/twitter-search-is-now-3x-faster_1656.html

On Sun, Mar 18, 2012 at 11:22 PM, Tharindu Mathew wrote:

> Sasha,
>
> It depends on the way you implement it, I guess... Maybe Twitter uses
> Solandra, which is very good at indexing these in different ways but has
> the power of Cassandra underneath...
>
> If you're doing your own implementation of indexing, be mindful that you
> can either break the sentence into four words and index each one, or index
> the whole sentence. The two would produce different results, as they can
> mean completely different things based on the context.
>
>
> On Mon, Mar 19, 2012 at 7:35 AM, Andrey V. Panov wrote:
>
>> Why do you suppose they did search on Cassandra?
>>
>>
>> On 19 March 2012 00:16, Sasha Dolgy wrote:
>>
>>> yes -- but given i have two keywords, and want to find all tweets that
>>> have "cassandra" and "bestest" ... that means retrieving all columns +
>>> values in each row, iterating through both to see if tweet ids in one
>>> exist in the other, and finishing up with a consolidated list of tweet
>>> ids that exist in both. just seems clunky to me ... ?
>>>
>>>
>>> On Sun, Mar 18, 2012 at 4:12 PM, Benoit Perroud wrote:
>>>
>>>> The simplest model you could have uses the keyword as key, a
>>>> timestamp/time UUID as column name and the tweetid as value
>>>>
>>>> -> cf['keyword']['timestamp'] = tweetid
>>>>
>>>> then you do a range query to get all tweetids sorted by time (you may
>>>> want them in reverse order) and you can limit to the number of tweets
>>>> displayed on the page.
>>>>
>>>> As some rows can become large, you could use key partitioning by
>>>> concatenating, for instance, the keyword and the month and year.
>>>>
>>>>
>>>> 2012/3/18 Sasha Dolgy :
>>>> > Hi All,
>>>> >
>>>> > With twitter, when I search for words like: "cassandra is the
>>>> > bestest", 4 tweets will appear, including one i just did. My
>>>> > understanding is that the internals of twitter work in that each
>>>> > word in a tweet is allocated, irrespective of the presence of a
>>>> > # hash tag, and the tweet id is assigned to a row for that word.
>>>> > What is puzzling to me, and I am hopeful that some smart people on
>>>> > here can shed some light on it, is how this would work with
>>>> > Cassandra?
>>>> >
>>>> > row [ cassandra ]: key -> tweetid / timestamp
>>>> > row [ bestest ]: key -> tweetid / timestamp
>>>> >
>>>> > I had thought that I could simply pull a list of all column names
>>>> > from each row (representing each word) and flag all occurrences
>>>> > (tweet ids) that exist in each row ... however, these rows would
>>>> > get quite long over time.
>>>> >
>>>> > Am I missing an easier way to get a list of all tweet ids that
>>>> > exist in multiple rows?
>>>> >
>>>> > --
>>>> > Sasha Dolgy
>>>> > sasha.dolgy@gmail.com
>>>>
>>>>
>>>> --
>>>> sent from my Nokia 3210
>>>>
>>>
>>>
>>> --
>>> Sasha Dolgy
>>> sasha.dolgy@gmail.com
>>>
>>
>
> --
> Regards,
>
> Tharindu
>
> blog: http://mackiemathew.com/
>

--20cf302d4bfcc5cb2d04bb944584--
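
A minimal sketch of the model discussed in this thread (keyword as row key, timestamp as column name, tweet id as value, rows partitioned by keyword plus year and month), using plain Python dictionaries in place of a Cassandra column family. The class and method names (`KeywordIndex`, `row_key`, `add_tweet`, `latest`, `search_all`) are hypothetical, the tokenization is a naive whitespace split, and none of this reflects how Twitter search works; per the reply above, that is built on a modified Lucene.

```python
from bisect import insort
from collections import defaultdict
from datetime import datetime, timezone


class KeywordIndex:
    """In-memory stand-in for cf[row_key][timestamp] = tweet_id (assumed layout)."""

    def __init__(self):
        # row key -> list of (timestamp, tweet_id), kept sorted by timestamp
        self.rows = defaultdict(list)

    @staticmethod
    def row_key(keyword, ts):
        # Partition each keyword by year and month so one row does not grow forever.
        return f"{keyword.lower()}:{ts:%Y%m}"

    def add_tweet(self, tweet_id, text, ts):
        # Naive whitespace tokenization: one column per distinct word in the tweet.
        for word in set(text.lower().split()):
            insort(self.rows[self.row_key(word, ts)], (ts, tweet_id))

    def latest(self, keyword, month, limit=20):
        # Reverse-ordered slice of a single row: newest tweets first, capped at limit.
        row = self.rows.get(self.row_key(keyword, month), [])
        return [tweet_id for _, tweet_id in reversed(row)][:limit]

    def search_all(self, keywords, month, limit=20):
        # AND query: read one row per keyword and intersect the tweet ids client-side.
        if not keywords:
            return []
        rows = [self.rows.get(self.row_key(kw, month), []) for kw in keywords]
        common = set.intersection(*({tid for _, tid in row} for row in rows))
        # Reuse the first keyword's row to return the survivors newest first.
        ordered = [tid for _, tid in reversed(rows[0]) if tid in common]
        return ordered[:limit]


def utc(day, hour):
    # Helper for the demo data below; all tweets fall in March 2012.
    return datetime(2012, 3, day, hour, tzinfo=timezone.utc)


index = KeywordIndex()
index.add_tweet(1, "cassandra is the bestest", utc(18, 12))
index.add_tweet(2, "cassandra is fast", utc(18, 13))
index.add_tweet(3, "this cluster is the bestest cassandra setup", utc(19, 9))

march = utc(1, 0)
print(index.latest("cassandra", march))                  # [3, 2, 1]
print(index.search_all(["cassandra", "bestest"], march))  # [3, 1]
```

The intersection in `search_all` happens on the client after reading every column of each row, which is the clunkiness Sasha describes; partitioning by month bounds the row size but does not remove that cost.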