incubator-cassandra-user mailing list archives

From Mark <>
Subject Re: Design questions/Schema help
Date Tue, 27 Jul 2010 02:02:04 GMT
On 7/26/10 6:06 PM, Dave Viner wrote:
> I'd love to hear others' opinions here... but here are my 2 cents.
> With Cassandra, you need to think of the queries - which you've pretty 
> much done.
> For the most popular queries, you could do something like:
> <ColumnFamily Name="QueriesCounted"
>                 CompareWith="UTF8Type"
>                 />
> And then access it as:
> key-space.QueriesCounted['query-foo-bar'] = $count;
> This makes it easy to get the count for any particular query.  I'm not 
> sure of the best way to store the "top counts" idea.  Perhaps a secondary 
> process which iterates over all the queries, sorts them by count, 
> and then stores them into another ColumnFamily.
> You could use the same idea for the last query (session ids by query)
> <ColumnFamily Name="QueriesRecorded"
>                 CompareWith="UTF8Type"
>                 ColumnType="super"
>                 CompareSubcolumnsWith="TimeUUIDType"
>                 />
> And then access it as:
> key-space.QueriesRecorded['query-foo-bar'][timeuuid] = session-id;
> Actually, if you used that idea (queries-recorded), you could generate 
> the counts and aggregates from that directly in a hadoop 
> post-processing...
> But perhaps others will have better ideas.  If you haven't read 
>, go read 
> it now.  It won't answer your question directly, but will describe the 
> process of modeling a blog in cassandra so you can get a sense of the 
> process.
> Dave Viner
> On Mon, Jul 26, 2010 at 4:46 PM, Mark wrote:
>     We are thinking about using Cassandra to store our search logs.
>     Can someone point me in the right direction/lend some guidance on
>     design? I am new to Cassandra and I am having trouble wrapping my
>     head around some of these new concepts. My brain keeps wanting to
>     go back to an RDBMS design.
>     We will be storing the user query, # of hits returned and their
>     session id. We would like to be able to answer the following
>     questions.
>     - What are the n most popular queries and their counts within the
>     last x (mins/hours/days/etc). Basically the most popular searches
>     within a given time range.
>     - What is the most popular query within the last x where hits = 0.
>     Same as above but with an extra "where" clause
>     - For session id x give me all their other queries
>     - What are all the session ids that searched for 'foos'
>     We accomplish the above functionality w/ MySQL using 2 tables. One
>     for the raw search log information and the other to keep the
>     aggregate/running counts of queries.
>     Would this sort of ad-hoc querying be better implemented using
>     Hadoop + Hive? If so, should I be storing all this information in
>     Cassandra then using Hadoop to retrieve it?
>     Thanks for your suggestions
"Perhaps a secondary process which iterates over all the queries, sorts 
them by count, and then stores them into another ColumnFamily."

- I was trying to avoid this. Is there some sort of atomic increment 
feature available? I guess I could do the same thing we are currently 
doing which is...

a) store full query details into table A
b) query table B for aggregate count of query 'foo' then store count + 1
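
A minimal sketch of steps a) and b), using plain Python dicts as stand-ins for the two tables. All names here are hypothetical and no Cassandra client API is used. It also illustrates why the read-then-write in step b) is not an atomic increment: if two writers read the same count before either stores its result, one increment is lost.

```python
# Plain in-memory structures stand in for the two tables; names are hypothetical.
raw_log = []        # "table A": one entry per search
query_counts = {}   # "table B": query -> running count


def record_search(query, hits, session_id):
    """Steps a) and b): store the raw details, then read the count and store count + 1."""
    # a) store full query details
    raw_log.append({"query": query, "hits": hits, "session_id": session_id})
    # b) read the current aggregate count, then store count + 1
    current = query_counts.get(query, 0)
    query_counts[query] = current + 1
    return query_counts[query]


def lost_update_demo(query):
    """Simulate two writers whose read and write steps interleave."""
    counts = {query: 0}
    seen_by_a = counts[query]      # writer A reads 0
    seen_by_b = counts[query]      # writer B reads 0 before A has written
    counts[query] = seen_by_a + 1  # writer A stores 1
    counts[query] = seen_by_b + 1  # writer B also stores 1, overwriting A
    return counts[query]


record_search("foo", 12, "session-1")
record_search("foo", 0, "session-2")
print(query_counts["foo"])       # 2 when the calls run one after another
print(lost_update_demo("foo"))   # 1, although two increments were attempted
```

The demo interleaves the two writers by hand rather than starting threads, so the lost update is deterministic. With a real store, the same effect would depend on timing between clients.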
