cassandra-user mailing list archives

From Mark <static.void....@gmail.com>
Subject Re: Design questions/Schema help
Date Tue, 27 Jul 2010 02:02:04 GMT
On 7/26/10 6:06 PM, Dave Viner wrote:
> I'd love to hear others' opinions here... but here are my 2 cents.
>
> With Cassandra, you need to think of the queries - which you've pretty 
> much done.
>
> For the most popular queries, you could do something like:
>
> <ColumnFamily Name="QueriesCounted"
>                 ComparesWith="UTF8Type"
>                 />
> And then access it as:
> key-space.QueriesCounted['query-foo-bar'] = $count;
>
> This makes it easy to get the count for any particular query.  I'm not 
> sure of the best way to store the "top counts" idea.  Perhaps a secondary 
> process which iterates over all the queries, sorts them by count, and 
> then stores them into another ColumnFamily.
>
> You could use the same idea for the last query (session ids by query):
>
> <ColumnFamily Name="QueriesRecorded"
>                 ComparesWith="UTF8Type"
>                 ColumnType="super"
>                 CompareSubcolumnsWith="TimeUUIDType"
>                 />
> And then access it as:
> key-space.QueriesRecorded['query-foo-bar'][timeuuid] = session-id;
>
> Actually, if you used that idea (queries-recorded), you could generate 
> the counts and aggregates directly from that data in a Hadoop 
> post-processing step...
>
> But perhaps others will have better ideas.  If you haven't read 
> http://arin.me/blog/wtf-is-a-supercolumn-cassandra-data-model, go read 
> it now.  It won't answer your question directly, but it describes the 
> process of modeling a blog in Cassandra, so you can get a sense of how 
> to approach it.
>
> Dave Viner
>
>
>
>
> On Mon, Jul 26, 2010 at 4:46 PM, Mark <static.void.dev@gmail.com> wrote:
>
>     We are thinking about using Cassandra to store our search logs.
>     Can someone point me in the right direction/lend some guidance on
>     design? I am new to Cassandra and I am having trouble wrapping my
>     head around some of these new concepts. My brain keeps wanting to
>     go back to an RDBMS design.
>
>     We will be storing the user query, the number of hits returned, and
>     the session id. We would like to be able to answer the following
>     questions:
>
>     - What are the n most popular queries and their counts within the
>     last x (mins/hours/days/etc)? Basically, the most popular searches
>     within a given time range.
>     - What is the most popular query within the last x where hits = 0?
>     Same as above, but with an extra "where" clause.
>     - For session id x, what are all of its other queries?
>     - What are all the session ids that searched for 'foos'?
>
>     We accomplish the above functionality w/ MySQL using 2 tables. One
>     for the raw search log information and the other to keep the
>     aggregate/running counts of queries.
>
>     Would this sort of ad-hoc querying be better implemented using
>     Hadoop + Hive? If so, should I be storing all this information in
>     Cassandra then using Hadoop to retrieve it?
>
>     Thanks for your suggestions
>
>
"Perhaps a secondary process which iterates over all the queries, sorts 
them by count, and then stores them into another ColumnFamily."

- I was trying to avoid this. Is there some sort of atomic increment 
feature available? I guess I could do the same thing we are currently 
doing, which is:

a) store full query details into table A
b) query table B for aggregate count of query 'foo' then store count + 1
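
As a rough sketch of that two-step approach, here it is in plain Python with in-memory stand-ins rather than real Cassandra or MySQL calls. The names raw_log, query_counts, record_search and top_queries are made up for illustration and are not part of any client API:

```python
from collections import defaultdict

# Hypothetical in-memory stand-ins for the two tables (not a Cassandra API).
raw_log = []                     # "table A": one row per search
query_counts = defaultdict(int)  # "table B": running count per query


def record_search(query, hits, session_id):
    """Store one search and bump the running count for its query."""
    # a) store the full query details
    raw_log.append({"query": query, "hits": hits, "session_id": session_id})
    # b) read the current count, then write back count + 1
    current = query_counts[query]
    query_counts[query] = current + 1


def top_queries(n):
    """Return the n most frequent queries as (query, count) pairs."""
    # Highest count first; ties are broken alphabetically by query text.
    return sorted(query_counts.items(), key=lambda kv: (-kv[1], kv[0]))[:n]


record_search("foo", 12, "s1")
record_search("foo", 0, "s2")
record_search("bar", 3, "s1")
print(top_queries(2))  # prints [('foo', 2), ('bar', 1)]
```

Note that step b is a read followed by a write, so two writers handling the same query at the same time could both read the same count and one increment would be lost. That is the reason for asking about an atomic increment.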
