incubator-cassandra-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Mark <static.void....@gmail.com>
Subject Re: Design questions/Schema help
Date Tue, 27 Jul 2010 02:40:33 GMT
On 7/26/10 7:06 PM, Dave Viner wrote:
> AFAIK, atomic increments are not available.  There recently has been 
> quite a bit of discussion about them.  So, you might search the archives.
>
>
> Dave Viner
>
> On Mon, Jul 26, 2010 at 7:02 PM, Mark <static.void.dev@gmail.com 
> <mailto:static.void.dev@gmail.com>> wrote:
>
>     On 7/26/10 6:06 PM, Dave Viner wrote:
>>     I'd love to hear other's opinions here... but here are my 2 cents.
>>
>>     With Cassandra, you need to think of the queries - which you've
>>     pretty much done.
>>
>>     For the most popular queries, you could do something like:
>>
>>     <ColumnFamily Name="QueriesCounted"
>>                     ComparesWith="UTF8Type"
>>                     />
>>     And then access it as:
>>     key-space.QueriesCounted['query-foo-bar'] = $count;
>>
>>     This makes it easy to get the count for any particular query.
>>      I'm not sure the best way to store the "top counts" idea.
>>      Perhaps a secondary process which iterates over all the queries
>>     to see which sorts the query values by count, and then stores
>>     them into another ColumnFamily.
>>
>>     You could use the same idea for the last query (session ids by query)
>>
>>     <ColumnFamily Name="QueriesRecorded"
>>                     ComparesWith="UTF8Type"
>>                     ColumnType="super"
>>     CompareSubcolumnsWith="TimeUUIDType"
>>                     />
>>     And then access it as:
>>     key-space. QueriesRecorded['query-foo-bar'][timeuuid] = session-id;
>>
>>     Actually, if you used that idea (queries-recorded), you could
>>     generate the counts and aggregates from that directly in a hadoop
>>     post-processing...
>>
>>     But perhaps others will have better ideas.  If you haven't read
>>     http://arin.me/blog/wtf-is-a-supercolumn-cassandra-data-model, go
>>     read it now.  It won't answer your question directly, but will
>>     describe the process of modeling a blog in cassandra so you can
>>     get a sense of the process.
>>
>>     Dave Viner
>>
>>
>>
>>
>>     On Mon, Jul 26, 2010 at 4:46 PM, Mark <static.void.dev@gmail.com
>>     <mailto:static.void.dev@gmail.com>> wrote:
>>
>>         We are thinking about using Cassandra to store our search
>>         logs. Can someone point me in the right direction/lend some
>>         guidance on design? I am new to Cassandra and I am having
>>         trouble wrapping my head around some of these new concepts.
>>         My brain keeps wanting to go back to a RDBMS design.
>>
>>         We will be storing the user query, # of hits returned and
>>         their session id. We would like to be able to answer the
>>         following questions.
>>
>>         - What is the n most popular queries and their counts within
>>         the last x (mins/hours/days/etc). Basically the most popular
>>         searches within a given time range.
>>         - What is the most popular query within the last x where hits
>>         = 0. Same as above but with an extra "where" clause
>>         - For session id x give me all their other queries
>>         - What are all the session ids that searched for 'foos'
>>
>>         We accomplish the above functionality w/ MySQL using 2
>>         tables. One for the raw search log information and the other
>>         to keep the aggregate/running counts of queries.
>>
>>         Would this sort of ad-hoc querying be better implemented
>>         using Hadoop + Hive? If so, should I be storing all this
>>         information in Cassandra then using Hadoop to retrieve it?
>>
>>         Thanks for your suggestions
>>
>>
>     "Perhaps a secondary process which iterates over all the queries
>     to see which sorts the query values by count, and then stores them
>     into another ColumnFamily."
>
>     - I was trying to avoid this. Is there some sort of atomic
>     increment feature available? I guess I could do the same thing we
>     are currently doing which is...
>
>     a) store full query details into table A
>     b) query table B for aggregate count of query 'foo' then store
>     count + 1
>
>
Thanks Ill look into that.

Say I am trying to model something like this:

SearchLogs : {
     foo : {
         TimeUUID_1 : {
             session : af55e102b67c2de27bf12024ac0e7798
             user_id : mr jiggles
         }
         TimeUUID_2 : {
             ....
          }
      }
      bar : {
         .....
     }
}

Would this be my config?

<ColumnFamily Name="SearchLogs"
                     ColumnType="Super"
                     ComparesWith="BytesType"
                     CompareSubcolumnsWith="TimeUUIDType"/>

So basically MyCompany.SearchLogs['foo'] would return an array of hashes 
ordered by time. Is this correct?






Mime
View raw message