cassandra-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Dave Viner <davevi...@pobox.com>
Subject Re: Design questions/Schema help
Date Tue, 27 Jul 2010 01:06:48 GMT
I'd love to hear other's opinions here... but here are my 2 cents.

With Cassandra, you need to think of the queries - which you've pretty much
done.

For the most popular queries, you could do something like:

            <ColumnFamily Name="QueriesCounted"
                ComparesWith="UTF8Type"
                />
And then access it as:
key-space.QueriesCounted['query-foo-bar'] = $count;

This makes it easy to get the count for any particular query.  I'm not sure
the best way to store the "top counts" idea.  Perhaps a secondary process
which iterates over all the queries to see which sorts the query values by
count, and then stores them into another ColumnFamily.

You could use the same idea for the last query (session ids by query)

            <ColumnFamily Name="QueriesRecorded"
                ComparesWith="UTF8Type"
                ColumnType="super"
CompareSubcolumnsWith="TimeUUIDType"
                />
And then access it as:
key-space. QueriesRecorded['query-foo-bar'][timeuuid] = session-id;

Actually, if you used that idea (queries-recorded), you could generate the
counts and aggregates from that directly in a hadoop post-processing...

But perhaps others will have better ideas.  If you haven't read
http://arin.me/blog/wtf-is-a-supercolumn-cassandra-data-model, go read it
now.  It won't answer your question directly, but will describe the process
of modeling a blog in cassandra so you can get a sense of the process.

Dave Viner




On Mon, Jul 26, 2010 at 4:46 PM, Mark <static.void.dev@gmail.com> wrote:

>  We are thinking about using Cassandra to store our search logs. Can
> someone point me in the right direction/lend some guidance on design? I am
> new to Cassandra and I am having trouble wrapping my head around some of
> these new concepts. My brain keeps wanting to go back to a RDBMS design.
>
> We will be storing the user query, # of hits returned and their session id.
> We would like to be able to answer the following questions.
>
> - What is the n most popular queries and their counts within the last x
> (mins/hours/days/etc). Basically the most popular searches within a given
> time range.
> - What is the most popular query within the last x where hits = 0. Same as
> above but with an extra "where" clause
> - For session id x give me all their other queries
> - What are all the session ids that searched for 'foos'
>
> We accomplish the above functionality w/ MySQL using 2 tables. One for the
> raw search log information and the other to keep the aggregate/running
> counts of queries.
>
> Would this sort of ad-hoc querying be better implemented using Hadoop +
> Hive? If so, should I be storing all this information in Cassandra then
> using Hadoop to retrieve it?
>
> Thanks for your suggestions
>

Mime
View raw message