cassandra-user mailing list archives

From Dave Gardner <>
Subject Re: cassandra as user-profile data store
Date Thu, 03 Mar 2011 17:46:34 GMT

We are in production with 0.6. We started with this version and haven't had
time to figure out how to upgrade smoothly. It's on the horizon though; there
are loads of features in 0.7 that we could really do with.

In terms of strategy, we don't currently follow Tyler's suggestions. I can't
see any reason why we _wouldn't_ want to do this. However, when we first
implemented Cassandra, the big issue was building a data store that could
handle a lot of profile updates and serve low-latency reads on demand, both
with a large number of users. Right now we use a bunch of different systems
to generate the profiles, including Amazon EMR (via Hive). All of this is
subject to change soon though!

We do use Hadoop a lot to carry out analysis on the profiles.
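
For anyone trying to picture the timeline approach from Dave's question
quoted below, here is a rough sketch. It is not our production code and it
does not use a Cassandra client; a plain Python dict stands in for the column
family, and the names time_uuid, record_event and build_profile are made up
for illustration. The batch step is a toy version of what a Hadoop job might
do, assuming a profile is just per-site visit counts within a time window.

```python
import uuid
from collections import Counter, defaultdict

# 100 ns ticks between the UUID epoch (1582-10-15) and the Unix epoch.
UUID_EPOCH_OFFSET = 0x01B21DD213814000
TICKS_PER_SECOND = 10_000_000


def time_uuid(unix_seconds, seq=0):
    """Build a version 1 (time-based) UUID for a given Unix timestamp."""
    ticks = int(unix_seconds) * TICKS_PER_SECOND + UUID_EPOCH_OFFSET
    fields = (
        ticks & 0xFFFFFFFF,      # time_low
        (ticks >> 32) & 0xFFFF,  # time_mid
        (ticks >> 48) & 0x0FFF,  # time_hi (version bits are set below)
        (seq >> 8) & 0x3F,       # clock_seq_hi
        seq & 0xFF,              # clock_seq_low
        0,                       # node: fixed placeholder, not a real MAC
    )
    return uuid.UUID(fields=fields, version=1)


# Stand-in for the "timeline" CF: row key (user id) -> {TimeUUID: value}.
timeline = defaultdict(dict)


def record_event(user_id, unix_seconds, site):
    """Add one column to the user's row, named by a TimeUUID."""
    row = timeline[user_id]
    # seq keeps column names unique when two events share a timestamp.
    row[time_uuid(unix_seconds, seq=len(row) & 0x3FFF)] = site


def build_profile(user_id, window_start, window_end):
    """Toy batch step: count visits per site in [window_start, window_end)."""
    counts = Counter()
    row = timeline.get(user_id, {})
    # Walk the columns in time order, as a TimeUUID-sorted row would return them.
    for column in sorted(row, key=lambda u: u.time):
        seconds = (column.time - UUID_EPOCH_OFFSET) // TICKS_PER_SECOND
        if window_start <= seconds < window_end:
            counts[row[column]] += 1
    return dict(counts)


# Example: two visits by one user, profiled over a 10 day window.
record_event("user-1", 1299000000, "site-y")
record_event("user-1", 1299003600, "site-y")
print(build_profile("user-1", 1299000000, 1299000000 + 10 * 86400))
```

In a real deployment the row would live in Cassandra and the counting would
run as a Hadoop job over many rows; the sketch only shows the shape of the
data and of the derivation.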

It would be great to hear updates as and when you implement your system. If
you're ever in London, you could even present them at the Cassandra meetup!


On 1 March 2011 17:16, Dave Viner <> wrote:

> Hi Dave,
> Glad to hear others are using it in this fashion!
> Are you using Tyler's suggested strategy for user-profile data: one CF
> that stores the "timeline", with rows of user-ids and TimeUUID columns for
> each data-collection-time, then some post-processing with Hadoop over the
> timelines for each user to build a "Profile"?
> Are you on 0.7 or 0.6.x?
> Dave Viner
> On Tue, Mar 1, 2011 at 1:31 AM, Dave Gardner <> wrote:
>> Dave
>> Tyler's answer already covers CFs etc.
>> We are using Cassandra to store user profile data for exactly the sort of
>> use case you describe. We don't yet store _all_ the data in Cassandra;
>> currently we are focusing on the stuff we need available for real-time
>> access. We use Hadoop to analyse the profiles from within Cassandra.
>> Dave
>> On 23 February 2011 23:21, Dave Viner <> wrote:
>>> Hi all,
>>> I'm wondering if anyone has used cassandra as a datastore for a
>>> user-profile service.  I'm thinking of applications like behavioral
>>> targeting, where there are lots & lots of users (10s to 100s of millions),
>>> and lots & lots of data about them intermixed in, say, weblogs (probably
>>> TBs worth).  The idea would be to use Cassandra as a datastore for distributed
>>> parallel processing of the TBs of files (say on hadoop).  Then the resulting
>>> user-profiles would be query-able quickly.
>>> Anyone know of that sort of application of Cassandra?  I'm trying to
>>> puzzle out just what the column family might look like.  Seems like a mix of
>>> time-oriented information (user x visits site y at time z), location
>>> information (user x appeared from ip x.y.z.a which is geo-location 31.20309,
>>> 120.10923), and derived information (because user x visited site y 15 times
>>> within a 10 day window, user x must be interested in buying a car).
>>> I don't have specifics as yet... just some general thoughts.  But this
>>> feels like a Cassandra type problem.  (User profile can have lots of columns
>>> per user, but the exact columns might differ from user to user... very
>>> scalable, etc)
>>> Thanks
>>> Dave Viner
