We are in production with 0.6. We started with this and haven't had time to figure out how to upgrade smoothly. It's on the horizon though; there are loads of features in 0.7 that we could really do with.

In terms of strategy, we don't currently follow Tyler's suggestions. I can't see any reason why we _wouldn't_ want to do this. However, when we first adopted Cassandra, the big issue was getting a data store that could handle a lot of updates to profiles and serve low-latency reads on demand, both with a large number of users. Right now we use a bunch of different systems to generate the profiles, including Amazon EMR (via Hive). All of this is subject to change soon though!

We do use Hadoop a lot to carry out analysis on the profiles.

It would be great to hear updates as and when you implement your system. If you're ever in London, you could even present them at the Cassandra meetup! http://meetup.com/Cassandra-London


On 1 March 2011 17:16, Dave Viner <daveviner@gmail.com> wrote:
Hi Dave,

Glad to hear others are using it in this fashion!

Are you using Tyler's suggested strategy for user-profile data: one CF that stores the "timeline", with rows keyed by user id and a TimeUUID column for each data-collection time, then some post-processing with Hadoop over each user's timeline to build a "Profile"?
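
To make the layout in the question above concrete, here is a minimal sketch in plain Python. It is not Cassandra client code: the `timeline` dict only stands in for the column family, and `record_event`, `build_profile` and the `"site"` field are hypothetical names chosen for illustration. The real post-processing step would run in Hadoop rather than in a single loop.

```python
import uuid
from collections import defaultdict

# Hypothetical in-memory stand-in for the "timeline" column family:
# row key = user id, column name = TimeUUID, column value = event payload.
timeline = defaultdict(dict)


def record_event(user_id, event):
    """Store one event under a fresh version-1 (time-based) UUID column."""
    column = uuid.uuid1()
    timeline[user_id][column] = event
    return column


def build_profile(user_id):
    """Fold a user's timeline, in time order, into a per-site visit count."""
    profile = defaultdict(int)
    # UUID.time is the 60-bit timestamp of a version-1 UUID, so sorting
    # on it walks the columns in collection order.
    for column in sorted(timeline[user_id], key=lambda c: c.time):
        profile[timeline[user_id][column]["site"]] += 1
    return dict(profile)
```

Calling `record_event("u1", {"site": "a.example"})` a few times and then `build_profile("u1")` gives a dict of visit counts per site, which is the kind of derived "Profile" row the question describes.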

Are you on 0.7 or 0.6.x?  

Dave Viner

On Tue, Mar 1, 2011 at 1:31 AM, Dave Gardner <dave.gardner@visualdna.com> wrote:

Tyler's answer already covers CFs etc.

We are using Cassandra to store user profile data for exactly the sort of use case you describe. We don't yet store _all_ the data in Cassandra; currently we are focusing on the stuff we need available for real-time access. We use Hadoop to analyse the profiles from within Cassandra.


On 23 February 2011 23:21, Dave Viner <daveviner@gmail.com> wrote:
Hi all,

I'm wondering if anyone has used Cassandra as a datastore for a user-profile service.  I'm thinking of applications like behavioral targeting, where there are lots & lots of users (10s to 100s of millions), and lots & lots of data about them intermixed in, say, weblogs (probably TBs worth).  The idea would be to use Cassandra as a datastore for distributed parallel processing of the TBs of files (say on Hadoop).  Then the resulting user profiles would be quickly queryable.

Anyone know of that sort of application of Cassandra?  I'm trying to puzzle out just what the column family might look like.  It seems like a mix of time-oriented information (user x visits site y at time z), location information (user x appeared from IP x.y.z.a, which geo-locates to 31.20309, 120.10923), and derived information (because user x visited site y 15 times within a 10-day window, user x must be interested in buying a car).
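
As an illustration of the "derived information" case, the rule "15 visits within a 10-day window" could be checked with a sliding window over one user's visit timestamps. This is only a sketch: `visited_often` and its parameters are hypothetical names, and in practice the check would run as part of the batch job that builds the profiles.

```python
from datetime import datetime, timedelta


def visited_often(visit_times, min_visits=15, window=timedelta(days=10)):
    """Return True if some span of length `window` holds >= `min_visits` visits."""
    times = sorted(visit_times)
    start = 0
    for end, t in enumerate(times):
        # Move the left edge forward until the span fits inside the window.
        while t - times[start] > window:
            start += 1
        if end - start + 1 >= min_visits:
            return True
    return False
```

The result could then be written back as a single derived column on the user's profile row (for example an interest flag), so the real-time read path never has to rescan the raw visits.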

I don't have specifics as yet... just some general thoughts.  But this feels like a Cassandra-type problem.  (A user profile can have lots of columns per user, but the exact columns might differ from user to user... very scalable, etc.)

Dave Viner