hive-user mailing list archives

From: Vinay Seth Mohta
Subject: Implementing cohort analysis
Date: Wed, 03 Feb 2010 19:10:33 GMT

I've been thinking of using Hadoop/Hive to do cohort analysis on a
large data set.  The general structure of the problem is:
- define a cohort of users using some criteria (e.g. users who visited
on 2010-01-10, or users who visited an experimental landing page)
- track their behavior over time (e.g. users who visited landing page
v2 had twice as many sessions in the following 10 days as users who
visited landing page v1)
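
To make the two steps above concrete, here is a minimal in-memory
sketch in Python. The visit log, its (user_id, visit_date,
landing_page_version) layout, and the function names build_cohorts and
sessions_per_cohort are all assumptions made for illustration; a real
data set would live in HDFS or a Hive table rather than a list.

```python
from collections import defaultdict
from datetime import date, timedelta

# Hypothetical visit log: (user_id, visit_date, landing_page_version).
# landing_page_version is None for visits that did not hit a landing page.
visits = [
    ("u1", date(2010, 1, 10), "v1"),
    ("u1", date(2010, 1, 12), None),
    ("u2", date(2010, 1, 10), "v2"),
    ("u2", date(2010, 1, 11), None),
    ("u2", date(2010, 1, 15), None),
    ("u3", date(2010, 1, 10), "v2"),
    ("u3", date(2010, 1, 13), None),
    ("u3", date(2010, 1, 14), None),
    ("u3", date(2010, 1, 25), None),  # outside the 10-day window
]

def build_cohorts(visits):
    """Step 1: map each user to (cohort label, entry date), taken from
    that user's earliest landing-page visit."""
    cohorts = {}
    for user, day, page in sorted(visits, key=lambda v: v[1]):
        if page is not None and user not in cohorts:
            cohorts[user] = (page, day)
    return cohorts

def sessions_per_cohort(visits, cohorts, window_days=10):
    """Step 2: average number of follow-up visits per user that fall
    within window_days after the entry date, grouped by cohort."""
    counts = defaultdict(int)
    for user, day, _ in visits:
        if user in cohorts:
            _, start = cohorts[user]
            if start < day <= start + timedelta(days=window_days):
                counts[user] += 1
    per_cohort = defaultdict(list)
    for user, (label, _) in cohorts.items():
        per_cohort[label].append(counts[user])
    return {label: sum(c) / len(c) for label, c in per_cohort.items()}

cohorts = build_cohorts(visits)
averages = sessions_per_cohort(visits, cohorts)
```

With this made-up data, averages comes out as {"v1": 1.0, "v2": 2.0},
which mirrors the "twice as many sessions" example above. Each visit
is counted as one session here, which is a simplification.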

While I've seen many folks include bullets in presentations that
indicate that they use Hadoop/Hive for cohort analysis, I haven't seen
any good examples for how they implement it (at least not via Google).

The two approaches that I see are:

1) use Hadoop streaming:
    - first, run a map-only job to filter out the cohort and create an
output file with all user ids that you want to track
    - second, run a map/reduce job that clusters by user, where the
mapper filters the data to only the user ids identified in the
previous job (e.g. via a hash lookup) and the reducer computes
behavior for this subset of users

2) use Hive with a WHERE clause to limit the cohort, then use UDAFs
so that you get all rows for each user and implement the desired
functionality in the UDAF's iterate method.
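
As a rough illustration of approach 1, the two passes can be sketched
as plain Python functions. This is only a local simulation under
stated assumptions: the tab-separated log layout (user_id, date,
page), the page name landing_v2, and the function names are all
hypothetical. In an actual streaming job each function would be a
script reading stdin and writing stdout, and the cohort id file from
pass 1 would have to be made available to the pass 2 mappers.

```python
from collections import defaultdict

# Hypothetical tab-separated log lines: user_id, date, page.
LOG = [
    "u1\t2010-01-10\tlanding_v2",
    "u1\t2010-01-11\thome",
    "u2\t2010-01-10\tlanding_v1",
    "u1\t2010-01-14\thome",
    "u3\t2010-01-12\thome",
]

def cohort_mapper(lines, page="landing_v2"):
    """Pass 1 (map-only): emit the ids of users who hit the cohort page."""
    for line in lines:
        user, _, visited = line.split("\t")
        if visited == page:
            yield user

def filter_mapper(lines, cohort_ids):
    """Pass 2 mapper: keep only rows for cohort users, keyed by user id."""
    for line in lines:
        user, day, visited = line.split("\t")
        if user in cohort_ids:  # in-memory hash lookup against pass 1 output
            yield user, (day, visited)

def behavior_reducer(pairs):
    """Pass 2 reducer: count rows per user, as a stand-in for whatever
    per-user behavior metric is actually wanted."""
    grouped = defaultdict(list)
    for user, value in pairs:  # Hadoop streaming would deliver these sorted by key
        grouped[user].append(value)
    for user in sorted(grouped):
        yield user, len(grouped[user])

cohort_ids = set(cohort_mapper(LOG))
result = dict(behavior_reducer(filter_mapper(LOG, cohort_ids)))
```

Here cohort_ids is {"u1"} and result is {"u1": 3}. The reducer plays
roughly the role that the UDAF would play in approach 2: it sees every
row for one user and folds them into a single value.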

Are there other / better ways to do cohort analysis?  Anyone have
examples they're willing to post or point me to?

Thanks in advance,
