mahout-user mailing list archives

From Ted Dunning <>
Subject Re: Mahout Clustering - integration with HBase
Date Fri, 11 Mar 2011 18:16:09 GMT
On Fri, Mar 11, 2011 at 4:06 AM, Michael Kurze <> wrote:

> > Updates can be pretty fast as well (I have seen nearly 200K small updates
> per second on half a dozen nodes).
> Yes, our table schema is optimized for scanning, that is among the reasons
> we picked HBase vs. simple key/value stores. A second reason would be that
> we also have a large number of small collections to cluster (user feedback
> about individual websites being broken in Firefox, each 10s to 1000s of
> small docs). Since that is more close to random access than streaming, HBase
> seems like a good idea. For those collections, ideally we would talk to
> Mahout in a way so that it does not hit HDFS at all.

That should be quite doable.

Your key design will be critical.  The figure of merit is, of course, useful
rows scanned per second.  That will be affected positively by including more
regions in the scan and negatively by having lower density of interesting
rows in the regions being scanned.  If you could cluster all collections in
the same map-reduce job, you have the ideal situation because essentially
all rows of your dataset are live and useful.  At that point, I would
consider arranging the key so that a single collection is contiguous, unless
you have to load a single collection at a time at high speed, in which case
I would arrange the key to spread each collection across regions.
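
To make the two key layouts concrete, here is a minimal sketch in Python. It
only builds row-key bytes and does not talk to HBase. The "|" separator, the
NUM_BUCKETS value, and both function names are assumptions made for
illustration, not part of any existing schema.

```python
import hashlib

NUM_BUCKETS = 16  # hypothetical number of salt buckets


def contiguous_key(collection_id: str, doc_id: str) -> bytes:
    """Row key that keeps all docs of one collection adjacent in sort order."""
    return f"{collection_id}|{doc_id}".encode("utf-8")


def spread_key(collection_id: str, doc_id: str) -> bytes:
    """Row key with a hash-derived prefix, so one collection lands in
    up to NUM_BUCKETS separate key ranges."""
    digest = hashlib.md5(doc_id.encode("utf-8")).digest()
    bucket = digest[0] % NUM_BUCKETS
    return f"{bucket:02d}|{collection_id}|{doc_id}".encode("utf-8")
```

With contiguous_key, one collection is a single prefix scan. With spread_key,
reading one collection takes one scan per bucket, but those scans can run in
parallel against different regions.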

You will need to adapt the current clustering code so that it understands
that multiple clusterings are going on simultaneously and so that it can
vectorize your rows as you read them.  Vectorization may be expensive enough
that you would rather store feature vectors in HBase.
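
As a rough illustration of vectorizing rows while reading them, the sketch
below tags every vector with the collection it belongs to. It assumes rows
arrive as (row_key, text) pairs with the collection id as the key prefix, and
it uses a hashed bag-of-words of a hypothetical size DIM. Neither assumption
comes from Mahout; a real job would use whatever vectorizer suits the data.

```python
import zlib

DIM = 8  # hypothetical feature-vector size


def vectorize(text: str, dim: int = DIM) -> list:
    """Hashed bag-of-words: each token increments one of `dim` slots."""
    vec = [0.0] * dim
    for token in text.lower().split():
        vec[zlib.crc32(token.encode("utf-8")) % dim] += 1.0
    return vec


def read_rows(rows):
    """Yield (collection_id, vector) for each (row_key, text) pair.

    The collection id is taken to be everything before the first "|".
    """
    for row_key, text in rows:
        collection_id = row_key.split("|", 1)[0]
        yield collection_id, vectorize(text)
```

If vectorize turns out to dominate the scan time, the same vectors could be
computed once and written back as a column, so later passes skip this step.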

The two phases of k-means iterations (assignment to cluster, computation of
cluster centroids) can easily be expressed in terms of map-reduce and you
have strong example code already.  I am not quite sure of the current state of
play relative to how easily you can serialize cluster descriptions, but that
would be key.
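
A minimal in-memory sketch of one iteration, with many collections clustered
at once, might look like the following. It is plain Python rather than Mahout
or Hadoop code. The function names, the (collection_id, cluster_index) key,
and the dict-of-lists centroid layout are all assumptions chosen to keep the
example small.

```python
import json
from collections import defaultdict


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign(point_stream, centroids):
    """Map phase: emit ((collection_id, cluster_index), vector)."""
    for collection_id, vec in point_stream:
        cs = centroids[collection_id]
        nearest = min(range(len(cs)), key=lambda i: sq_dist(vec, cs[i]))
        yield (collection_id, nearest), vec


def recompute(assigned, centroids):
    """Reduce phase: average the vectors emitted for each key.

    Clusters that received no points keep their previous centroid.
    """
    groups = defaultdict(list)
    for key, vec in assigned:
        groups[key].append(vec)
    updated = {c: [list(v) for v in cs] for c, cs in centroids.items()}
    for (collection_id, idx), vecs in groups.items():
        updated[collection_id][idx] = [sum(col) / len(vecs) for col in zip(*vecs)]
    return updated


def kmeans_step(points, centroids):
    """One full iteration: assignment followed by centroid recomputation."""
    return recompute(assign(points, centroids), centroids)


def dump_centroids(centroids) -> str:
    """Serialize cluster descriptions so the next pass can reload them."""
    return json.dumps(centroids)
```

Because the collection id is part of the emitted key, a single pass over the
table updates every collection's clusters independently. The JSON round trip
stands in for whatever cluster serialization the real code ends up using.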

> Your numbers on updates are very encouraging, that’s about the cluster size
> we’re starting with!

Contact me off-list for details about how to do this.
