mahout-user mailing list archives

From Michael Kurze
Subject Re: Mahout Clustering - integration with HBase
Date Fri, 11 Mar 2011 12:06:23 GMT
Robin and Ted, thanks for your feedback!

On Mar 11, 2011, at 12:20 AM, Ted Dunning wrote:

> Actually hbase is really good at scanning.
> Updates can be pretty fast as well (I have seen nearly 200K small updates per second
> on half a dozen nodes).

Yes, our table schema is optimized for scanning; that is among the reasons we picked HBase
over simple key/value stores. A second reason is that we also have a large number of
small collections to cluster (user feedback about individual websites being broken in Firefox,
each with 10s to 1000s of small docs). Since that is closer to random access than to streaming,
HBase seems like a good fit. For those collections, ideally we would talk to Mahout in a
way that does not hit HDFS at all.

Your numbers on updates are very encouraging; that’s about the cluster size we’re starting with.

In general, the best (still generic) interface to Mahout for us would probably (roughly) be
one that accepts iterables over vectors and returns an iterator of (centroid, cluster). The
iterable passed in by clients could open an hbase scanner for each iteration (or whatever
an app needs to do to get data). From what I’ve seen in the KMeans source that might actually
not be too far off from what’s already there. The result iterator could then either be fed
from an HDFS file or (generator style) from the final stage of the algorithm as it is running.
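To make the shape of that interface concrete, here is a minimal in-memory sketch in plain Java. It is only an illustration under stated assumptions: `IterableKMeansSketch`, `ClusterResult`, and `cluster` are hypothetical names and not part of Mahout or HBase, vectors are plain `double[]`, and the results are collected in memory rather than produced generator style. The point it shows is that the algorithm asks the `Iterable` for a fresh iterator on every pass, which is where a client could open a new scanner.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

/** Hypothetical sketch; none of these names are Mahout or HBase APIs. */
final class IterableKMeansSketch {

    /** One result element: a centroid and the points assigned to it. */
    static final class ClusterResult {
        final double[] centroid;
        final List<double[]> members = new ArrayList<>();

        ClusterResult(double[] centroid) {
            this.centroid = centroid;
        }
    }

    /**
     * Runs plain k-means. The for-each loops call points.iterator() once per
     * pass, so a client-supplied Iterable can re-open its data source each time.
     */
    static Iterator<ClusterResult> cluster(Iterable<double[]> points,
                                           double[][] initialCentroids,
                                           int maxIterations) {
        int k = initialCentroids.length;
        int dim = initialCentroids[0].length;
        double[][] centroids = new double[k][];
        for (int c = 0; c < k; c++) {
            centroids[c] = initialCentroids[c].clone();
        }

        for (int iter = 0; iter < maxIterations; iter++) {
            double[][] sums = new double[k][dim];
            int[] counts = new int[k];
            for (double[] p : points) { // one full pass over the input
                int c = nearest(centroids, p);
                counts[c]++;
                for (int d = 0; d < dim; d++) {
                    sums[c][d] += p[d];
                }
            }
            boolean changed = false;
            for (int c = 0; c < k; c++) {
                if (counts[c] == 0) {
                    continue; // keep the old centroid for an empty cluster
                }
                for (int d = 0; d < dim; d++) {
                    double mean = sums[c][d] / counts[c];
                    if (mean != centroids[c][d]) {
                        changed = true;
                    }
                    centroids[c][d] = mean;
                }
            }
            if (!changed) {
                break; // converged
            }
        }

        // Final pass: attach every point to its nearest centroid.
        List<ClusterResult> results = new ArrayList<>();
        for (double[] centroid : centroids) {
            results.add(new ClusterResult(centroid));
        }
        for (double[] p : points) {
            results.get(nearest(centroids, p)).members.add(p);
        }
        return results.iterator();
    }

    /** Index of the centroid with the smallest squared Euclidean distance to p. */
    static int nearest(double[][] centroids, double[] p) {
        int best = 0;
        double bestDist = Double.MAX_VALUE;
        for (int c = 0; c < centroids.length; c++) {
            double dist = 0.0;
            for (int d = 0; d < p.length; d++) {
                double diff = centroids[c][d] - p[d];
                dist += diff * diff;
            }
            if (dist < bestDist) {
                bestDist = dist;
                best = c;
            }
        }
        return best;
    }
}
```

A client would pass its own `Iterable<double[]>` whose `iterator()` method opens whatever source it needs; how that maps onto an HBase scanner or onto Mahout's vector types is left open here.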

To get things started, we’ll probably do the round trip to HDFS as outlined in the MIA
example. That’ll also give us a baseline for evaluating future enhancements.

> You can achieve even higher write rates if you create hfiles off-line and then convert
> them into tables administratively instead of writing to live tables.

For now we plan to load the generated clusters online, since we will also have lots of smaller
collections that we want to update close to immediately. Loading tables off-line is still
an interesting suggestion, especially in a situation where we might change overall clustering
parameters and need to recompute everything.

> On Thu, Mar 10, 2011 at 1:29 PM, Robin Anil wrote:
> > Browsing through "Mahout in Action" I saw a nice example that gets you
> > started with updateable clustering (Listing 9.4 in the 1/19 snapshot). My
> > question now is: are there any ideas or even best practices on reading and
> > writing documents, vectors and clusters directly from and to HBase rather
> > than the file system (HDFS).
> >
> Clustering is a very expensive process and hdfs is designed to read data
> sequentially just the way kmeans likes it.
> So given that reading and writing files of HBASE is going to be very
> expensive for repeated iterations, as compared to one time dump, cluster,
> write cluster info back into hbase (maybe into a different column family)
