mahout-user mailing list archives

From Ted Dunning <ted.dunn...@gmail.com>
Subject Re: Mahout Clustering - integration with HBase
Date Thu, 10 Mar 2011 23:20:51 GMT
Actually, HBase is really good at scanning.

Updates can be pretty fast as well (I have seen nearly 200K small updates
per second on half a dozen nodes).

You can achieve even higher write rates if you create HFiles offline and
then convert them into tables administratively instead of writing to live
tables.

On Thu, Mar 10, 2011 at 1:29 PM, Robin Anil <robin.anil@gmail.com> wrote:

> > Browsing through "Mahout in Action" I saw a nice example that gets you
> > started with updateable clustering (Listing 9.4 in the 1/19 snapshot). My
> > question now is: are there any ideas or even best practices on reading and
> > writing documents, vectors and clusters directly from and to HBase rather
> > than the file system (HDFS).
> >
>
> Clustering is a very expensive process, and HDFS is designed to read data
> sequentially, just the way k-means likes it.
> So reading from and writing to HBase on every iteration is going to be very
> expensive compared to a one-time dump, cluster, and write of the cluster
> info back into HBase (maybe into a different column family).
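
For what it's worth, the dump once, cluster, write back workflow quoted above can be sketched roughly as follows. This is only an illustration under loud assumptions: a plain Python dict stands in for the HBase table, the "vector" and "cluster" keys stand in for column families, and a tiny hand-rolled k-means stands in for Mahout's. None of the names below come from HBase or Mahout.

```python
# Sketch only: a plain dict stands in for an HBase table (row key -> families).
# No real HBase or Mahout API is used; every name here is hypothetical.

def dist2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(point, centroids):
    """Index of the centroid closest to point."""
    return min(range(len(centroids)), key=lambda i: dist2(point, centroids[i]))

def kmeans(points, centroids, iterations=10):
    """Plain Lloyd iterations over an in-memory list of points."""
    for _ in range(iterations):
        groups = [[] for _ in centroids]
        for p in points:
            groups[nearest(p, centroids)].append(p)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [
            [sum(dim) / len(g) for dim in zip(*g)] if g else c
            for g, c in zip(groups, centroids)
        ]
    return centroids

# Stand-in "table": each row has a "vector" family and, later, a "cluster" family.
table = {
    "doc1": {"vector": [0.0, 0.0]},
    "doc2": {"vector": [0.0, 1.0]},
    "doc3": {"vector": [10.0, 10.0]},
    "doc4": {"vector": [10.0, 11.0]},
}

# 1. One-time dump: a single sequential pass over the table.
row_keys = sorted(table)
points = [table[k]["vector"] for k in row_keys]

# 2. Cluster: every iteration re-reads the local dump, never the table.
centroids = kmeans(points, centroids=[points[0], points[2]])

# 3. Write back: one more pass, storing assignments in a separate family.
for key, p in zip(row_keys, points):
    table[key]["cluster"] = nearest(p, centroids)
```

The point of the shape is that the table is touched exactly twice (one scan out, one batch of writes back), while all the repeated passes happen over the local copy.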
