mahout-user mailing list archives

From Grant Ingersoll
Subject Re: Have a idea of leveraging hbase for machine learning
Date Mon, 16 Nov 2009 13:34:41 GMT

On Nov 16, 2009, at 3:54 AM, Jeff Zhang wrote:

> Hi all,
> I started learning HBase recently, and I found that we can use HBase for
> machine learning.
> In machine learning, we often need to handle matrices and vectors, which
> are a good fit for storage in HBase.
> E.g., we often have to compute the doc-term matrix in text classification.
> If we use HBase, we can store each document as a row, with the document ID
> as the row ID and the tf (term frequency) values as columns.
> For example, suppose we have one document A titled "love", with the content:
> I love this game.
> Then we can store it as one HBase row:
> A: {title:love=>1,
> content:I=>1,content:love=>1,content:this=>1,content:game=>1}
> Using HBase, it will be very easy for us to compute the similarity between
> documents.
> Another advantage of HBase compared to raw text data is that it is
> semi-structured, so I think it will be easier to program against than the
> raw data.
> This is what I think so far. Maybe something is not correct; I hope
> to hear ideas from experts.
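
The row layout and similarity computation described above can be sketched in plain Python, with no HBase client involved. A dict keyed by `family:qualifier` stands in for one HBase row; the function names (`to_row`, `cosine`), the word-level tokenization, and the choice of cosine similarity are assumptions made for illustration, not something taken from Mahout or HBase.

```python
import math
import re
from collections import Counter


def to_row(doc_id, title, content):
    """Build (row_id, columns) for one document.

    Columns map 'family:qualifier' to a term frequency, mirroring the
    title:/content: layout in the example. Tokenization here (runs of
    word characters, case preserved) is an assumption.
    """
    columns = Counter()
    for term in re.findall(r"\w+", title):
        columns[f"title:{term}"] += 1
    for term in re.findall(r"\w+", content):
        columns[f"content:{term}"] += 1
    return doc_id, dict(columns)


def cosine(row_a, row_b):
    """Cosine similarity between two sparse term-frequency rows."""
    dot = sum(count * row_b.get(col, 0) for col, count in row_a.items())
    norm_a = math.sqrt(sum(c * c for c in row_a.values()))
    norm_b = math.sqrt(sum(c * c for c in row_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)


row_id, row = to_row("A", "love", "I love this game.")
```

Because each row is sparse, the dot product only visits the columns present in one document, which is the property that makes a column-per-term layout attractive for this kind of computation.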

If you check out the classification algorithms in Mahout, they have HBase as a storage option. Feedback on them would be appreciated.

I tend to favor being agnostic of the underlying storage as much as possible.
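
As a rough illustration of what storage-agnostic code might look like, the sketch below has an algorithm depend only on a small interface, so that an HBase-backed or file-backed store could be swapped in later. Everything here (`TermVectorStore`, `InMemoryStore`, `document_frequency`) is hypothetical and written for this example; none of it is a Mahout or HBase API.

```python
from typing import Dict, Iterable, Protocol, Tuple

TermVector = Dict[str, int]


class TermVectorStore(Protocol):
    """Hypothetical storage interface; not a Mahout or HBase API."""

    def put(self, doc_id: str, vector: TermVector) -> None: ...

    def scan(self) -> Iterable[Tuple[str, TermVector]]: ...


class InMemoryStore:
    """Simplest possible backend: rows held in a dict."""

    def __init__(self) -> None:
        self._rows: Dict[str, TermVector] = {}

    def put(self, doc_id: str, vector: TermVector) -> None:
        self._rows[doc_id] = dict(vector)

    def scan(self) -> Iterable[Tuple[str, TermVector]]:
        # Iterate in row-key order, as a sorted key-value store would.
        return iter(sorted(self._rows.items(), key=lambda kv: kv[0]))


def document_frequency(store: TermVectorStore) -> Dict[str, int]:
    """Count how many documents contain each term.

    Only uses scan(), so it works with any backend that satisfies
    the TermVectorStore interface.
    """
    df: Dict[str, int] = {}
    for _doc_id, vector in store.scan():
        for term in vector:
            df[term] = df.get(term, 0) + 1
    return df


store = InMemoryStore()
store.put("A", {"love": 1, "game": 1})
store.put("B", {"love": 2})
frequencies = document_frequency(store)
```

The algorithm never names a concrete backend, so the storage decision stays a deployment detail rather than something baked into the learning code.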

