mahout-user mailing list archives

From nfantone <nfant...@gmail.com>
Subject Re: Clustering from DB
Date Fri, 26 Jun 2009 16:21:39 GMT
Thanks for the fast response, Grant.

I am aware of what you pointed out about Taste. I just mentioned it to
make a reference to something similar to what I needed to
implement/use, namely the "DataModel" interface.

I'm going to try the solution you suggested and write an
implementation of VectorIterable. Expect me to come back here for
feedback.

My thanks again.

On Fri, Jun 26, 2009 at 12:41 PM, Grant Ingersoll<gsingers@apache.org> wrote:
>
> On Jun 26, 2009, at 10:20 AM, nfantone wrote:
>
>> Hi to you all, Mahout users. I'm new to the list and to Mahout itself
>> and I'm trying to integrate Taste into my project, in which I need to
>> cluster user data from a very large data set, based on their behavior,
>> which is stored in some tables in a local database. From what I've
>> read and experimented with, clustering in Mahout takes advantage of HDFS
>> and Lucene indexing, converting plain CSV files to Vectors. So, I ask:
>> is it mandatory to create plain text files (or HDFS files) and indexes
>> from the data in my DB so as to feed the clustering algorithms' input?
>> Couldn't I create, somehow, the Vectors directly and then use them to
>> initiate the clustering jobs? Is there any convenient way to achieve
>> this? I've not seen anything similar to the "DataModel" interface used
>> by Recommenders for JDBC connection (or any other connectivity API)
>> and the runJob static methods receive paths for both input and output
>> which, a priori, I don't have any use for. Documentation wasn't
>> helpful either as the "From a Database" section of "Creating Vectors
>> from Text" is currently empty.
>
> The clustering algorithms (on trunk) expect the input file to be a Hadoop
> SequenceFile of <Writable, Vector>
>
> The utils module contains an interface named VectorIterable which could
> easily be implemented to work with a JDBC connection.  There is an
> implementation of this for Lucene (LuceneIterable).  However, it is likely
> just as easy to write your own ResultSet loop that reads from your DB and
> outputs the SequenceFile.  There are SequenceFile.Writer examples in several
> places in the utils module.  See the Driver class in the utils module for
> an example.
>
> Also, FYI, Taste is separate from what you seem to be implying you want to
> do.  Taste is a collaborative filtering engine that lives in Mahout.  Mahout
> also has several clustering implementations like k-Means, Canopy, Dirichlet,
> etc.
>
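
For reference, the "ResultSet loop" suggested above might be sketched
roughly as follows. This is only an illustration under assumptions, not
code from Mahout: the names ResultSetVectors and VectorSink are
hypothetical, and the query is assumed to return a key in column 1
followed by numeric feature columns. The sink stands in for the point
where a real job would wrap the values in a Mahout Vector and append
them to a Hadoop SequenceFile.Writer.

```java
import java.sql.ResultSet;

/** Hypothetical helper: walks a JDBC ResultSet and emits one feature vector per row. */
final class ResultSetVectors {

    /**
     * Hypothetical stand-in for the output step. In a real job this is where
     * the (key, vector) pair would be appended to a SequenceFile.Writer.
     */
    interface VectorSink {
        void append(String key, double[] features) throws Exception;
    }

    /**
     * Assumes column 1 holds the row key and every remaining column is numeric.
     * Returns the number of rows handed to the sink.
     */
    static int export(ResultSet rows, VectorSink sink) throws Exception {
        int featureCount = rows.getMetaData().getColumnCount() - 1;
        if (featureCount < 1) {
            throw new IllegalArgumentException("expected a key column plus at least one feature column");
        }
        int written = 0;
        while (rows.next()) {
            double[] features = new double[featureCount];
            for (int i = 0; i < featureCount; i++) {
                // JDBC columns are 1-based and column 1 is the key, so features start at 2.
                features[i] = rows.getDouble(i + 2);
            }
            sink.append(rows.getString(1), features);
            written++;
        }
        return written;
    }
}
```

Keeping the sink as an interface means the database loop can be tried
out without a Hadoop installation; the SequenceFile-specific part is
then confined to a single implementation of append.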
