mahout-user mailing list archives

From Lance Norskog <>
Subject Re: how to prepare data efficiently for mahout
Date Sun, 01 Jan 2012 01:15:19 GMT
Hector is a more industrial-strength client for Cassandra. I have not used it.

On Sat, Dec 31, 2011 at 10:50 AM, Sean Owen <> wrote:
> You might get some mileage out of this article I wrote about using
> Cassandra as input for Hadoop/Mahout, though it's not specific to LDA:
> On Sat, Dec 31, 2011 at 10:36 AM, Allen <> wrote:
>> Hello there,
>> I am new to Mahout and trying to get Mahout running on our data
>> storage -- Cassandra. After poking around the LDA example on Reuters
>> data, I have several questions.
>> 1) Where is the source code for seqdirectory and seq2sparse?
>> 2) Before the algorithm can run, it looks like the raw text must be
>> converted and materialized into a sequence file that represents some
>> vectors. Is that true? If so, is there a more efficient way to handle
>> the conversion, such as streaming the data? In my project, all the data
>> is in Cassandra. If I need to run a Mahout algorithm, it seems I need
>> to get the data out, put it into a temporary directory in HDFS,
>> convert it into a sequence file, and finally turn that into tf-vectors
>> format in HDFS. Only then can I run the algorithm. The procedure above
>> stores two temporary copies of the data, which will make the run slow.
>> Many thanks.
>> --
>> Allen
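
To make the streaming idea in Allen's second question concrete, here is a minimal single-process sketch in Python of turning (id, text) rows into term-frequency vectors one row at a time, without writing an intermediate copy. It is not how seqdirectory or seq2sparse work, and it does not produce Mahout's on-disk format. The names fake_cassandra_rows, tokenize, and stream_tf_vectors are hypothetical, and the row source is a stand-in for a real Cassandra read.

```python
import re
from collections import Counter
from typing import Dict, Iterable, Iterator, Tuple

# Assumption: a simple lowercase alphanumeric tokenizer; a real job would
# use a proper analyzer.
TOKEN = re.compile(r"[a-z0-9]+")


def tokenize(text: str) -> list:
    """Split text into lowercase alphanumeric tokens."""
    return TOKEN.findall(text.lower())


def stream_tf_vectors(
    rows: Iterable[Tuple[str, str]],
    dictionary: Dict[str, int],
) -> Iterator[Tuple[str, Dict[int, int]]]:
    """Yield (doc_id, sparse tf vector) for each row as it arrives.

    The dictionary maps term -> integer index and grows as new terms
    are seen, so no separate pass over the data is needed.
    """
    for doc_id, text in rows:
        counts: Counter = Counter()
        for term in tokenize(text):
            index = dictionary.setdefault(term, len(dictionary))
            counts[index] += 1
        yield doc_id, dict(counts)


def fake_cassandra_rows() -> Iterator[Tuple[str, str]]:
    """Hypothetical stand-in for rows read from Cassandra."""
    yield "doc1", "Mahout runs LDA on Hadoop"
    yield "doc2", "Cassandra feeds Hadoop; Hadoop feeds Mahout"


dictionary: Dict[str, int] = {}
vectors = list(stream_tf_vectors(fake_cassandra_rows(), dictionary))
```

Because stream_tf_vectors is a generator, each vector could be handed to a writer as soon as it is produced. Building the dictionary on the fly only works in a single process; a distributed job would need some other way to agree on term indices.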

Lance Norskog
