mahout-user mailing list archives

From Lance Norskog <goks...@gmail.com>
Subject Re: how to prepare data efficiently for mahout
Date Sun, 01 Jan 2012 01:15:19 GMT
Hector is a more industrial-strength client for Cassandra. I have not used it.

https://github.com/rantav/hector

On Sat, Dec 31, 2011 at 10:50 AM, Sean Owen <srowen@gmail.com> wrote:
> You might get some mileage out of this article I wrote about using
> Cassandra as input for Hadoop/Mahout, though it's not specific to LDA:
>
> http://www.acunu.com/blogs/sean-owen/scaling-cassandra-and-mahout-hadoop/
>
> On Sat, Dec 31, 2011 at 10:36 AM, Allen <an.ronaldor@gmail.com> wrote:
>
>> Hello there,
>>
>> I am new to Mahout and trying to get Mahout running on our data
>> storage -- Cassandra. After poking around the LDA example on Reuters
>> data, I have several questions.
>>
>> 1) Where is the source code for seqdirectory and seq2sparse?
>>
>> 2) Before the algorithm can run, it looks like the raw text must be
>> converted and materialized into a sequence file that represents some
>> vectors. Is that true? If so, is there a more efficient way to handle
>> the conversion, such as streaming the data? In my project, all the data
>> is in Cassandra. If I need to run a Mahout algorithm, it seems I need
>> to get the data out, put it into a temporary directory in HDFS,
>> convert it into a sequence file, and finally turn it into tf-vector
>> format in HDFS. Then I can run the algorithm. The above procedure
>> stores 2 temporary data sets, which will make the run slow.
>>
>> Many thanks.
>>
>> --
>> Allen
>>
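
To make the streaming idea in the quoted question concrete, here is a minimal
sketch in plain Python. It is not Mahout's seqdirectory/seq2sparse code and it
does not call any Cassandra or Hadoop API. The function names and the
(doc_id, text) input shape are assumptions made for illustration. It only shows
that term-frequency vectors can be produced one document at a time from any
row iterator, without first writing the raw text to an intermediate file.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def stream_tf_vectors(documents, dictionary=None):
    """Yield (doc_id, {term_index: count}) lazily, one document at a time.

    `documents` is any iterable of (doc_id, text) pairs, for example rows
    pulled from a database cursor (hypothetical input shape).
    `dictionary` maps term -> integer index and grows as new terms appear.
    """
    if dictionary is None:
        dictionary = {}
    for doc_id, text in documents:
        counts = Counter(tokenize(text))
        vector = {}
        for term, count in counts.items():
            # Assign the next free index the first time a term is seen.
            index = dictionary.setdefault(term, len(dictionary))
            vector[index] = count
        yield doc_id, vector


if __name__ == "__main__":
    rows = [("doc1", "Oil prices rise, oil exports fall"),
            ("doc2", "Prices fall")]
    terms = {}
    for doc_id, vec in stream_tf_vectors(rows, terms):
        print(doc_id, vec)
```

Because the generator holds only one document in memory at a time, the
dictionary is the only state that grows. A real job would still need to write
the resulting vectors in whatever format the downstream algorithm reads.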



-- 
Lance Norskog
goksron@gmail.com
