mahout-user mailing list archives

From Allen <>
Subject Re: how to prepare data efficiently for mahout
Date Thu, 05 Jan 2012 21:20:29 GMT
Thanks for the information.

I went through org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles
and it looks like its job is to turn <doc_id,
content> sequence files into <doc_id, tf_vector> sequence files. I don't
understand why it saves temporary files several times. I think it
follows the procedure below, where the result after each transformation is
saved, which I think is unnecessary.

<doc_id, content> => <doc_id, List<String>> => <word, wordcount>
  => <word, integer_id> => <doc_id, tf_vector>

If the content of each document is small enough, say several MB, which
is true for most plain text documents, wouldn't it be better to run
the above procedure in memory? That is, read <doc_id, content> from
somewhere (Cassandra in my case), then do the tf_vector
calculation entirely in memory and dump the final result to some
output location.

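To illustrate the idea, here is a minimal sketch of doing the whole <doc_id, content> => <doc_id, tf_vector> procedure in memory. This is not Mahout's actual code (the class name, the plain-Map vector representation, and the naive whitespace tokenizer are all my own simplifications), just the same dictionary-then-vectorize steps without any intermediate files, assuming every document fits in RAM:

```java
import java.util.*;

// Hypothetical sketch: in-memory equivalent of the
// <doc_id, content> => <doc_id, tf_vector> pipeline.
public class InMemoryTfVectorizer {

    // Step 1: <doc_id, content> => <doc_id, List<String>>
    static List<String> tokenize(String content) {
        return Arrays.asList(content.toLowerCase().split("\\W+"));
    }

    public static void main(String[] args) {
        // In practice these pairs would be read from Cassandra.
        Map<String, String> docs = new LinkedHashMap<>();
        docs.put("doc1", "apache mahout runs on hadoop");
        docs.put("doc2", "mahout vectorizes text into tf vectors");

        // Step 2: build the dictionary <word, integer_id>
        Map<String, Integer> dictionary = new LinkedHashMap<>();
        for (String content : docs.values()) {
            for (String word : tokenize(content)) {
                dictionary.putIfAbsent(word, dictionary.size());
            }
        }

        // Step 3: <doc_id, tf_vector>, here a sparse map of term_id -> count
        Map<String, Map<Integer, Integer>> tfVectors = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : docs.entrySet()) {
            Map<Integer, Integer> tf = new TreeMap<>();
            for (String word : tokenize(e.getValue())) {
                tf.merge(dictionary.get(word), 1, Integer::sum);
            }
            tfVectors.put(e.getKey(), tf);
        }

        System.out.println(tfVectors.get("doc2"));
    }
}
```

Nothing is written to disk between the steps; the only output is the final vectors, which could then be dumped as a single sequence file for the algorithm to consume.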
On Sat, Dec 31, 2011 at 1:50 PM, Sean Owen <> wrote:
> You might get some mileage out of this article I wrote about using
> Cassandra as input for Hadoop/Mahout, though it's not specific to LDA:
> On Sat, Dec 31, 2011 at 10:36 AM, Allen <> wrote:
>> Hello there,
>> I am new to Mahout and trying to get Mahout running on our data
>> storage -- Cassandra. After poking around the LDA example on reuters
>> data, I have several questions.
>> 1) Where is the source code for seqdirectory and seq2sparse?
>> 2) Before the algorithm can run, it looks like the raw text must be
>> converted and materialized into a sequence file which represents some
>> vectors. Is that true? If so, is there a more efficient way to handle
>> the conversion, like streaming the data? In my project, all the data is
>> in Cassandra. If I need to run some Mahout algorithm, it seems I need
>> to get the data out, put it into a temporary directory in HDFS,
>> convert it into sequence files, and finally turn them into tf-vector
>> format in HDFS. Only then can I run the algorithm. Two temporary copies
>> of the data are stored in the above procedure, which will make the run slow.
>> Many thanks.
>> --
>> Allen

