Date: Sat, 31 Dec 2011 17:15:19 -0800
Subject: Re: how to prepare data efficiently for mahout
From: Lance Norskog
To: user@mahout.apache.org

Hector is a more industrial-strength client for Cassandra. I have not used it.

https://github.com/rantav/hector

On Sat, Dec 31, 2011 at 10:50 AM, Sean Owen wrote:
> You might get some mileage out of this article I wrote about using
> Cassandra as input for Hadoop/Mahout, though it's not specific to LDA:
>
> http://www.acunu.com/blogs/sean-owen/scaling-cassandra-and-mahout-hadoop/
>
> On Sat, Dec 31, 2011 at 10:36 AM, Allen wrote:
>
>> Hello there,
>>
>> I am new to Mahout and trying to get Mahout running on our data
>> storage -- Cassandra. After poking around the LDA example on Reuters
>> data, I have several questions.
>>
>> 1) Where is the source code for seqdirectory and seq2sparse?
>>
>> 2) Before the algorithm can run, it looks like the raw text must be
>> converted and materialized into a sequence file which represents some
>> vectors. Is that true? If so, is there a more efficient way to handle
>> the conversion, like streaming the data? In my project, all the data is
>> in Cassandra. If I need to run some Mahout algorithm, it seems I need
>> to get the data out, put it into a temporary directory in HDFS,
>> convert it into a sequence file and finally turn it into tf-vectors
>> format in HDFS. Then I can run the algorithm. Two temporary copies of
>> the data are stored in the above procedure, which will make the run slow.
>>
>> Many thanks.
>>
>> --
>> Allen
>>

--
Lance Norskog
goksron@gmail.com
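
The streaming conversion asked about in question 2 can be illustrated with a minimal sketch. This is not Mahout's seq2sparse and uses no Mahout or Cassandra API: `fake_rows` is a hypothetical stand-in for a Cassandra range scan, and `tf_vectors` is an illustrative helper that turns each row into a sparse term-frequency vector in a single pass, without writing the raw text to an intermediate directory first. In a real job the resulting vectors would still have to be written in whatever format the chosen Mahout algorithm expects.

```python
import re
from collections import Counter
from typing import Dict, Iterable, Iterator, Tuple

# Simple lowercase alphanumeric tokenizer (an assumption; real pipelines
# normally use a configurable analyzer).
TOKEN = re.compile(r"[a-z0-9]+")


def tf_vectors(
    rows: Iterable[Tuple[str, str]],
    dictionary: Dict[str, int],
) -> Iterator[Tuple[str, Dict[int, int]]]:
    """Yield (doc_id, sparse tf vector) one row at a time.

    `dictionary` maps each term to an integer index and grows as new
    terms are seen, so no separate dictionary-building pass is needed.
    """
    for doc_id, text in rows:
        counts = Counter(TOKEN.findall(text.lower()))
        vector: Dict[int, int] = {}
        for term, n in counts.items():
            # Assign the next free index the first time a term appears.
            index = dictionary.setdefault(term, len(dictionary))
            vector[index] = n
        yield doc_id, vector


def fake_rows() -> Iterator[Tuple[str, str]]:
    """Hypothetical stand-in for rows streamed out of the data store."""
    yield "doc1", "Mahout runs on Hadoop"
    yield "doc2", "Hadoop reads Hadoop input"


dictionary: Dict[str, int] = {}
vectors = list(tf_vectors(fake_rows(), dictionary))
```

Because `tf_vectors` is a generator, each document is tokenized and vectorized as it arrives, so only the dictionary is held in memory across rows. The trade-off is that a growing in-memory dictionary does not distribute across machines the way a MapReduce job does.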