Date: Fri, 18 Sep 2009 08:32:31 +0200
Message-ID: <6475fa040909172332g4dea807cpff788409fb5ad920@mail.gmail.com>
Subject: Re: Some basic introductory questions
From: Aleksander Stensby
To: mahout-user@lucene.apache.org

Of course, I'm happy to.

You should probably add a few follow-up questions to questions like "Do you
currently use or develop with Mahout?" - if I answer yes, but not in
production, I still plan on using it in production :)
Same goes for the second question :)

As for the last question, "standalone batch programs with defined file-based
inputs and outputs" is obviously "acceptable" to me, but ideally I would like
the second and third options.

Cheers,
Aleks

On Thu, Sep 17, 2009 at 11:02 PM, Ted Dunning wrote:

> Aleksander,
>
> As a (temporarily) naive user of the system, you are in a special position
> to answer a few use-case questions. Because I think that we need to collect
> some of these impressions, I have created a simple form with fewer than a
> dozen questions about intended use and preferred shape of the software.
>
> Could you go to the URL below to answer those questions?
>
> http://spreadsheets.google.com/viewform?formkey=dGdZMXNSLVBwWXhuX2E0cmVfNmJ3R1E6MA
> ..
>
> On Thu, Sep 17, 2009 at 11:59 AM, Aleksander Stensby <aleksander.stensby@integrasco.com> wrote:
>
> > Thanks for all the replies guys!
> > I understand the flow of things and it makes sense, but like Shawn pointed
> > out there could still be more abstraction (and once I get my hands dirty
> > I'll try to do my best to contribute here as well :) )
> >
> > And to Levy: your proposed flow of things makes sense, but what I wanted
> > was to do all that from one entry point. (Ideally, I don't want to do
> > manual stuff here, I want everything to be able to run on a regular basis
> > from a single entry point - and by that I mean any algorithm etc.) And I
> > can probably do that just fine by using the Drivers etc.
> >
> > Again, thanks for the replies!
> >
> > Cheers,
> > Aleks
> >
> > On Thu, Sep 17, 2009 at 3:35 PM, Grant Ingersoll wrote:
> >
> > >
> > > On Sep 17, 2009, at 6:24 AM, Levy, Mark wrote:
> > >
> > >> Hi Aleksander,
> > >>
> > >> I've also been learning how to run mahout's clustering and LDA on our
> > >> cluster.
> > >>
> > >> For k-means, the following series of steps has worked for me:
> > >>
> > >> * build mahout from trunk
> > >>
> > >> * write a program to convert your data to mahout Vectors. You can base
> > >> this on one of the Drivers in the mahout.utils.vectors package (which
> > >> seem designed to work locally). For bigger datasets you'll probably
> > >> need to write a simple map reduce job, more like
> > >> mahout.clustering.syntheticcontrol.canopy.InputDriver. In either event
> > >> your Vectors need to end up on the dfs.
> > >>
> > >
> > > Yeah, they are designed for local so far, but we should work to extend
> > > them. I think as Mahout matures, this will become less and less of a
> > > problem. Ultimately, I'd like to see utilities that simply ingest
> > > whatever is up on HDFS (office docs, PDFs, mail, etc.) and just work,
> > > but that is a _long_ way off, unless someone wants to help drive that.
> > >
> > > Those kinds of utilities would be great contributions from someone
> > > looking to get started contributing. As I see it, we could leverage
> > > Apache Tika with a M/R job to produce the appropriate kinds of things
> > > for our various algorithms.
> > >
> > >
> > >> * run clustering with org.apache.mahout.clustering.kmeans.KMeansDriver,
> > >> something like:
> > >> hadoop jar mahout-core-0.2-SNAPSHOT.job
> > >> org.apache.mahout.clustering.kmeans.KMeansDriver -i /dfs/input/data/dir
> > >> -c /dfs/initial/rand/centroids/dir -o /dfs/output/dir -k
> > >> -x
> > >>
> > >> * possibly fix the problem described here
> > >> http://www.nabble.com/ClassNotFoundException-with-pseudo-distributed-run-of-KMeans-td24505889.html
> > >> (solution is at the bottom of the page)
> > >>
> > >> * get all the output files locally
> > >>
> > >> * convert the output to text format with
> > >> org.apache.mahout.utils.clustering.ClusterDumper. It might be nicer to
> > >> do this on the cluster, but the code seems to expect local files. If
> > >> you set the name field in your input Vectors in the conversion step to
> > >> a suitable ID, then the final output can be a set of cluster centroids,
> > >> each followed by the list of Vector IDs in the corresponding cluster.
> > >>
> > >> Hope this is useful.
> > >>
> > >> More importantly, if anything here is very wrong then please can a
> > >> mahout person correct me!
> > >>
> > >
> > > Looks good to me. Suggestions/patches are welcome!
> > >
> >
> >
> > --
> > Aleksander M. Stensby
> > Lead Software Developer and System Architect
> > Integrasco A/S
> > E-mail: aleksander.stensby@integrasco.com
> > Tel.: +47 41 22 82 72
> > www.integrasco.com
> > http://twitter.com/Integrasco
> > http://facebook.com/Integrasco
> >
> > Please consider the environment before printing all or any of this e-mail
> >
>
> --
> Ted Dunning, CTO
> DeepDyve
>
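
The k-means steps quoted above can be wrapped behind a single entry point, which is what the thread asks for. What follows is only a minimal Python sketch under stated assumptions. The driver class name and the -i, -c, -o, -k and -x flags are copied from the quoted command. The function names, the local directory, the reading of -k as a cluster count and -x as an iteration limit, and the use of `hadoop fs -get` for the download step are illustrative choices that the messages do not confirm. The vector conversion and ClusterDumper steps are left out because the thread does not give their arguments.

```python
import shlex

# Class name taken from the quoted command; everything else in this file
# (function names, parameter names, local paths) is hypothetical.
KMEANS_DRIVER = "org.apache.mahout.clustering.kmeans.KMeansDriver"


def kmeans_command(job_jar, input_dir, centroids_dir, output_dir, k, x):
    """Build the argv for the `hadoop jar ... KMeansDriver` step.

    k and x fill the -k and -x flags, whose values are missing from the
    quoted command; treating them as cluster count and iteration limit
    is an assumption.
    """
    return [
        "hadoop", "jar", job_jar, KMEANS_DRIVER,
        "-i", input_dir,
        "-c", centroids_dir,
        "-o", output_dir,
        "-k", str(k),
        "-x", str(x),
    ]


def fetch_command(output_dir, local_dir):
    """Build the argv that copies the clustering output to local disk."""
    return ["hadoop", "fs", "-get", output_dir, local_dir]


def pipeline(job_jar, input_dir, centroids_dir, output_dir, local_dir, k, x):
    """Return the steps in order, as a list of argv lists."""
    return [
        kmeans_command(job_jar, input_dir, centroids_dir, output_dir, k, x),
        fetch_command(output_dir, local_dir),
    ]


def dry_run(steps):
    """Render each step as a shell-quoted line without executing anything."""
    return [shlex.join(argv) for argv in steps]
```

Each list returned by `pipeline` can be handed to `subprocess.run(argv, check=True)` once the paths and values are your own; printing the result of `dry_run` first shows what would execute, which is useful before scheduling the whole thing to run on a regular basis.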