From: Ted Dunning
Date: Sat, 5 Mar 2011 14:56:08 -0800
Subject: Re: Reentering at the ground floor
To: user@mahout.apache.org
Cc: Benson Margulies

Quickstart:
https://cwiki.apache.org/confluence/display/MAHOUT/Quickstart

JIRAs with recent activity:
https://issues.apache.org/jira/browse/MAHOUT-588
https://issues.apache.org/jira/browse/MAHOUT-551
https://issues.apache.org/jira/browse/MAHOUT-390

Chapters 6-12 of MiA (conflict of interest alert!)

Hashed vector encoding:
https://hudson.apache.org/hudson/job/Mahout-Quality/javadoc/org/apache/mahout/vectorizer/encoders/package-summary.html

This won't be as good as you would like in terms of fit and finish. All
contributions toward that end are VERY welcome.

On Sat, Mar 5, 2011 at 12:03 PM, Benson Margulies wrote:
> I may have finally been handed a reason to make a serious attempt to
> use mahout, and here I am more or less where I tried to start a very
> long time ago.
>
> Imagine that someone else has gone and stuck a large number of text
> docs into a hadoop file system. I want to
>
> a- convert them to feature vectors
> b- run canopy+kmeans or some such clusterer
> c- report back the assignment of docs to clusters
>
> Where should I start reading in the web site?
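
For step (a) in the quoted question, the idea behind hashed vector encoding can be illustrated with a minimal Python sketch. This is not Mahout's encoder API, which is Java and is documented at the javadoc link in the reply. The function names `tokenize` and `hashed_vector`, the vector size of 16, the use of CRC32 as the hash, and the two sample documents are all assumptions made for illustration only.

```python
import re
import zlib


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def hashed_vector(text, size=16):
    """Encode text as a fixed-length term-count vector via the hashing trick.

    Each token is hashed to a slot in [0, size); no dictionary of terms is
    kept, so distinct tokens may collide in the same slot.
    """
    vec = [0.0] * size
    for token in tokenize(text):
        # crc32 is deterministic across runs, unlike Python's built-in hash().
        index = zlib.crc32(token.encode("utf-8")) % size
        vec[index] += 1.0
    return vec


# Hypothetical sample documents standing in for files stored in HDFS.
docs = {
    "doc1": "Mahout clusters text documents",
    "doc2": "Hadoop stores the text documents",
}
vectors = {name: hashed_vector(text) for name, text in docs.items()}
```

Every document maps to a vector of the same length regardless of vocabulary, which is what lets a clusterer such as k-means consume the output directly. The trade-off is that collisions merge unrelated terms, so a real run would likely use a much larger vector size.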