Message-ID: <4C6AF9BB.4060004@windwardsolutions.com>
Date: Tue, 17 Aug 2010 14:06:03 -0700
From: Jeff Eastman
MIME-Version: 1.0
To: dev@mahout.apache.org
Reply-To: dev@mahout.apache.org
Subject: Re: [jira] Commented: (MAHOUT-479) Streamline classification/ clustering data structures
References: <31739530.331811281711259409.JavaMail.jira@thor> <76598.396061282062376620.JavaMail.jira@thor>
Content-Type: multipart/mixed; boundary="------------070000000604000107050109"

--------------070000000604000107050109
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit

Hi Ted,

I've made significant progress on both of these issues in commits late last month. At the command level, all the clustering drivers now inherit from AbstractJob and have their common parameters factored into the DefaultOptionCreator for API consistency. All drivers also support a buildClusters() method which processes input vectors to produce their respective Cluster models in persistent storage (clusters-i directory), plus a clusterData() method that reads those models and performs actual clustering of the input vectors. The sequence files in the clusters-i directories can be read uniformly by the ClusterDumper and other utilities as they all support the Cluster interface.

The clusterData() process for most algorithms produces a single, most-likely cluster assignment, usually the closest cluster. For Dirichlet and FuzzyK, the clustering can be specified to use the most-likely assignment (the default) or a pdf threshold can be specified above which multiple cluster assignments will be output. All clusterData() processes produce WeightedVectorWritable objects in persistent storage which contain a probability weight and the input vector. These sequence files are keyed by the clusterId and are output to the clusteredPoints directory.

The buildClusters() step is always run from the command line but the clusterData step is optional (-cl flag). It would be straightforward to support the other use case (clusterData only). Users who instantiate the drivers from Java code can call either/both at their discretion now.

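To make the two clusterData() assignment modes described above concrete, here is a minimal sketch in plain Java. It is not the Mahout code: ClusterAssignmentSketch, Assignment, mostLikely and aboveThreshold are hypothetical names invented for illustration, and the pdf array simply stands in for whatever per-cluster probabilities a Dirichlet or FuzzyK model would compute for one input vector.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch; none of these names are Mahout classes. */
class ClusterAssignmentSketch {

  /** One (clusterId, weight) pair, loosely analogous to a weighted output record. */
  static final class Assignment {
    final int clusterId;
    final double weight;

    Assignment(int clusterId, double weight) {
      this.clusterId = clusterId;
      this.weight = weight;
    }
  }

  /** Default mode: emit only the single most likely cluster (pdf must be non-empty). */
  static Assignment mostLikely(double[] pdf) {
    int best = 0;
    for (int i = 1; i < pdf.length; i++) {
      if (pdf[i] > pdf[best]) {
        best = i;
      }
    }
    return new Assignment(best, pdf[best]);
  }

  /** Threshold mode: emit every cluster whose pdf is strictly above the threshold. */
  static List<Assignment> aboveThreshold(double[] pdf, double threshold) {
    List<Assignment> out = new ArrayList<>();
    for (int i = 0; i < pdf.length; i++) {
      if (pdf[i] > threshold) {
        out.add(new Assignment(i, pdf[i]));
      }
    }
    return out;
  }

  public static void main(String[] args) {
    // Made-up probabilities of one input vector under three clusters.
    double[] pdf = {0.15, 0.60, 0.25};
    Assignment single = mostLikely(pdf);
    System.out.println("most likely: cluster " + single.clusterId + ", weight " + single.weight);
    for (Assignment a : aboveThreshold(pdf, 0.2)) {
      System.out.println("above threshold: cluster " + a.clusterId + ", weight " + a.weight);
    }
  }
}
```

In threshold mode one input vector can yield several records, each keyed by a different cluster id, which is why the output carries a weight alongside the vector.
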
I've also implemented an execution method (-xm) parameter on all clustering drivers which allows the sequential, in-memory reference implementation to be invoked from the command line using the same arguments as the mapreduce implementation. The display examples use these now, except Dirichlet which I didn't get to before I left.

Given this information, what do you now see as logical next steps?

Jeff

On 8/17/10 9:31 AM, Ted Dunning wrote:
> Jeff,
>
> You asked about clustering things to do.
>
> In my mind, there are two clustering issues. One is unification at the
> command level where clusters are learned. The other is unification in
> subsequent steps where somebody might want to use a clustering. The second
> issue actually seems a bit more pressing to me.
>
> That second issue concerns the ability to have a model that is the output of
> the clustering. That model should support:
>
> - reading the model from persistent storage
>
> - classifying new vectors to get either a single best-fit cluster or a score
> vector.
>
>
> In my view, this should apply equally to all classifiers and the models
> produced by classifier learning algorithms should be the same at the
> interface level as the models produced by cluster learning algorithms.
>
>
> On Tue, Aug 17, 2010 at 9:26 AM, Ted Dunning (JIRA) wrote:
>
>> [
>> https://issues.apache.org/jira/browse/MAHOUT-479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12899452#action_12899452]
>>
>> Ted Dunning commented on MAHOUT-479:
>> ------------------------------------
>>
>> I just moved the encoding objects associated with MAHOUT-228 to
>> org.apache.mahout.vectors to provide a nucleus for feature encoding.
>>
>> There are also a fair number of things in oam.text and oam.utils that are
>> related. Since those are in the utils module, however, I couldn't leverage
>> them. We may want to consider moving some of them to core to allow wider
>> use.
>>
>>> Streamline classification/ clustering data structures
>>> -----------------------------------------------------
>>>
>>> Key: MAHOUT-479
>>> URL: https://issues.apache.org/jira/browse/MAHOUT-479
>>> Project: Mahout
>>> Issue Type: Improvement
>>> Components: Classification, Clustering
>>> Affects Versions: 0.1, 0.2, 0.3, 0.4
>>> Reporter: Isabel Drost
>>>
>>> Opening this JIRA issue to collect ideas on how to streamline our
>> classification and clustering algorithms to make integration for users
>> easier as per mailing list thread
>> http://markmail.org/message/pnzvrqpv5226twfs
>>> {quote}
>>> Jake and Robin and I were talking the other evening and a common lament
>> was that our classification (and clustering) stuff was all over the map in
>> terms of data structures. Driving that to rest and getting those comments
>> even vaguely as plug and play as our much more advanced recommendation
>> components would be very, very helpful.
>>> {quote}
>>> This issue probably also relates to MAHOUT-287 (intention there is to
>> make naive bayes run on vectors as input).
>>> Ted, Jake, Robin: Would be great if someone of you could add a comment on
>> some of the issues you discussed "the other evening" and (if applicable) any
>> minor or major changes you think could help solve this issue.
>>
>> --
>> This message is automatically generated by JIRA.
>> -
>> You can reply to this email to add a comment to the issue online.
>>
>>

--------------070000000604000107050109--
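
For reference, the two model requirements in Ted's quoted message (read the model from persistent storage, then classify a new vector to either a single best-fit cluster or a score vector) might look roughly like the sketch below. This is only an illustration under stated assumptions, not an existing or proposed Mahout API: VectorModelSketch, CentroidModelSketch, scores, bestFit, write, read and roundTrip are all hypothetical names, the negated squared Euclidean distance is an arbitrary scoring choice, and an in-memory byte array stands in for persistent storage.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

/** Hypothetical interface that a clustering or classifier model could share. */
interface VectorModelSketch {
  /** One score per category (cluster or class); higher means a better fit. */
  double[] scores(double[] vector);

  /** Index of the single best-fit category. */
  default int bestFit(double[] vector) {
    double[] s = scores(vector);
    int best = 0;
    for (int i = 1; i < s.length; i++) {
      if (s[i] > s[best]) {
        best = i;
      }
    }
    return best;
  }
}

/** Toy centroid model; the scoring rule is an assumption made for illustration. */
class CentroidModelSketch implements VectorModelSketch {
  final double[][] centroids;

  CentroidModelSketch(double[][] centroids) {
    this.centroids = centroids;
  }

  /** Score is the negated squared Euclidean distance to each centroid. */
  @Override
  public double[] scores(double[] vector) {
    double[] s = new double[centroids.length];
    for (int k = 0; k < centroids.length; k++) {
      double d2 = 0.0;
      for (int j = 0; j < vector.length; j++) {
        double diff = vector[j] - centroids[k][j];
        d2 += diff * diff;
      }
      s[k] = -d2;
    }
    return s;
  }

  /** Writes the centroid count, the dimension, then every coordinate. */
  void write(DataOutput out) throws IOException {
    out.writeInt(centroids.length);
    out.writeInt(centroids.length == 0 ? 0 : centroids[0].length);
    for (double[] c : centroids) {
      for (double x : c) {
        out.writeDouble(x);
      }
    }
  }

  /** Reads a model back in the layout produced by write(). */
  static CentroidModelSketch read(DataInput in) throws IOException {
    int k = in.readInt();
    int dim = in.readInt();
    double[][] c = new double[k][dim];
    for (int i = 0; i < k; i++) {
      for (int j = 0; j < dim; j++) {
        c[i][j] = in.readDouble();
      }
    }
    return new CentroidModelSketch(c);
  }

  /** Serializes to a byte array and reads it back, standing in for real storage. */
  static CentroidModelSketch roundTrip(CentroidModelSketch model) {
    try {
      ByteArrayOutputStream bytes = new ByteArrayOutputStream();
      model.write(new DataOutputStream(bytes));
      return read(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    CentroidModelSketch model = new CentroidModelSketch(new double[][] {{0, 0}, {10, 10}});
    CentroidModelSketch restored = roundTrip(model);
    System.out.println("best fit for (9, 8): " + restored.bestFit(new double[] {9, 8}));
  }
}
```

Because callers only see scores() and bestFit(), a model produced by a classifier learner could in principle sit behind the same interface as one produced by a clustering run, which is the interface-level unification the quoted message asks for.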