Date: Tue, 6 Apr 2010 00:42:15 -0400
Subject: clustering your data with dirichlet issue
From: Toby Doig
To: mahout-user@lucene.apache.org

I've run Dirichlet from the command line and now have an output folder containing state-0, state-1, ... state-5 folders, each of which contains part-00000 and .part-00000.crc files. However, the "Retrieving the Output" section of the ClusteringYourData wiki page just says TODO, so I don't know how to turn those part files into something useful.

http://cwiki.apache.org/MAHOUT/clusteringyourdata.html

I successfully ran the org.apache.mahout.clustering.syntheticcontrol.dirichlet.Job test, which output its results as text (to the console at least), so I tried copying the printResults() methods from that class into org.apache.mahout.clustering.dirichlet.DirichletJob, but to no avail. Can someone help?

Also, the command-line job asks for the prototypeSize (-s param). When I converted my Lucene index to a vector file, the output said it created 11 vectors, but when I specified that value for prototypeSize the job failed, saying it found 1793 vectors. Changing the value I specify to 1793 works, but now I wonder why I need to specify it at all if the job can figure it out. Could it not be optional?