Mailing-List: contact mahout-user-help@lucene.apache.org; run by ezmlm
Reply-To: mahout-user@lucene.apache.org
MIME-Version: 1.0
In-Reply-To: <4B4F672B.8020405@windwardsolutions.com>
References: <263CFF4F-85F1-4F2E-A030-C2F50770A27E@apache.org> <4B4E55FB.4010709@windwardsolutions.com> <4B4E5C0B.10803@windwardsolutions.com> <4B4F672B.8020405@windwardsolutions.com>
Date: Thu, 14 Jan 2010 20:56:10 +0200
Subject: Re: CardinalityException in DirichletDriver
From: Bogdan Vatkov
To: mahout-user@lucene.apache.org
Content-Type: multipart/alternative; boundary=001636c5a8debb034e047d247046

--001636c5a8debb034e047d247046
Content-Type: text/plain; charset=ISO-8859-1

Unfortunately I am using private data which I cannot share. I am using
emails indexed by Solr and creating vectors out of them. I am using them
with k-means and everything is OK. I just wanted to try out the Dirichlet
algorithm.

On Thu, Jan 14, 2010 at 8:49 PM, Jeff Eastman wrote:

> I gather you are doing text clustering? Are you using one of our example
> datasets or one which is publicly available?
>
> Bogdan Vatkov wrote:
>
>> Hi Jeff,
>>
>> What kind of details do you need to continue?
>> In the meantime I am going back to kmeans anyway (maybe I should really
>> start by adding canopy to my kmeans-only scenario first ;)).
>>
>> Best regards,
>> Bogdan
>>
>> On Thu, Jan 14, 2010 at 1:49 AM, Jeff Eastman wrote:
>>
>>> I think KMeans and Canopy are the most-used and therefore the most
>>> robust. Dirichlet still has not seen much use beyond some test examples,
>>> and NormalModel has at least one known problem (with sample() only
>>> returning the maximum likelihood) that has been reported but never fixed.
>>> Can you point me to the problem you are running so I can try to get up
>>> to speed?
>>> It has been some time since I worked in this code, but I'm keen to do so
>>> and I have some time to invest.
>>>
>>> Jeff
>>>
>>> Bogdan Vatkov wrote:
>>>
>>>> But if I am the first one to use Dirichlet, which algorithm is the
>>>> recommended one? Are all the other algorithms better than Dirichlet, so
>>>> no one has used it ;)?
>>>>
>>>> On Thu, Jan 14, 2010 at 1:23 AM, Jeff Eastman
>>>> <jdog@windwardsolutions.com> wrote:
>>>>
>>>>> The NormalModelDistribution seems to still think all the data vectors
>>>>> are size=2. In sampleFromPrior, it is creating models with that size.
>>>>> Subsequently, when you calculate the pdf with your data value (x), the
>>>>> sizes are incompatible. I suggest changing 'DenseVector(2)' to
>>>>> 'DenseVector(n)', where n is your data cardinality. Please also look at
>>>>> the rest of the math in DenseVector with suspicion. AFAIK, you are the
>>>>> first person to try to use Dirichlet.
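
The change Jeff suggests above can be sketched without the Mahout classes. This is only an illustration under stated assumptions: plain double[] arrays stand in for DenseVector, and SketchNormalModel and SketchNormalModelDistribution are hypothetical names, not Mahout API. The point is that the distribution is told the data cardinality n once and every sampled model gets a mean of that size.

```java
// Hypothetical stand-in for Mahout's NormalModel: a mean vector plus a standard deviation.
final class SketchNormalModel {
    final double[] mean;
    final double stdDev;

    SketchNormalModel(double[] mean, double stdDev) {
        this.mean = mean;
        this.stdDev = stdDev;
    }
}

// Hypothetical stand-in for NormalModelDistribution that knows the data cardinality.
final class SketchNormalModelDistribution {
    private final int cardinality;

    SketchNormalModelDistribution(int cardinality) {
        if (cardinality <= 0) {
            throw new IllegalArgumentException("cardinality must be positive: " + cardinality);
        }
        this.cardinality = cardinality;
    }

    // Each prior model gets a zero-filled mean of size n instead of a hard-coded size 2.
    SketchNormalModel[] sampleFromPrior(int howMany) {
        SketchNormalModel[] result = new SketchNormalModel[howMany];
        for (int i = 0; i < howMany; i++) {
            result[i] = new SketchNormalModel(new double[cardinality], 1.0);
        }
        return result;
    }
}
```

How the real driver would learn n (a command-line argument, or reading the first input vector) is left open here; the thread does not say.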
>>>>>
>>>>> Bogdan Vatkov wrote:
>>>>>
>>>>>> I see this stack when the size of the mean vector is set to 2:
>>>>>>
>>>>>> Daemon Thread [Thread-9] (Suspended (breakpoint at line 48 in NormalModel))
>>>>>>   NormalModel.<init>(Vector, double) line: 48
>>>>>>   NormalModelDistribution.sampleFromPrior(int) line: 33
>>>>>>   DirichletState.<init>(ModelDistribution, int, double, int, int) line: 48
>>>>>>   DirichletDriver.createState(String, int, double) line: 172
>>>>>>   DirichletDriver.writeInitialState(String, String, String, int, double) line: 150
>>>>>>   DirichletDriver.runJob(String, String, String, int, int, double, int) line: 133
>>>>>>   DirichletDriver.main(String[]) line: 109
>>>>>>   Clusters.doClustering() line: 244
>>>>>>   Clusters.access$0(Clusters) line: 175
>>>>>>   Clusters$1.run() line: 148
>>>>>>   Thread.run() line: 619
>>>>>>
>>>>>> public class NormalModelDistribution implements ModelDistribution {
>>>>>>   @Override
>>>>>>   public Model[] sampleFromPrior(int howMany) {
>>>>>>     Model[] result = new NormalModel[howMany];
>>>>>>     for (int i = 0; i < howMany; i++) {
>>>>>>       result[i] = new NormalModel(new DenseVector(2), 1);
>>>>>>     }
>>>>>>     return result;
>>>>>>   }
>>>>>>
>>>>>> and later, in pdf, this mean vector is dotted with the x vector:
>>>>>>
>>>>>>   @Override
>>>>>>   public double pdf(Vector x) {
>>>>>>     double sd2 = stdDev * stdDev;
>>>>>>     double exp = -(x.dot(x) - 2 * x.dot(mean) + mean.dot(mean)) / (2 * sd2);
>>>>>>     double ex = Math.exp(exp);
>>>>>>     return ex / (stdDev * sqrt2pi);
>>>>>>   }
>>>>>>
>>>>>> The x vector comes from the Hadoop MapRunner through the map function:
>>>>>>
>>>>>>   public void map(WritableComparable key, Vector v,
>>>>>>       OutputCollector output, Reporter reporter)
>>>>>>       throws IOException {
>>>>>>
>>>>>> Any idea?
>>>>>>
>>>>>> BTW, I am running Mahout 0.2. Should I move to 0.3 or to trunk? Is it
>>>>>> safe enough to run against trunk?
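
To make the failure in the quoted pdf concrete, here is a minimal sketch of the same arithmetic on plain arrays. PdfSketch and its methods are hypothetical names, double[] stands in for Mahout's Vector, and IllegalArgumentException stands in for the CardinalityException from the subject line. The exponent is the expanded form of the squared distance between x and mean, so both vectors must have the same cardinality before any of the three dot products makes sense.

```java
// Hypothetical stand-in for the quoted NormalModel.pdf, using double[] instead of Vector.
final class PdfSketch {
    private static final double SQRT_2PI = Math.sqrt(2.0 * Math.PI);

    // Sum of products of corresponding elements; rejects vectors of different lengths.
    static double dot(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException(
                "cardinality mismatch: " + a.length + " vs " + b.length);
        }
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }

    // Same arithmetic as the quoted pdf: exp(-|x - mean|^2 / (2 sd^2)) / (sd * sqrt(2 pi)).
    static double pdf(double[] x, double[] mean, double stdDev) {
        double sd2 = stdDev * stdDev;
        double exponent = -(dot(x, x) - 2 * dot(x, mean) + dot(mean, mean)) / (2 * sd2);
        return Math.exp(exponent) / (stdDev * SQRT_2PI);
    }
}
```

Calling PdfSketch.pdf with a data vector of any size other than 2 against a size-2 mean fails in the x.dot(mean) term, which is the situation in the stack above. The sketch deliberately keeps the quoted normaliser: dividing by stdDev * sqrt(2 pi) is the normaliser of a one-dimensional normal, whereas a spherical normal over n dimensions would divide by (stdDev * sqrt(2 pi))^n.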
>>>>>>
>>>>>> On Wed, Jan 13, 2010 at 10:13 PM, Ted Dunning wrote:
>>>>>>
>>>>>>> On Wed, Jan 13, 2010 at 11:53 AM, Bogdan Vatkov
>>>>>>> <bogdan.vatkov@gmail.com> wrote:
>>>>>>>
>>>>>>>> Sorry, what does that mean :)?
>>>>>>>
>>>>>>> It means that there is probably a programming bug somewhere. At the
>>>>>>> very least, the program is not robust with respect to strange
>>>>>>> invocations.
>>>>>>>
>>>>>>>> What is a dotted vector? And why aren't they the same?
>>>>>>>
>>>>>>> The dot product is a vector operation that is the sum of the products
>>>>>>> of corresponding elements of the two vectors being operated on. If
>>>>>>> these vectors don't have the same length, then it is an error.
>>>>>>>
>>>>>>>> What should I investigate?
>>>>>>>
>>>>>>> I am not familiar with the code, but if I had time to look, my strategy
>>>>>>> would be to start in the NormalModel and work back up the stack trace
>>>>>>> to find out how the vectors came to be different lengths. No doubt, the
>>>>>>> code in NormalModel will not tell you anything, but you can see which
>>>>>>> vectors are involved, and by walking up the stack you may be able to
>>>>>>> see where they come from.
>>>>>>>
>>>>>>>> I am basically running my complete kmeans scenario (same input data,
>>>>>>>> same number of clusters param, etc.) but just replacing the
>>>>>>>> KmeansDriver.main step with a DirichletDriver.main call. Of course
>>>>>>>> the arguments are adjusted, since kmeans and dirichlet do not have
>>>>>>>> the same arguments.
>>>>>>>
>>>>>>> I would think that this sounds very plausible.
>>>>>>>
>>>>>>>> I am not sure what number I should give for the alpha argument,
>>>>>>>
>>>>>>> Alpha should have a value in the range from 0.01 to 20. I would scan
>>>>>>> with 1, 2, 5 magnitude steps to see what works well for your data
>>>>>>> (i.e. 0.01, 0.02, 0.05, 0.1, 0.2 ... 20). A value of 1 is a fine place
>>>>>>> to start. The effect of different values should be small over a pretty
>>>>>>> wide range.
>>>>>>>
>>>>>>>> iterations and reductions... here is my current argument set:
>>>>>>>>
>>>>>>>> args = new String[] {
>>>>>>>>   "--input",
>>>>>>>>   "/store/dev/inst/mahout-0.2/email-clustering/1-solr-vectors/solr_index.vec",
>>>>>>>>   "--output", config.getClustersDir(),
>>>>>>>>   "--modelClass",
>>>>>>>>   "org.apache.mahout.clustering.dirichlet.models.NormalModelDistribution",
>>>>>>>>   "--maxIter", "15",
>>>>>>>>   "--alpha", "1.0",
>>>>>>>>   "--k", config.getClustersCount(),
>>>>>>>>   "--maxRed", "2"
>>>>>>>> };
>>>>>>>
>>>>>>> Not off-hand.

-- 
Best regards,
Bogdan

--001636c5a8debb034e047d247046--
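
The alpha scan Ted describes in the thread (1, 2, 5 steps per decade from 0.01 to 20) can be generated rather than typed by hand. The following is a sketch only: AlphaGridSketch and alphaGrid are hypothetical names, and how each value is then passed to the "--alpha" argument is left to the caller. BigDecimal is used so that values such as 0.01 come out exactly as written instead of picking up floating-point noise.

```java
import java.math.BigDecimal;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: builds the 1-2-5 grid of candidate alpha values.
final class AlphaGridSketch {
    // Returns m * 10^e for m in {1, 2, 5} and e in [minExp, maxExp], dropping values above max.
    static List<Double> alphaGrid(int minExp, int maxExp, double max) {
        int[] mantissas = {1, 2, 5};
        BigDecimal limit = BigDecimal.valueOf(max);
        List<Double> grid = new ArrayList<>();
        for (int exp = minExp; exp <= maxExp; exp++) {
            for (int m : mantissas) {
                BigDecimal alpha = BigDecimal.valueOf(m).scaleByPowerOfTen(exp);
                if (alpha.compareTo(limit) <= 0) {
                    grid.add(alpha.doubleValue());
                }
            }
        }
        return grid;
    }
}
```

With alphaGrid(-2, 1, 20.0) the grid runs from 0.01 up to 20 in eleven steps; each value would be one clustering run with everything else held fixed.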