Mailing-List: contact mahout-user-help@lucene.apache.org; run by ezmlm
Precedence: bulk
Reply-To: mahout-user@lucene.apache.org
MIME-Version: 1.0
In-Reply-To: <4B4F672B.8020405@windwardsolutions.com>
References: <e9c993f11001121546x4ce65c9byaed0d98de5f2f5ba@mail.gmail.com>
	 <263CFF4F-85F1-4F2E-A030-C2F50770A27E@apache.org>
	 <e9c993f11001131153t68897181l456bde7b4b92af28@mail.gmail.com>
	 <c7d45fc71001131213t5056e95fv9b953ea1b7f98b03@mail.gmail.com>
	 <e9c993f11001131307y55187d36o26fbbdb36aa5d4f4@mail.gmail.com>
	 <4B4E55FB.4010709@windwardsolutions.com>
	 <e9c993f11001131526p4bc0339fv24eb574ef57bde92@mail.gmail.com>
	 <4B4E5C0B.10803@windwardsolutions.com>
	 <e9c993f11001131626p3f2f22caoe651e629271c51dd@mail.gmail.com>
	 <4B4F672B.8020405@windwardsolutions.com>
Date: Thu, 14 Jan 2010 20:56:10 +0200
Message-ID: <e9c993f11001141056x69183843p8c9ed8e6e5ba3bbd@mail.gmail.com>
Subject: Re: CardinalityException in DirichletDriver
From: Bogdan Vatkov <bogdan.vatkov@gmail.com>
To: mahout-user@lucene.apache.org
Content-Type: multipart/alternative; boundary=001636c5a8debb034e047d247046

--001636c5a8debb034e047d247046
Content-Type: text/plain; charset=ISO-8859-1

Unfortunately, I am using private data that I cannot share. I am using
emails indexed by Solr and creating vectors out of them. They work fine
with k-means; I just wanted to try out the Dirichlet algorithm.

On Thu, Jan 14, 2010 at 8:49 PM, Jeff Eastman <jdog@windwardsolutions.com> wrote:

> I gather you are doing text clustering? Are you using one of our example
> datasets or one which is publicly available?
>
>
>
> Bogdan Vatkov wrote:
>
>> Hi Jeff,
>>
>> What kind of details do you need to continue?
>> In the meantime I am going back to kmeans anyway (maybe I should really
>> start by adding canopy to my kmeans-only scenario first ;)).
>>
>> Best regards,
>> Bogdan
>>
>> On Thu, Jan 14, 2010 at 1:49 AM, Jeff Eastman <jdog@windwardsolutions.com> wrote:
>>
>>
>>
>>> I think KMeans and Canopy are the most-used and therefore the most robust.
>>> Dirichlet still has not seen much use beyond some test examples and
>>> NormalModel has at least one known problem (with sample() only returning
>>> the maximum likelihood) that has been reported but never fixed. Can you
>>> point me to the problem you are running so I can try to get up to speed?
>>> It has been some time since I worked in this code but I'm keen to do so
>>> and I have some time to invest.
>>>
>>> Jeff
>>>
>>>
>>>
>>> Bogdan Vatkov wrote:
>>>
>>>
>>>
>>>> But if I am the first one to use Dirichlet, which algorithm is the
>>>> recommended one? Are all the other algorithms better than Dirichlet, so
>>>> no one used it ;)?
>>>>
>>>> On Thu, Jan 14, 2010 at 1:23 AM, Jeff Eastman <jdog@windwardsolutions.com> wrote:
>>>>
>>>>> The NormalModelDistribution seems to still think all the data vectors
>>>>> are size=2. In sampleFromPrior, it is creating models with that size.
>>>>> Subsequently, when you calculate the pdf with your data value (x), the
>>>>> sizes are incompatible. I suggest changing 'DenseVector(2)' to
>>>>> 'DenseVector(n)', where n is your data cardinality. Please also look at
>>>>> the rest of the math in DenseVector with suspicion. AFAIK, you are the
>>>>> first person to try to use Dirichlet.
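
A minimal sketch of the change suggested above, so the prior models are created with the data cardinality instead of a hard-coded 2. This is not the Mahout source: SketchNormalModel and SketchNormalModelDistribution are hypothetical stand-ins, plain double[] arrays stand in for DenseVector, and passing the cardinality through the constructor is an assumption about how one might wire it in.

```java
/** Hypothetical stand-in for NormalModel: a mean vector plus a standard deviation. */
final class SketchNormalModel {
    final double[] mean;
    final double stdDev;

    SketchNormalModel(double[] mean, double stdDev) {
        this.mean = mean;
        this.stdDev = stdDev;
    }
}

/** Hypothetical stand-in for NormalModelDistribution that is told the data cardinality. */
final class SketchNormalModelDistribution {
    private final int cardinality;

    SketchNormalModelDistribution(int cardinality) {
        if (cardinality <= 0) {
            throw new IllegalArgumentException("cardinality must be positive: " + cardinality);
        }
        this.cardinality = cardinality;
    }

    /** Creates howMany prior models whose mean has the same size as the data vectors. */
    SketchNormalModel[] sampleFromPrior(int howMany) {
        SketchNormalModel[] result = new SketchNormalModel[howMany];
        for (int i = 0; i < howMany; i++) {
            // new double[n] is zero-filled, mirroring an empty DenseVector(n)
            result[i] = new SketchNormalModel(new double[cardinality], 1.0);
        }
        return result;
    }
}
```

With this shape, every model mean has the same length as the vectors coming out of the mapper, so the dot products in pdf() no longer see mismatched sizes.
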
>>>>>
>>>>>
>>>>>
>>>>> Bogdan Vatkov wrote:
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>> I see this stack when the size of the mean vector is set to 2:
>>>>>>
>>>>>> Daemon Thread [Thread-9] (Suspended (breakpoint at line 48 in NormalModel))
>>>>>> NormalModel.<init>(Vector, double) line: 48
>>>>>> NormalModelDistribution.sampleFromPrior(int) line: 33
>>>>>> DirichletState<O>.<init>(ModelDistribution<O>, int, double, int, int) line: 48
>>>>>> DirichletDriver.createState(String, int, double) line: 172
>>>>>> DirichletDriver.writeInitialState(String, String, String, int, double) line: 150
>>>>>> DirichletDriver.runJob(String, String, String, int, int, double, int) line: 133
>>>>>> DirichletDriver.main(String[]) line: 109
>>>>>> Clusters.doClustering() line: 244
>>>>>> Clusters.access$0(Clusters) line: 175
>>>>>> Clusters$1.run() line: 148
>>>>>> Thread.run() line: 619
>>>>>>
>>>>>>
>>>>>> public class NormalModelDistribution implements ModelDistribution<Vector> {
>>>>>>   @Override
>>>>>>   public Model<Vector>[] sampleFromPrior(int howMany) {
>>>>>>     Model<Vector>[] result = new NormalModel[howMany];
>>>>>>     for (int i = 0; i < howMany; i++) {
>>>>>>       result[i] = new NormalModel(new DenseVector(2), 1);
>>>>>>     }
>>>>>>     return result;
>>>>>>   }
>>>>>>
>>>>>> and later this vector is dotted to
>>>>>>  @Override
>>>>>>  public double pdf(Vector x) {
>>>>>>  double sd2 = stdDev * stdDev;
>>>>>>  double exp = -(x.dot(x) - 2 * x.dot(mean) + mean.dot(mean)) / (2 * sd2);
>>>>>>  double ex = Math.exp(exp);
>>>>>>  return ex / (stdDev * sqrt2pi);
>>>>>>  }
>>>>>>
>>>>>> x vector which is coming from Hadoop MapRunner through the map function:
>>>>>>
>>>>>>  public void map(WritableComparable<?> key, Vector v,
>>>>>>                OutputCollector<Text, Vector> output, Reporter reporter)
>>>>>>      throws IOException {
>>>>>>
>>>>>>
>>>>>> any idea?
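
For reference, a short worked form of what the pdf() method quoted above computes. The symbols are my own notation, not from the code: x is the data vector, \mu the model mean, \sigma the stdDev field, and n the data cardinality. The last line is the textbook density of a spherical Gaussian in n dimensions, not something taken from the Mahout source.

```latex
\begin{align*}
x \cdot x - 2\, x \cdot \mu + \mu \cdot \mu &= \lVert x - \mu \rVert^2 \\
\mathrm{pdf}(x) &= \frac{1}{\sigma \sqrt{2\pi}}
  \exp\!\left( -\frac{\lVert x - \mu \rVert^2}{2\sigma^2} \right) \\
p_n(x) &= \frac{1}{(\sigma \sqrt{2\pi})^{n}}
  \exp\!\left( -\frac{\lVert x - \mu \rVert^2}{2\sigma^2} \right)
\end{align*}
```

The first two lines only make sense when x and \mu have the same cardinality, which is why a size-2 mean fails against longer data vectors. The second and third lines agree only for n = 1, so the normalizing constant may be one of the places worth checking when the data is not one-dimensional.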
>>>>>>
>>>>>> btw, I am running Mahout 0.2...should I move to 0.3 or to trunk? Is it
>>>>>> safe enough to run against trunk?
>>>>>>
>>>>>> On Wed, Jan 13, 2010 at 10:13 PM, Ted Dunning <ted.dunning@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>> On Wed, Jan 13, 2010 at 11:53 AM, Bogdan Vatkov <bogdan.vatkov@gmail.com> wrote:
>>>>>>>
>>>>>>>> Sorry, what does that mean :)?
>>>>>>>>
>>>>>>> It means that there is probably a programming bug somehow.  At the very
>>>>>>> least, the program is not robust with respect to strange invocations.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> what is a dotted vector? and why aren't they the same?
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>> A dot product is a vector operation that is the sum of the products of
>>>>>>> corresponding elements of the two vectors being operated on.  If these
>>>>>>> vectors don't have the same length, then it is an error.
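
To make the definition above concrete, here is a small self-contained sketch. DotSketch is a hypothetical helper on plain arrays, not Mahout's Vector.dot; the length check plays the role that the CardinalityException plays in the real code.

```java
/** Hypothetical helper illustrating a dot product with a cardinality check. */
final class DotSketch {
    /** Returns the sum of products of corresponding elements of a and b. */
    static double dot(double[] a, double[] b) {
        if (a.length != b.length) {
            // Stands in for the CardinalityException seen in the thread
            throw new IllegalArgumentException(
                "cardinality mismatch: " + a.length + " vs " + b.length);
        }
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }
}
```

For example, dot({1, 2, 3}, {4, 5, 6}) is 1*4 + 2*5 + 3*6 = 32, while a size-2 mean dotted with a longer data vector is rejected.
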
>>>>>>>
>>>>>>>> what should I investigate?
>>>>>>>>
>>>>>>> I am not familiar with the code, but if I had time to look, my strategy
>>>>>>> would be to start in the NormalModel and work back up the stack trace to
>>>>>>> find out how the vectors came to be different lengths.  No doubt, the code
>>>>>>> in NormalModel will not tell you anything, but you can see which vectors
>>>>>>> are involved and by walking up the stack you may be able to see where they
>>>>>>> come from.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> I am basically running my complete kmeans scenario (same input data,
>>>>>>>> same
>>>>>>>> number of clusters param, etc.) but just replacing KmeansDriver.main
>>>>>>>> step
>>>>>>>> with a DirichletDriver.main call...of course the arguments are
>>>>>>>> adjusted
>>>>>>>> since kmeans and dirichlet do not have the same arguments.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>> I would think that this sounds very plausible.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> I am not sure what number I should give for the alpha argument,
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>> Alpha should have a value in the range from 0.01 to 20.  I would scan with
>>>>>>> 1, 2, 5 magnitude steps to see what works well for your data (i.e. 0.01,
>>>>>>> 0.02, 0.05, 0.1, 0.2 ... 20).  A value of 1 is a fine place to start.  The
>>>>>>> effect of different values should be small over a pretty wide range.
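
The 1, 2, 5 scan described above could be generated along these lines. AlphaGrid and its parameter names are hypothetical, and how each value is then passed to the driver (for example through the "--alpha" argument) is left out.

```java
import java.math.BigDecimal;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper that builds a 1-2-5 grid of candidate alpha values. */
final class AlphaGrid {
    /**
     * Returns mantissa * 10^exponent for mantissa in {1, 2, 5} and
     * exponent in [minExponent, maxExponent], keeping values <= upperBound.
     */
    static List<Double> oneTwoFive(int minExponent, int maxExponent, double upperBound) {
        int[] mantissas = {1, 2, 5};
        List<Double> grid = new ArrayList<>();
        for (int exponent = minExponent; exponent <= maxExponent; exponent++) {
            for (int mantissa : mantissas) {
                // BigDecimal avoids floating-point noise such as 0.05000000000000001
                double value = BigDecimal.valueOf(mantissa)
                        .scaleByPowerOfTen(exponent)
                        .doubleValue();
                if (value <= upperBound) {
                    grid.add(value);
                }
            }
        }
        return grid;
    }
}
```

Calling AlphaGrid.oneTwoFive(-2, 1, 20.0) yields the eleven values from 0.01 up to 20, and 1.0 is the suggested place to start.
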
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> iterations
>>>>>>>> and reductions...here is my current argument set:
>>>>>>>>
>>>>>>>> args = new String[] {
>>>>>>>>   "--input",
>>>>>>>>   "/store/dev/inst/mahout-0.2/email-clustering/1-solr-vectors/solr_index.vec",
>>>>>>>>   "--output", config.getClustersDir(),
>>>>>>>>   "--modelClass",
>>>>>>>>   "org.apache.mahout.clustering.dirichlet.models.NormalModelDistribution",
>>>>>>>>   "--maxIter", "15",
>>>>>>>>   "--alpha", "1.0",
>>>>>>>>   "--k", config.getClustersCount(),
>>>>>>>>   "--maxRed", "2"
>>>>>>>> };
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>> Not off-hand.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>>
>>
>
>


-- 
Best regards,
Bogdan

--001636c5a8debb034e047d247046--