mahout-user mailing list archives

From Bogdan Vatkov <bogdan.vat...@gmail.com>
Subject Re: CardinalityException in DirichletDriver
Date Thu, 14 Jan 2010 18:56:10 GMT
Unfortunately, I am using private data which I cannot share. The data are
emails, indexed by Solr, from which I then create vectors. I am using them
with k-means and everything is OK. I just wanted to try out the Dirichlet
algorithm.

On Thu, Jan 14, 2010 at 8:49 PM, Jeff Eastman <jdog@windwardsolutions.com> wrote:

> I gather you are doing text clustering? Are you using one of our example
> datasets or one which is publicly available?
>
>
>
> Bogdan Vatkov wrote:
>
>> Hi Jeff,
>>
>> What kind of details do you need to continue?
>> In the meantime I am going back to kmeans anyway (maybe I really should
>> start by adding canopy to my kmeans-only scenario first ;)).
>>
>> Best regards,
>> Bogdan
>>
>> On Thu, Jan 14, 2010 at 1:49 AM, Jeff Eastman <jdog@windwardsolutions.com> wrote:
>>
>>> I think KMeans and Canopy are the most-used and therefore the most robust.
>>> Dirichlet still has not seen much use beyond some test examples, and
>>> NormalModel has at least one known problem (with sample() only returning
>>> the maximum likelihood) that has been reported but never fixed. Can you
>>> point me to the problem you are running so I can try to get up to speed?
>>> It has been some time since I worked in this code, but I'm keen to do so
>>> and I have some time to invest.
>>>
>>> Jeff
>>>
>>>
>>>
>>> Bogdan Vatkov wrote:
>>>
>>>
>>>
>>>> But am I the first one to use Dirichlet? Which algorithm is the
>>>> recommended one? Are all the other algorithms better than Dirichlet, so
>>>> no one has used it ;)?
>>>>
>>>> On Thu, Jan 14, 2010 at 1:23 AM, Jeff Eastman <jdog@windwardsolutions.com> wrote:
>>>>
>>>>> The NormalModelDistribution seems to still think all the data vectors
>>>>> are size=2. In sampleFromPrior, it is creating models with that size.
>>>>> Subsequently, when you calculate the pdf with your data value (x), the
>>>>> sizes are incompatible. Suggest changing 'DenseVector(2)' to
>>>>> 'DenseVector(n)', where n is your data cardinality. Please also look at
>>>>> the rest of the math in DenseVector with suspicion. AFAIK, you are the
>>>>> first person to try to use Dirichlet.
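
The suggested change from 'DenseVector(2)' to 'DenseVector(n)' can be sketched as follows. This is a minimal, dependency-free illustration, not Mahout code: `NormalModelSketch`, `NormalModelDistributionSketch`, and the `cardinality` constructor parameter are hypothetical stand-ins, and plain `double[]` arrays replace Mahout's vectors so that the example runs on a bare JDK.

```java
// Dependency-free stand-ins; these are NOT Mahout classes. Plain double[]
// arrays replace Vector/DenseVector so the sketch runs on a bare JDK.
final class NormalModelSketch {
    final double[] mean;
    final double stdDev;

    NormalModelSketch(double[] mean, double stdDev) {
        this.mean = mean;
        this.stdDev = stdDev;
    }

    /** Algebraically the same expression as the pdf quoted in this thread. */
    double pdf(double[] x) {
        if (x.length != mean.length) {
            throw new IllegalArgumentException(
                "cardinality mismatch: " + x.length + " vs " + mean.length);
        }
        double squaredDistance = 0.0;
        for (int i = 0; i < x.length; i++) {
            double d = x[i] - mean[i];
            squaredDistance += d * d;
        }
        return Math.exp(-squaredDistance / (2 * stdDev * stdDev))
            / (stdDev * Math.sqrt(2 * Math.PI));
    }
}

final class NormalModelDistributionSketch {
    // Hypothetical parameter: the "n" in DenseVector(n).
    private final int cardinality;

    NormalModelDistributionSketch(int cardinality) {
        this.cardinality = cardinality;
    }

    /** Creates prior models whose all-zero mean has the data's cardinality. */
    NormalModelSketch[] sampleFromPrior(int howMany) {
        NormalModelSketch[] result = new NormalModelSketch[howMany];
        for (int i = 0; i < howMany; i++) {
            result[i] = new NormalModelSketch(new double[cardinality], 1.0);
        }
        return result;
    }

    public static void main(String[] args) {
        NormalModelSketch[] models = new NormalModelDistributionSketch(5).sampleFromPrior(3);
        // A data vector of the same cardinality can now be evaluated.
        System.out.println(models[0].pdf(new double[5]));
    }
}
```

How the value of n would reach the distribution in Mahout 0.2, where the class is named through --modelClass, is not covered by this sketch.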
>>>>>
>>>>>
>>>>>
>>>>> Bogdan Vatkov wrote:
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>> I see this stack at the point where the size of the mean vector is set to 2:
>>>>>>
>>>>>> Daemon Thread [Thread-9] (Suspended (breakpoint at line 48 in NormalModel))
>>>>>>   NormalModel.<init>(Vector, double) line: 48
>>>>>>   NormalModelDistribution.sampleFromPrior(int) line: 33
>>>>>>   DirichletState<O>.<init>(ModelDistribution<O>, int, double, int, int) line: 48
>>>>>>   DirichletDriver.createState(String, int, double) line: 172
>>>>>>   DirichletDriver.writeInitialState(String, String, String, int, double) line: 150
>>>>>>   DirichletDriver.runJob(String, String, String, int, int, double, int) line: 133
>>>>>>   DirichletDriver.main(String[]) line: 109
>>>>>>   Clusters.doClustering() line: 244
>>>>>>   Clusters.access$0(Clusters) line: 175
>>>>>>   Clusters$1.run() line: 148
>>>>>>   Thread.run() line: 619
>>>>>>
>>>>>>
>>>>>> public class NormalModelDistribution implements ModelDistribution<Vector> {
>>>>>>   @Override
>>>>>>   public Model<Vector>[] sampleFromPrior(int howMany) {
>>>>>>     Model<Vector>[] result = new NormalModel[howMany];
>>>>>>     for (int i = 0; i < howMany; i++) {
>>>>>>       result[i] = new NormalModel(new DenseVector(2), 1);
>>>>>>     }
>>>>>>     return result;
>>>>>>   }
>>>>>>
>>>>>> and later this vector is dotted with x:
>>>>>>   @Override
>>>>>>   public double pdf(Vector x) {
>>>>>>     double sd2 = stdDev * stdDev;
>>>>>>     double exp = -(x.dot(x) - 2 * x.dot(mean) + mean.dot(mean)) / (2 * sd2);
>>>>>>     double ex = Math.exp(exp);
>>>>>>     return ex / (stdDev * sqrt2pi);
>>>>>>   }
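
The exponent in the quoted pdf() is the squared Euclidean distance between x and mean, expanded into dot products, which is why both vectors must have the same cardinality. As a sketch, writing \mu for mean and \sigma for stdDev (notation introduced here, not taken from the code):

```latex
\[
x \cdot x - 2\, x \cdot \mu + \mu \cdot \mu = \lVert x - \mu \rVert^{2},
\qquad
\mathrm{pdf}(x) = \frac{1}{\sigma\sqrt{2\pi}}
  \exp\!\left(-\frac{\lVert x - \mu \rVert^{2}}{2\sigma^{2}}\right)
\]
```

For comparison, the density of a spherical normal distribution in n dimensions is normalized by (\sigma\sqrt{2\pi})^n, so the quoted normalizer matches it only when n = 1.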
>>>>>>
>>>>>> The x vector comes from the Hadoop MapRunner through the map function:
>>>>>>
>>>>>>   public void map(WritableComparable<?> key, Vector v,
>>>>>>                   OutputCollector<Text, Vector> output, Reporter reporter)
>>>>>>       throws IOException {
>>>>>>
>>>>>>
>>>>>> Any idea?
>>>>>>
>>>>>> Btw, I am running Mahout 0.2... should I move to 0.3 or to trunk? Is it
>>>>>> safe enough to run against trunk?
>>>>>>
>>>>>> On Wed, Jan 13, 2010 at 10:13 PM, Ted Dunning <ted.dunning@gmail.com> wrote:
>>>>>>
>>>>>>> On Wed, Jan 13, 2010 at 11:53 AM, Bogdan Vatkov <bogdan.vatkov@gmail.com> wrote:
>>>>>>>
>>>>>>>> Sorry, what does that mean :)?
>>>>>>>>
>>>>>>> It means that there is probably a programming bug somehow. At the very
>>>>>>> least, the program is not robust with respect to strange invocations.
>>>>>>>
>>>>>>>> What is a dotted vector? And why aren't they the same?
>>>>>>>>
>>>>>>> The dot product is a vector operation that is the sum of products of
>>>>>>> corresponding elements of the two vectors being operated on. If these
>>>>>>> vectors don't have the same length, then it is an error.
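
The dot product described above, together with its length check, can be sketched in plain Java. This is an illustration only: `DotProductSketch` is a hypothetical name, `double[]` arrays stand in for Mahout's vectors, and a standard `IllegalArgumentException` stands in for the CardinalityException named in the subject line.

```java
// Stand-alone illustration; plain arrays stand in for Mahout's Vector type.
final class DotProductSketch {

    /** Sum of products of corresponding elements; lengths must match. */
    static double dot(double[] a, double[] b) {
        if (a.length != b.length) {
            // A standard exception keeps the sketch dependency-free; Mahout
            // reports this condition as a CardinalityException.
            throw new IllegalArgumentException(
                "cardinality mismatch: " + a.length + " vs " + b.length);
        }
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }

    public static void main(String[] args) {
        // 1*4 + 2*5 + 3*6
        System.out.println(dot(new double[] {1, 2, 3}, new double[] {4, 5, 6}));
        try {
            // A size-2 model mean against a longer data vector fails.
            dot(new double[2], new double[5]);
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```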
>>>>>>>
>>>>>>>> What should I investigate?
>>>>>>>>
>>>>>>> I am not familiar with the code, but if I had time to look, my strategy
>>>>>>> would be to start in the NormalModel and work back up the stack trace to
>>>>>>> find out how the vectors came to be different lengths. No doubt, the code
>>>>>>> in NormalModel will not tell you anything, but you can see which vectors
>>>>>>> are involved, and by walking up the stack you may be able to see where
>>>>>>> they come from.
>>>>>>>
>>>>>>>> I am basically running my complete kmeans scenario (same input data,
>>>>>>>> same number of clusters param, etc.) but just replacing the
>>>>>>>> KmeansDriver.main step with a DirichletDriver.main call... of course
>>>>>>>> the arguments are adjusted since kmeans and dirichlet do not have the
>>>>>>>> same arguments.
>>>>>>>>
>>>>>>> I would think that this sounds very plausible.
>>>>>>>
>>>>>>>> I am not sure what number I should give for the alpha argument,
>>>>>>>>
>>>>>>> Alpha should have a value in the range from 0.01 to 20. I would scan
>>>>>>> with 1, 2, 5 magnitude steps to see what works well for your data
>>>>>>> (i.e. 0.01, 0.02, 0.05, 0.1, 0.2 ... 20). A value of 1 is a fine place
>>>>>>> to start. The effect of different values should be small over a pretty
>>>>>>> wide range.
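
The 1, 2, 5 scan over alpha can be generated rather than typed by hand. The sketch below is an assumption about how one might do that: `AlphaGridSketch` and its `grid` method are hypothetical names, and `BigDecimal` is used only so that values such as 0.05 come out exact rather than accumulating floating-point error.

```java
import java.math.BigDecimal;
import java.util.ArrayList;
import java.util.List;

final class AlphaGridSketch {

    /**
     * Returns 1, 2 and 5 times each power of ten from 10^minExponent to
     * 10^maxExponent, dropping any value greater than max.
     */
    static List<Double> grid(int minExponent, int maxExponent, double max) {
        BigDecimal limit = BigDecimal.valueOf(max);
        int[] mantissas = {1, 2, 5};
        List<Double> values = new ArrayList<>();
        for (int exponent = minExponent; exponent <= maxExponent; exponent++) {
            for (int mantissa : mantissas) {
                BigDecimal value = BigDecimal.valueOf(mantissa).scaleByPowerOfTen(exponent);
                if (value.compareTo(limit) <= 0) {
                    values.add(value.doubleValue());
                }
            }
        }
        return values;
    }

    public static void main(String[] args) {
        // Covers the suggested range from 0.01 to 20.
        for (double alpha : grid(-2, 1, 20.0)) {
            System.out.println(alpha);
        }
    }
}
```

Each value could then be passed as the "--alpha" argument of a separate run, starting from 1 as suggested.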
>>>>>>>
>>>>>>>> iterations and reductions... here is my current argument set:
>>>>>>>>
>>>>>>>> args = new String[] {
>>>>>>>>   "--input",
>>>>>>>>   "/store/dev/inst/mahout-0.2/email-clustering/1-solr-vectors/solr_index.vec",
>>>>>>>>   "--output", config.getClustersDir(),
>>>>>>>>   "--modelClass",
>>>>>>>>   "org.apache.mahout.clustering.dirichlet.models.NormalModelDistribution",
>>>>>>>>   "--maxIter", "15",
>>>>>>>>   "--alpha", "1.0",
>>>>>>>>   "--k", config.getClustersCount(),
>>>>>>>>   "--maxRed", "2"
>>>>>>>> };
>>>>>>>>
>>>>>>> Not off-hand.


-- 
Best regards,
Bogdan
