mahout-user mailing list archives

From Ted Dunning <ted.dunn...@gmail.com>
Subject Re: CardinalityException in DirichletDriver
Date Tue, 19 Jan 2010 00:54:39 GMT
Hijacking the sparse vectorizer from the SGD patch might help with this.
Likewise, using an L-1 model distribution would enforce sparseness by nature
(I think).  Sampling from the L-1 prior might be a bit of a trip.
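
A minimal sketch of the L-1 idea, not Mahout code: the class and method names below are hypothetical. It assumes a zero-centered Laplace (L-1) prior and a Gaussian likelihood. Under those assumptions the MAP estimate of a coordinate is the soft-thresholded observation, which is exactly zero whenever the observation is smaller than the threshold. A raw draw from the Laplace prior, by contrast, is almost never exactly zero, so sampling alone would not keep a model sparse.

```java
import java.util.Random;

// Hypothetical illustration only; none of these names exist in Mahout.
public class LaplaceSketch {

    /** Draws from Laplace(0, scale) as the scaled difference of two Exp(1) draws. */
    public static double sampleLaplace(Random rng, double scale) {
        // 1 - nextDouble() lies in (0, 1], so the logarithm is always finite.
        double e1 = -Math.log(1.0 - rng.nextDouble());
        double e2 = -Math.log(1.0 - rng.nextDouble());
        return scale * (e1 - e2);
    }

    /** Soft-thresholding: shrinks x toward zero by lambda, clamping at zero. */
    public static double softThreshold(double x, double lambda) {
        return Math.signum(x) * Math.max(Math.abs(x) - lambda, 0.0);
    }

    public static void main(String[] args) {
        Random rng = new Random(42L);
        int zeros = 0;
        for (int i = 0; i < 1000; i++) {
            // Count exact zeros after shrinking prior draws by a fixed threshold.
            if (softThreshold(sampleLaplace(rng, 1.0), 0.5) == 0.0) {
                zeros++;
            }
        }
        System.out.println("exact zeros after thresholding: " + zeros + " of 1000");
    }
}
```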

On Mon, Jan 18, 2010 at 4:27 PM, Jeff Eastman <jdog@windwardsolutions.com> wrote:

> I think you will need to bound your model dimensionality to use Dirichlet.
> If you are using TF-IDF vectors to represent your documents I would think
> these would all have the same maximum cardinality which you could specify
> for the modelPrototype size. I just committed a new model distribution
> (SparseNormalModelDistribution) that includes a heuristic
> sampleFromPosterior() to remove small mean element values to preserve model
> sparseness. It's probably bogus but a place to begin.
>
> I have also written one new unit test that runs in memory over a small,
> 50-d sparse model and 100, 50-d sparse vectors. It does not explode.
>
> Just do another update before you begin to pick up those changes.
>
>
> Bogdan Vatkov wrote:
>
>> Well, dimensions - I am just using a slightly modified version of
>> LuceneDriver (added stopword removal and regex removal of incoming
>> terms), so I guess it is just a list of unidimensional vectors of
>> random length.
>> I will try to run the new code tomorrow.
>>
>> On Mon, Jan 18, 2010 at 10:18 PM, Jeff Eastman <jdog@windwardsolutions.com> wrote:
>>
>>> Yes, they're all in trunk. Just do an svn update and mvn install to get
>>> them.
>>>
>>> BTW, what's the dimensionality of your data?
>>>
>>> Jeff
>>>
>>> Bogdan Vatkov wrote:
>>>
>>>> Hi Jeff,
>>>>
>>>> I will try the NormalModelDistribution, but I am wondering how to
>>>> obtain "MAHOUT-251". Is it a tag in SVN? How can I get the source
>>>> containing the changes; do I simply sync from trunk? I suppose I
>>>> have to run mvn install after that, right?
>>>>
>>>> Best regards,
>>>> Bogdan
>>>>
>>>> On Mon, Jan 18, 2010 at 9:53 PM, Jeff Eastman <jdog@windwardsolutions.com> wrote:
>>>>
>>>>> Bogdan,
>>>>>
>>>>> Recent resolution of MAHOUT-251 should allow you to experiment with
>>>>> Dirichlet clustering for text models with arbitrary dimensionality. I
>>>>> suggest starting with the NormalModelDistribution with a large sparse
>>>>> vector as its prototype. The other model distributions create sampled
>>>>> values for all the prior model dimensions, negating any value of using
>>>>> sparse vectors for their prototypes.
>>>>>
>>>>> It may in fact be necessary to introduce a new ModelDistribution and
>>>>> Model so that sparse model elements will not fill up with insignificant
>>>>> values. After the first iteration computes the new posterior model
>>>>> parameters from the observations, many of these values will likely be
>>>>> small, so some heuristic would be needed to preserve model sparseness by
>>>>> removing them altogether. If all these values are retained, it is
>>>>> probably better to use a dense vector representation. A 50k-dimensional
>>>>> model will be a real compute hog if it is not kept sparse somehow. Maybe
>>>>> sampleFromPosterior() or sample() would be good places to embed this
>>>>> heuristic.
>>>>>
>>>>> I'll begin writing some tests to experiment with these models.
>>>>>
>
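
The pruning heuristic discussed in the quoted messages (dropping small mean elements after the posterior update so a sparse model stays sparse) can be sketched roughly as follows. This is an assumption-laden illustration, not the code in SparseNormalModelDistribution: the class name, the use of a plain Map from index to value as the sparse vector, and the fixed threshold are all hypothetical.

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

// Hypothetical illustration only; this is not Mahout's implementation.
public class SparseMeanPruner {

    /**
     * Removes every entry whose absolute value is below the threshold.
     * The map stands in for a sparse mean vector (index to value).
     * Returns the number of entries removed.
     */
    public static int prune(Map<Integer, Double> mean, double threshold) {
        int removed = 0;
        Iterator<Map.Entry<Integer, Double>> it = mean.entrySet().iterator();
        while (it.hasNext()) {
            if (Math.abs(it.next().getValue()) < threshold) {
                it.remove(); // safe removal while iterating
                removed++;
            }
        }
        return removed;
    }

    public static void main(String[] args) {
        // A small sparse mean with two significant and two negligible elements.
        Map<Integer, Double> mean = new HashMap<>();
        mean.put(3, 0.8);
        mean.put(17, 1e-6);
        mean.put(42, -0.5);
        mean.put(49, -1e-9);
        int removed = prune(mean, 1e-3);
        System.out.println(removed + " removed, " + mean.size() + " kept");
    }
}
```

A fixed cutoff is the simplest choice; a threshold relative to the largest element, or to the element's estimated variance, may behave better on TF-IDF data, but that is untested here.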


-- 
Ted Dunning, CTO
DeepDyve
