mahout-user mailing list archives

From Robin Anil <>
Subject Re: n-gram and ml
Date Sat, 09 Jun 2012 22:51:20 GMT
Pat, my guess is that the issue is that some of your docs contain "only"
high-frequency words. These could be junk documents, and they may be
causing trouble for you while clustering. The recent patch that Jeff
checked in should help alleviate that issue. My thought is to completely
exclude empty vectors in the encoder job. For you, things might
just work OK now that the original distance measure bug is fixed.

Robin Anil
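
Robin's point about documents that contain only high-frequency words can be illustrated with a small sketch. This is plain Python, not Mahout's implementation: the function name `prune_by_max_df` and the toy corpus are hypothetical, and the rule used (drop a term that appears in more than the given percentage of documents) is simply the reading of x = 40 described in the quoted message below.

```python
from collections import Counter

def prune_by_max_df(docs, max_df_percent):
    """Drop every term that appears in more than max_df_percent of the docs.

    Hypothetical helper for illustration only; not Mahout code.
    """
    n_docs = len(docs)
    # Document frequency: count each term once per document.
    df = Counter(term for doc in docs for term in set(doc))
    keep = {t for t, c in df.items() if 100.0 * c / n_docs <= max_df_percent}
    return [[t for t in doc if t in keep] for doc in docs]

# Toy corpus (made up): two documents contain only very common words.
docs = [
    ["the", "of", "mahout", "clustering"],
    ["the", "of", "and"],
    ["the", "and", "vector", "encoder"],
    ["of", "and", "ngram"],
    ["the", "of", "and", "the"],
]

pruned = prune_by_max_df(docs, max_df_percent=40)

# Excluding empty vectors, as suggested above for the encoder job.
non_empty = [doc for doc in pruned if doc]
```

In this toy corpus the second and fifth documents lose every term after pruning, so only three documents remain once empty ones are excluded.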

On Sat, Jun 9, 2012 at 5:03 PM, Pat Ferrel <> wrote:

> OK, thanks. I'm trying to find ways to reduce dimensionality in some
> reasonable way before proceeding to more heavyweight methods.
> So my understanding of seq2sparse n-grams seems to be correct. I don't
> want many. With ml set to 200 I get some nonsensical ones; maybe 2000 is too
> high. I think MiA mentions 1000 as a pretty high value.
> As to df pruning, I thought x = 40 meant that if a term appeared in more
> than 40% of the docs it was removed. For my 150,000-page crawl it didn't
> seem like an unreasonable number. If intuition says differently, what
> would be a good number? Maybe I should use maxDFSigma instead, maybe set
> to 3.0 as the help suggests?
> On 6/9/12 11:39 AM, Robin Anil wrote:
>> ------
>> Robin Anil
>> On Sat, Jun 9, 2012 at 10:27 AM, Pat Ferrel <> wrote:
>>> As I understand it, using seq2sparse with ng = 2 and ml = some large
>>> number will never create a vector with fewer terms than words (all
>>> other parts of the algorithm set aside). In other words, ng = 2 and
>>> ml = 2000 will create very few n-grams but will never create a 0-length
>>> vector unless there are no terms to begin with.
>>> Is this correct?
>>> I ask because it looks like many of my n-grams are not really helpful, so
>>> I keep tuning the ml upwards, but Robin made a comment that this might
>>> cause 0-length vectors, in which case I might want to stop using n-grams.
>> You didn't quite get me.
>> I meant ml = minimum log-likelihood threshold. A bigram with a
>> log-likelihood of 1.0 is quite a significant n-gram. If you say ml > 2000,
>> there might not be any n-gram that has such a score. Secondly, df pruning
>> of 40% along with an ml > 200 threshold is creating vectors in your
>> dataset devoid of features, i.e. empty vectors.
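
To make the ml threshold discussed above more concrete, here is a minimal sketch of the standard log-likelihood ratio (G²) statistic for a 2x2 bigram contingency table. That seq2sparse scores bigrams with exactly this formula is an assumption here, and the names `llr`, `entropy`, and `x_log_x` as well as the counts are hypothetical.

```python
import math

def x_log_x(x):
    # Convention: 0 * log(0) is treated as 0.
    return 0.0 if x == 0 else x * math.log(x)

def entropy(*counts):
    # Unnormalized entropy of a set of counts.
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)

def llr(k11, k12, k21, k22):
    """Log-likelihood ratio for a 2x2 table of bigram counts.

    k11: A followed by B       k12: A followed by not-B
    k21: not-A followed by B   k22: not-A followed by not-B
    """
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    # Clamp tiny negative values caused by floating-point rounding.
    return max(0.0, 2.0 * (row + col - mat))

def keep_bigram(k11, k12, k21, k22, min_llr):
    # A bigram survives only if its score reaches the threshold.
    return llr(k11, k12, k21, k22) >= min_llr

independent = llr(10, 10, 10, 10)   # words co-occur exactly as chance predicts
associated = llr(100, 0, 0, 100)    # words always appear together
```

Under these made-up counts, the perfectly associated pair would pass a threshold of 200 but not 2000, which matches the concern that a very large ml may leave no bigrams at all.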
