mahout-user mailing list archives

From Ted Dunning <>
Subject Re: n-gram over-representation?
Date Tue, 16 Feb 2010 19:09:55 GMT
This comparison is very interesting when made against a general corpus or
against a specific sub-corpus already in your data.

You will often find that an n-gram is in one corpus and not in another, but
the question becomes how often this happens (i.e. does LLR say that it
happens often enough to be interesting).  The max over the scores of many
such comparisons then becomes the interesting number.
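
As a rough sketch of the scoring described above, assuming plain n-gram
counts per corpus: the 2x2 table holds the n-gram's count and the count of
all other n-grams in each of the two corpora, and the score is the usual
G^2 log-likelihood ratio. The names llr, ngram_llr and max_llr are
made up for illustration and are not Mahout's API.

```python
import math


def llr(k11, k12, k21, k22):
    """G^2 log-likelihood ratio for a 2x2 table of counts.

    k11: n-gram count in corpus A      k12: all other n-grams in corpus A
    k21: n-gram count in corpus B      k22: all other n-grams in corpus B
    """
    total = k11 + k12 + k21 + k22
    row_a, row_b = k11 + k12, k21 + k22
    col_hit, col_miss = k11 + k21, k12 + k22
    cells = (
        (k11, row_a, col_hit),
        (k12, row_a, col_miss),
        (k21, row_b, col_hit),
        (k22, row_b, col_miss),
    )
    score = 0.0
    for k, row, col in cells:
        if k > 0:  # empty cells contribute nothing (0 * log 0 is taken as 0)
            score += k * math.log(k * total / (row * col))
    # Clamp tiny negative values caused by floating-point rounding.
    return max(0.0, 2.0 * score)


def ngram_llr(count_a, total_a, count_b, total_b):
    """Score one n-gram given its count and the total n-gram count per corpus."""
    return llr(count_a, total_a - count_a, count_b, total_b - count_b)


def max_llr(ngram, target, references):
    """Max score of one n-gram over comparisons with several reference corpora.

    target and each entry of references are (counts_dict, total) pairs;
    this layout is an assumption made for the sketch.
    """
    counts, total = target
    return max(
        ngram_llr(counts.get(ngram, 0), total, ref_counts.get(ngram, 0), ref_total)
        for ref_counts, ref_total in references
    )
```

Note that the score is symmetric: it says how surprising the difference in
rates is, not which corpus has the higher rate, so over-representation
needs a separate check of the two relative frequencies.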

On Tue, Feb 16, 2010 at 11:01 AM, Drew Farris <> wrote:

> I also was wondering if comparing the ngrams found in this corpus
> against a general corpus could be a worthwhile endeavor? Some quick
> and dirty work suggests that the overlap in n-grams between this
> domain-specific corpus and a general one is pretty low.

Ted Dunning, CTO
