mahout-user mailing list archives

From Ted Dunning <>
Subject Re: n-gram over-representation?
Date Tue, 16 Feb 2010 18:38:38 GMT
I think that as far as pure corpus analysis is concerned, LLR, min/max DF
and tf-idf are about as good as you will get.  Tf-idf is, in fact, an
approximation of LLR, so I don't think you even need tf-idf (and it is
document-centric rather than corpus-centric in any case).  You might get
some mileage out of looking for terms whose LLR varies widely from one
document to another.
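
For reference, here is a minimal sketch of the LLR score for a 2x2 contingency
table of counts, written in the entropy form of the G^2 statistic. It is an
illustration under stated assumptions, not necessarily the code Mahout runs:
the function names and the meaning assigned to the four counts (k11 = both
words together, k12 and k21 = one word without the other, k22 = neither) are
choices made for this example.

```python
import math


def x_log_x(x):
    # x * ln(x), with the usual convention that 0 * ln(0) = 0.
    return 0.0 if x == 0 else x * math.log(x)


def unnormalized_entropy(*counts):
    # N * ln(N) - sum(k * ln(k)), i.e. Shannon entropy scaled by the total N.
    return x_log_x(sum(counts)) - sum(x_log_x(k) for k in counts)


def llr_2x2(k11, k12, k21, k22):
    # Hypothetical helper: log-likelihood ratio (G^2) for a 2x2 count table.
    # k11: A and B together, k12: A without B,
    # k21: B without A,      k22: neither A nor B.
    row = unnormalized_entropy(k11 + k12, k21 + k22)
    col = unnormalized_entropy(k11 + k21, k12 + k22)
    mat = unnormalized_entropy(k11, k12, k21, k22)
    # Clamp tiny negative values caused by floating-point round-off.
    return max(0.0, 2.0 * (row + col - mat))


if __name__ == "__main__":
    # Independent counts score near zero; strongly associated counts score high.
    print(llr_2x2(10, 10, 10, 10))
    print(llr_2x2(100, 0, 0, 100))
```

Scoring the same term pair separately within each document and then looking
at the spread of those scores is one way to probe the "highly variable LLR"
idea; how to aggregate the per-document scores is left open here.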

To get a substantial improvement over these measures, I would recommend
adding new data to the mix.  The new data I would look at first is some sort
of user behavior history.  Do you have anything like that?

On Tue, Feb 16, 2010 at 10:22 AM, Drew Farris <> wrote:

> Yes, I'm using the LLR score. I was wondering if there is anything
> else I should be looking at other than LLR and min/max DF.

Ted Dunning, CTO
