mahout-user mailing list archives

From Drew Farris <drew.far...@gmail.com>
Subject Re: Collocations in Mahout?
Date Sat, 09 Jan 2010 01:13:42 GMT
On Fri, Jan 8, 2010 at 12:06 AM, Robin Anil <robin.anil@gmail.com> wrote:

> I like the Formulation that Drew made, using n-1 grams to generate n-grams.

I think Ted first mentioned n-1 grams, and I ran with it. It is very
useful to think about the problem this way.

One question about the concept of n-1 grams, however. When n is 3, for
example, are we really interested in the collocation of bigrams, or
are we interested in non-overlapping tokens? For example, given the
tri-gram 'click and clack', should we be looking at 'click and' and
'and clack', or should we be analyzing 'click' with 'and clack', or
'click and' with 'clack'? I suspect it is the first form, because that
extends easily to values larger than 3, but it's worth confirming.
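
To make the two readings concrete, here is a rough Python sketch. It is
only an illustration of the question, not anything taken from Mahout; the
function names overlapping_subgrams and non_overlapping_splits are made up
for this example.

```python
def overlapping_subgrams(tokens):
    """First reading: the leading and trailing (n-1)-grams of an n-gram.

    The two pieces share the middle n-2 tokens.
    """
    if len(tokens) < 2:
        raise ValueError("need at least two tokens")
    return tuple(tokens[:-1]), tuple(tokens[1:])


def non_overlapping_splits(tokens):
    """Second reading: every cut of the n-gram into a head and a tail
    that share no tokens."""
    return [
        (tuple(tokens[:i]), tuple(tokens[i:]))
        for i in range(1, len(tokens))
    ]


trigram = ["click", "and", "clack"]

# ('click', 'and') and ('and', 'clack')
print(overlapping_subgrams(trigram))

# ('click',) with ('and', 'clack'), then ('click', 'and') with ('clack',)
print(non_overlapping_splits(trigram))
```

With the first form there is always exactly one pair of parts, whatever
n is. With the second there are n-1 candidate splits to score, which is
part of why the first form looks simpler to extend.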
