mahout-user mailing list archives

From Ted Dunning
Subject Re: Decompose Compound Words?
Date Tue, 28 Jul 2009 05:47:59 GMT
This is pretty easily done with a language model with at least reasonable

The basic idea is to use a noisy channel model to say that there is an
underlying "true" query that is corrupted by a noise process to get what we
observe.  We want to find the most likely "true" query.  With some simple
assumptions about the noise model, we can estimate a simple language model
and the parameters of the noise model.  This allows us to reconstruct an
estimate of the "true" query for each novel query that we get.
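For the compound-word case in the quoted message, here is a minimal
sketch of that idea, assuming the only corruption is that spaces get
deleted. The function names, the unigram language model, the per-character
penalty for unseen words and the value of p_space_deleted are all
illustrative assumptions, not something taken from a library. The scores
are crude and not properly normalized.

```python
import math
from collections import Counter

# Assumed penalty per character for words never seen in the query log.
UNSEEN_PER_CHAR = math.log(1e-3)

def train_unigram(queries):
    """Language model: word counts taken straight from observed queries."""
    counts = Counter(w for q in queries for w in q.split())
    return counts, sum(counts.values())

def log_p_word(word, counts, total):
    """Log score of a word; unseen words are penalized by their length."""
    if counts.get(word):
        return math.log(counts[word] / total)
    return len(word) * UNSEEN_PER_CHAR

def decompose(token, counts, total, p_space_deleted=0.05):
    """Most likely 'true' word sequence behind one observed token.

    Noise model (an assumption): every space in the true query was
    deleted independently with probability p_space_deleted.
    """
    n = len(token)
    # best[i] = (best score for token[:i], start index of the last word)
    best = [(0.0, 0)] + [(-math.inf, 0)] * n
    for end in range(1, n + 1):
        for start in range(end):
            score = best[start][0] + log_p_word(token[start:end], counts, total)
            if start > 0:
                # A split here means one space was lost to the noise process.
                score += math.log(p_space_deleted)
            if score > best[end][0]:
                best[end] = (score, start)
    # Follow the back-pointers to recover the words.
    words, end = [], n
    while end > 0:
        start = best[end][1]
        words.append(token[start:end])
        end = start
    return words[::-1]

queries = ["marginal growth", "marginal cost", "growth rate",
           "marginal growth rate"]
counts, total = train_unigram(queries)
print(decompose("marginalgrowth", counts, total))
```

A token that is already a known word, or that contains no known words at
all, is left whole, because every split pays the space-deletion penalty.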

In practice, the noise model does not usually cause massive changes in the
query.  That means that as a first approximation, we can use the observed
queries to initialize the language model.  Then we can alternate between
fitting a noise model (using queries held out from the language model
estimation) and deriving a sharpened estimate of the language model.
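The alternation could look roughly like the sketch below. It is a hard-EM
style loop under the same assumed space-deletion noise model; fit, decode,
the single noise parameter p_del, its starting value and the add-one
smoothing are hypothetical choices made for illustration only.

```python
import math
from collections import Counter

UNSEEN_PER_CHAR = math.log(1e-3)  # assumed per-character penalty for unseen words

def word_score(word, counts, total):
    if counts.get(word):
        return math.log(counts[word] / total)
    return len(word) * UNSEEN_PER_CHAR

def decode(query, counts, total, p_del):
    """Best guess at the 'true' words, allowing each token to hide deleted spaces."""
    out = []
    for token in query.split():
        n = len(token)
        best = [(0.0, 0)] + [(-math.inf, 0)] * n
        for end in range(1, n + 1):
            for start in range(end):
                s = best[start][0] + word_score(token[start:end], counts, total)
                if start > 0:
                    s += math.log(p_del)  # one space lost at this split
                if s > best[end][0]:
                    best[end] = (s, start)
        words, end = [], n
        while end > 0:
            start = best[end][1]
            words.append(token[start:end])
            end = start
        out.extend(reversed(words))
    return out

def fit(train_queries, heldout_queries, rounds=3, p_del=0.2):
    # First approximation: treat the observed queries as uncorrupted.
    counts = Counter(w for q in train_queries for w in q.split())
    for _ in range(rounds):
        total = sum(counts.values())
        # Noise step: on held-out queries, count how many spaces went missing.
        deleted = boundaries = 0
        for q in heldout_queries:
            truth = decode(q, counts, total, p_del)
            boundaries += max(len(truth) - 1, 0)
            deleted += len(truth) - len(q.split())
        p_del = (deleted + 1) / (boundaries + 2)  # smoothing keeps 0 < p_del < 1
        # Language step: re-count words from the decoded training queries.
        new_counts = Counter()
        for q in train_queries:
            new_counts.update(decode(q, counts, total, p_del))
        counts = new_counts
    return counts, p_del

train = ["marginal growth"] * 20 + ["marginalgrowth", "growth rate", "marginal cost"]
heldout = ["marginalgrowth rate", "marginal growth", "marginal cost"]
counts, p_del = fit(train, heldout)
print(counts.most_common(), p_del)
```

Note that a compound which appears in the training queries only gets split
once its parts are frequent enough to outweigh the deletion penalty, which
is one place where a few hand-set expectations can help.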

For many of these spelling correction problems, the initial estimates of the
language and noise models are good enough to use as is.

A modest application of heuristics is often very helpful.  These usually
concern the form of acceptable corruptions that might be present in the
noise process, strong expectations about the frequency of certain
corruptions, or a set of hand-annotated queries.

On Mon, Jul 27, 2009 at 10:38 PM, Jason Rutherglen wrote:

> While not a machine learning problem, decomposing compound words
> (marginalgrowth-> marginal growth) with Hadoop is useful in a
> large search app? Lucene has DictionaryCompoundWordTokenFilter
> however for a larger corpus it seems one would build the
> dictionary first (i.e. build an index), then use the terms
> dictionary to execute as the source for decomposing (and
> probably not all the terms?).
> 41,100 results
> 8,390,000 results
> "marginal+growth" 41,100 results
> Looks like they're decomposing the query into a phrase query.
> Probably a key -> value lookup on marginalgrowth.

Ted Dunning, CTO
