This can be done fairly easily with a language model, with at least reasonable results.

The basic idea is to use a noisy channel model: there is an underlying "true" query that is corrupted by a noise process to produce what we observe, and we want to find the most likely "true" query. With some simple assumptions about the noise model, we can estimate a simple language model and the parameters of the noise model. This allows us to reconstruct an estimate of the "true" query for each novel query that we get.

In practice, the noise process does not usually cause massive changes in the query. That means that, as a first approximation, we can use the observed queries to initialize the language model. Then we can alternate between fitting a noise model (using queries held out from the language model estimation) and deriving a sharpened estimate of the language model.

For many of these spelling correction problems, the initial estimates of the language and noise models are good enough to use as is. A modest application of heuristics is often very helpful. These usually concern the form of acceptable corruptions that might be present in the noise process, or strong expectations about the frequency of certain corruptions. Some hand-annotated queries help as well.

On Mon, Jul 27, 2009 at 10:38 PM, Jason Rutherglen <jason.rutherglen@gmail.com> wrote:

> While not a machine learning problem, decomposing compound words
> (marginalgrowth-> marginal growth) with Hadoop is useful in a
> large search app? Lucene has DictionaryCompoundWordTokenFilter
> however for a larger corpus it seems one would build the
> dictionary first (i.e. build an index), then use the terms
> dictionary to execute as the source for decomposing (and
> probably not all the terms?).
>
> http://www.google.com/search?q=marginalgrowth 41,100 results
> http://www.google.com/search?q=marginal+growth 8,390,000 results
> http://www.google.com/search?q="marginal+growth" 41,100 results
>
> Looks like they're decomposing the query into a phrase query.
> Probably a key -> value lookup on marginalgrowth.
>

--
Ted Dunning, CTO
DeepDyve
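
As a rough illustration of the noisy channel idea described in the reply, here is a minimal Python sketch applied to the compound-splitting example from the quoted question. It is only a sketch under stated assumptions, not the method used by any particular search engine. The language model is a smoothed unigram model built from observed queries, and the noise model has a single assumed corruption (a dropped space) with a hand-picked probability `p_drop`. The function names, the toy queries, and the value of `p_drop` are all hypothetical.

```python
import math
from collections import Counter


def train_unigram(queries):
    """Count words in the observed queries (first-pass language model)."""
    counts = Counter(word for q in queries for word in q.split())
    return counts, sum(counts.values())


def log_p_word(word, counts, total, alpha=1.0):
    """Add-alpha smoothed unigram log-probability.

    One extra vocabulary slot is reserved for unseen words.
    """
    vocab = len(counts) + 1
    return math.log((counts.get(word, 0) + alpha) / (total + alpha * vocab))


def candidates(token):
    """Candidate "true" queries: the token itself, or any two-way split."""
    yield (token,)
    for i in range(1, len(token)):
        yield (token[:i], token[i:])


def correct(token, counts, total, p_drop=0.05):
    """Return the candidate maximizing log P(true) + log P(observed | true).

    p_drop is an assumed probability that the noise process deleted the
    space between two words; it is not estimated from data here.
    """
    best, best_score = None, float("-inf")
    for cand in candidates(token):
        lm_score = sum(log_p_word(w, counts, total) for w in cand)
        noise_score = math.log(p_drop if len(cand) == 2 else 1.0 - p_drop)
        score = lm_score + noise_score
        if score > best_score:
            best, best_score = cand, score
    return " ".join(best)


# Toy query log (hypothetical data). The corrupted form is deliberately
# absent, so the language model has never seen "marginalgrowth".
queries = (
    ["marginal growth"] * 50
    + ["marginal cost"] * 30
    + ["growth rate"] * 20
)
counts, total = train_unigram(queries)

print(correct("marginalgrowth", counts, total))
print(correct("growth", counts, total))
```

In a fuller version, `p_drop` and any other corruption probabilities would be estimated from held-out queries, and the language model would then be re-estimated from the corrected queries, alternating the two steps as the reply outlines.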