From: Ted Dunning
Date: Mon, 27 Jul 2009 22:47:59 -0700
Subject: Re: Decompose Compound Words?
To: mahout-user@lucene.apache.org

This is pretty easily done with even a moderately good language model. The basic idea is a noisy channel model: assume there is an underlying "true" query that is corrupted by a noise process into what we actually observe, and then find the most likely "true" query. With some simple assumptions about the noise process, we can estimate a simple language model together with the parameters of the noise model. That lets us reconstruct an estimate of the "true" query for each novel query we receive.

In practice, the noise model does not usually cause massive changes to the query. That means that, as a first approximation, we can use the observed queries themselves to initialize the language model. Then we can alternate between fitting a noise model (using queries held out from the language model estimation) and deriving a sharpened estimate of the language model. For many of these spelling-correction problems, the initial estimates of the language and noise models are good enough to use as is.
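The decoding step can be sketched concretely for the compound-splitting case. The sketch below uses a smoothed unigram language model and a one-parameter noise model (the probability that a space was dropped); all counts and probabilities are made-up illustrations, not values from any real query log:

```python
import math

# Hypothetical unigram counts standing in for a language model
# estimated from observed queries (illustrative values only).
counts = {"marginal": 900, "growth": 1200, "the": 5000}
total = sum(counts.values())

def log_p(word, alpha=0.5):
    # Add-alpha smoothed unigram log-probability; unseen words
    # get a small but nonzero probability.
    return math.log((counts.get(word, 0) + alpha) /
                    (total + alpha * (len(counts) + 1)))

LOG_P_JOIN = math.log(0.01)  # noise model: chance a space was dropped
LOG_P_KEEP = math.log(0.99)  # chance the query came through unchanged

def decompose(token):
    """Return the most likely 'true' query for an observed token:
    either the token itself, or its best two-way split."""
    best, best_score = token, log_p(token) + LOG_P_KEEP
    for i in range(1, len(token)):
        left, right = token[:i], token[i:]
        # Split hypothesis: LM probability of the two words times
        # the noise probability of the space having been dropped.
        score = log_p(left) + log_p(right) + LOG_P_JOIN
        if score > best_score:
            best, best_score = left + " " + right, score
    return best

print(decompose("marginalgrowth"))  # -> "marginal growth"
print(decompose("growth"))          # -> "growth" (no split wins)
```

With real data, the `counts` table would be built from the observed query stream, and the join probability would be re-estimated from held-out queries as described above rather than fixed by hand.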
A very moderate application of heuristics is often very helpful: constraints on the form of corruptions the noise process is allowed to produce, strong prior expectations on the frequency of certain corruptions, or a small set of hand-annotated queries.

On Mon, Jul 27, 2009 at 10:38 PM, Jason Rutherglen <jason.rutherglen@gmail.com> wrote:

> While not a machine learning problem, is decomposing compound words
> (marginalgrowth -> marginal growth) with Hadoop useful in a
> large search app? Lucene has DictionaryCompoundWordTokenFilter,
> but for a larger corpus it seems one would build the
> dictionary first (i.e. build an index), then use the terms
> dictionary as the source for decomposing (and
> probably not all the terms?).
>
> http://www.google.com/search?q=marginalgrowth 41,100 results
> http://www.google.com/search?q=marginal+growth 8,390,000 results
> http://www.google.com/search?q="marginal+growth" 41,100 results
>
> Looks like they're decomposing the query into a phrase query.
> Probably a key -> value lookup on marginalgrowth.

--
Ted Dunning, CTO
DeepDyve