From: Jan Høydahl (JIRA)
To: dev@lucene.apache.org
Reply-To: dev@lucene.apache.org
Date: Mon, 11 Oct 2010 08:35:33 -0400 (EDT)
Message-ID: <26497960.75251286800533238.JavaMail.jira@thor>
Subject: [jira] Created: (SOLR-2150) Anti-phrasing feature

Anti-phrasing feature
---------------------

                 Key: SOLR-2150
                 URL: https://issues.apache.org/jira/browse/SOLR-2150
             Project: Solr
          Issue Type: New Feature
          Components: SearchComponents - other
            Reporter: Jan Høydahl

Add an anti-phrasing feature to Solr.

Definition: Identifying word sequences in queries that do not contribute essentially to the query's meaning, such as "Where can I find" or "Where is." (Source: http://www.google.com/search?q=define%3Aanti+phrasing)

For general-purpose search services, such as web, intranet, or shopping search, some users will phrase their query as a question, such as "how much is an ipod nano". One straightforward way to reduce the number of zero-hit queries in such environments is to apply anti-phrasing: a dictionary of common sentence prefixes is consulted, and any matching prefix is stripped from the incoming query before it is passed on to search.

This can be implemented as a Search Component in Solr. The dictionary can be language independent. We can encourage users to submit their tested anti-phrasing dictionaries for various languages, and include those. The dictionary can be a set of simple .txt files, loaded into memory at startup in an efficient data structure such as a b-tree or finite state automaton to avoid redundancy and ensure quick matching.

The procedure for detecting an anti-phrase in the incoming query is to first look up the full query phrase; if there is no match, remove a word from the end and look up again, repeating until either a match is found or the end of the string is reached.

Example for the query "Who is Einstein?", where "Who is" is defined as an anti-phrase:

1. Lookup "Who is Einstein"
2. Lookup "Who is" (match), remove this prefix
3. Issue the query "Einstein" to search
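
To make the lookup procedure concrete, here is a minimal sketch of the prefix-stripping loop in plain Java. It is only an illustration, not a proposed patch: the class name AntiPhraseStripper and its methods are hypothetical, it is not wired into Solr's SearchComponent API, and it uses a HashSet in place of the b-tree or finite state automaton mentioned above. Lowercasing and dropping trailing punctuation before the lookup are also assumptions.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Hypothetical helper (not part of Solr): strips a leading anti-phrase from a query. */
final class AntiPhraseStripper {
    // Assumption: a plain HashSet stands in for the b-tree / automaton idea.
    private final Set<String> antiPhrases = new HashSet<>();

    AntiPhraseStripper(List<String> phrases) {
        for (String p : phrases) {
            // Normalize dictionary entries: lowercase, single spaces.
            antiPhrases.add(p.trim().toLowerCase(Locale.ROOT).replaceAll("\\s+", " "));
        }
    }

    /**
     * Looks up the full query first, then drops one word from the end per
     * iteration until a dictionary entry matches or no words are left.
     * Returns the query without the matched prefix, or the query unchanged
     * if nothing matches. The result may be empty if the whole query is an
     * anti-phrase; the caller decides what to do in that case.
     */
    String strip(String query) {
        // Assumption: trailing punctuation such as "?" is ignored for matching.
        String cleaned = query.trim().replaceAll("[?!.]+$", "").trim();
        if (cleaned.isEmpty()) {
            return query;
        }
        String[] words = cleaned.split("\\s+");
        for (int end = words.length; end >= 1; end--) {
            String prefix = String.join(" ", Arrays.copyOfRange(words, 0, end))
                    .toLowerCase(Locale.ROOT);
            if (antiPhrases.contains(prefix)) {
                return String.join(" ", Arrays.copyOfRange(words, end, words.length));
            }
        }
        return query;
    }

    public static void main(String[] args) {
        AntiPhraseStripper stripper =
                new AntiPhraseStripper(List.of("who is", "where can i find"));
        System.out.println(stripper.strip("Who is Einstein?")); // prints "Einstein"
    }
}
```

Because the loop starts from the full query and shortens it from the end, the longest matching prefix wins when the dictionary contains overlapping entries such as "how much is" and "how much is an". Each lookup rebuilds the prefix string, which is fine for short queries; an automaton or trie would avoid that work.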