lucene-dev mailing list archives

From Jan Høydahl (JIRA)
Subject [jira] Created: (SOLR-2150) Anti-phrasing feature
Date Mon, 11 Oct 2010 12:35:33 GMT
Anti-phrasing feature

                 Key: SOLR-2150
             Project: Solr
          Issue Type: New Feature
          Components: SearchComponents - other
            Reporter: Jan Høydahl

Add an anti-phrasing feature to Solr.

Definition: Anti-phrasing identifies word sequences in a query that contribute little to
the query's meaning, such as "Where can I find" or "Where is".

For general-purpose search services such as web, intranet, or shopping search, some users will
phrase their query as a question, such as "how much is an ipod nano". One straightforward
way to reduce the number of zero-hit queries in such environments is to apply anti-phrasing:
a dictionary of common sentence prefixes is used to strip those prefixes from the incoming
query before it is passed on to search.

This can be implemented as a Search Component in Solr. The dictionary can be language independent.
We can encourage users to submit their tested anti-phrasing dictionaries for various languages
and include those. The dictionary can be a set of simple .txt files, loaded into memory at startup
into an efficient data structure such as a B-tree or finite state automaton, to avoid redundancy
and ensure quick matching. To detect an anti-phrase in the incoming query, first look up the
full query phrase; if there is no match, remove a word from the end and look up again, repeating
until either a match is found or no words remain. Example for the query "Who is Einstein?",
where "Who is" is defined as an anti-phrase:
1. Look up "Who is Einstein" (no match), remove the last word
2. Look up "Who is" (match), remove this prefix
3. Issue the query "Einstein" to search
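
The lookup procedure above can be sketched in a few lines of Python. This is only an
illustration of the idea, not a proposed Solr implementation: the names ANTI_PHRASES and
strip_anti_phrase are hypothetical, the dictionary is a small in-memory set rather than
.txt files loaded at startup, tokenization is a naive regular expression, and leaving a
query untouched when the whole query is an anti-phrase is an assumption not covered by
the description.

```python
import re

# Hypothetical sample dictionary; the proposal would load this from .txt files.
ANTI_PHRASES = {"who is", "where is", "where can i find", "how much is"}


def strip_anti_phrase(query, anti_phrases=ANTI_PHRASES):
    """Remove the longest anti-phrase prefix from the query, if one matches."""
    # Naive tokenization: keep runs of word characters, drop punctuation.
    words = re.findall(r"\w+", query)
    # Start with the full query, then drop one word from the end per lookup.
    for end in range(len(words), 0, -1):
        candidate = " ".join(w.lower() for w in words[:end])
        if candidate in anti_phrases:
            remainder = words[end:]
            # Assumption: if nothing is left, keep the query rather than search for nothing.
            return " ".join(remainder) if remainder else " ".join(words)
    # No prefix matched: pass the (tokenized) query through unchanged.
    return " ".join(words)


print(strip_anti_phrase("Who is Einstein?"))
print(strip_anti_phrase("how much is an ipod nano"))
```

Because the search starts from the full query and shortens it, the first match is the
longest matching prefix. A set lookup re-joins the words on every attempt; a trie or
finite state automaton, as suggested above, could instead walk the query word by word
in a single pass.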

