lucene-general mailing list archives

From Alex Davidson <alex.david...@bluewire-technologies.com>
Subject How to do prefix/phrase matching with term-length-sensitive scoring?
Date Wed, 03 Mar 2010 15:27:39 GMT
Hi,

We are using Lucene.NET 2.9.1 to index names/addresses/identification
numbers for various classes of person entity. The index is used to
populate a 'live search' poplist control in a web page, as the user
types. The total number of entries to index exceeds a million and the
poplist only displays the top 8, which makes match ranking difficult
since the user expects something useful from as few as three characters
(e.g. two from the surname, plus one initial).

The index is updated in real time. Each update request is added to a
task queue, so that indexing errors cannot roll back database
transactions, and to a transaction log, so that pending index updates
are not lost if the app pool is killed/recycled. There is only one
IndexReader and one IndexWriter per index. The Reader is reopened
periodically via the Writer.

The Question:

Given a list of prefixes, what is the simplest way to match them against
a text field, giving preference to shorter term matches?
 * Short prefixes must contribute usefully to the query, so simply
generating TermQuerys is not viable (see below).
 * Term frequency within the field must be ignored when scoring.
 * Documents and fields are sometimes boosted at index time; norms are
present.
 * The query generator returns only a Query object; it cannot affect the
Filters passed to the Searcher.
 * It would be preferable, though not required, if matching/scoring
based on term position (a la PhraseQuery) were available as an option.



Previous Implementation:

In a previous iteration (in Lucene.NET 2.0), with a single index
directly available to the query generator, we did the following:
 * For each prefix p with length pL:
   * Look up all relevant terms via an IndexReader,
   * Create a list of TermQuerys, weighting each term t (length tL)
according to:
       pL/(tL + tL*(tL - pL))
   * Combine the TermQuery list into a BooleanQuery with Occur.SHOULD
 * Combine the resulting per-prefix BooleanQuery list with Occur.MUST
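
To make the weighting concrete, here is a rough sketch of the arithmetic
only. It is plain Python rather than Lucene.NET code, it makes no Lucene
calls, and the function names and the sample term list are invented for
the illustration. An exact match scores 1, and the weight falls off
quickly as the matched term gets longer than the prefix.

```python
def term_weight(prefix_len, term_len):
    """Weight pL / (tL + tL * (tL - pL)) for a term matched by a prefix."""
    if prefix_len < 1:
        raise ValueError("prefix must not be empty")
    if term_len < prefix_len:
        raise ValueError("a term cannot be shorter than the prefix it matches")
    return prefix_len / (term_len + term_len * (term_len - prefix_len))


def expand_prefix(prefix, terms):
    """Return (term, weight) pairs for terms starting with prefix, best first."""
    matches = [
        (t, term_weight(len(prefix), len(t)))
        for t in terms
        if t.startswith(prefix)
    ]
    return sorted(matches, key=lambda m: m[1], reverse=True)


# Hypothetical terms standing in for what an IndexReader would enumerate.
terms = ["sm", "smy", "smith", "jones"]

for term, weight in expand_prefix("sm", terms):
    print(term, round(weight, 3))
# sm 1.0
# smy 0.333
# smith 0.1
```

In the real query each (term, weight) pair became a TermQuery with that
weight as its boost.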

PrefixQuery appeared to do something similar, but our implementation
also had some logic to prefer filters for very short prefixes that match
lots of terms.

Current Attempt:

Architectural changes mean that this query generation code no longer has
access to an IndexReader and can only use a Searcher (to allow use of
MultiSearchers, etc). Since the Searcher does not expose a Reader, we've
gone back to using PrefixQuery and applied SCORING_BOOLEAN_QUERY_REWRITE
to get the term length weightings.
This is as good as the old implementation accuracy-wise, except that it
doesn't cope well with huge numbers of matching terms for a short
prefix, which is quite a common case.



Is there a way of achieving this within Lucene's existing Query and
scoring implementations, or will I have to roll my own? If the latter,
should I be looking at deriving from Query implementations or scorers?


Thanks!


