lucene-dev mailing list archives

From "Mark Harwood (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-1999) Match spotter for all query types
Date Wed, 21 Oct 2009 14:31:59 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-1999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12768257#action_12768257 ]

Mark Harwood commented on LUCENE-1999:
--------------------------------------

bq. and 2) you need it for every single doc visited by the query

Actually I don't need it for every doc, only the top ones - it just happens to be so cheap
to produce that I can afford to run this in-line with the query. (I haven't actually benchmarked
it at scale, but my gut feel is that it would be fast.)

I was thinking that this might be orthogonal to the existing "free-text" based highlighter.
The logic for this is roughly that:

1) Highlighting of free-text fields is reasonably well-catered for with summarisation etc.
2) The remaining problem areas for highlighting (NumericRangeQuery, Spatial, cached term filters
on enums, e.g. gender:male/female) are all likely to be non-free-text fields which don't require
summarisation and only contain a single value.

I may be wrong in these assumptions about the existing state of play (any thoughts, Mark M?)
but it might be useful to think of attacking the problem with these 2 different requirements
in mind.

Regardless of type e.g. int, long etc I tend to think of fields as falling into these broad
usage categories:

a) "Identifiers" (e.g. primary keys)
b) Quantifiers (e.g. numerics, dates, spatial)
c) Free-text 
d) Controlled vocabularies (e.g. enums such as gender:m/f)

Type a) is catered for by a straight TermQuery and can therefore be handled with the existing
highlighter.
Type b) needs special indexes/queries (spatial/trie) and isn't catered for by the existing
term/span-based Highlighter.
Type c) is catered for by the existing highlighter and its summarising features.
Type d) involves many TermDocs.next() reads, so it is usefully cached as filters and is therefore
not catered for by the existing Highlighter.

So this patch helps cater for types b) and d) where simply knowing the field matched is all
that is required to highlight.
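
As a rough illustration of "simply knowing the field matched", here is a minimal sketch that
turns a set of match flags into the names of the fields to highlight. It is not taken from
matchflagger.patch: the class name, the field names and the one-bit-per-field assignment are
all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not the code in matchflagger.patch.
final class MatchFlagSketch {

    // Assumed assignment: bit 0 = price, bit 1 = location, bit 2 = gender.
    // The field names are invented for illustration only.
    static final String[] FLAGGED_FIELDS = {"price", "location", "gender"};

    // Returns the names of the fields whose bit is set in the flags value.
    static List<String> matchedFields(int flags) {
        List<String> matched = new ArrayList<>();
        for (int bit = 0; bit < FLAGGED_FIELDS.length; bit++) {
            if ((flags & (1 << bit)) != 0) {
                matched.add(FLAGGED_FIELDS[bit]);
            }
        }
        return matched;
    }

    public static void main(String[] args) {
        // Bits 0 and 2 set: the price and gender clauses matched this document.
        System.out.println(matchedFields(0b101));
    }
}
```

For single-valued fields of types b) and d), a list of matched field names like this is enough
for a UI to mark the whole field value as a hit, with no summarisation step.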


> Match spotter for all query types
> ---------------------------------
>
>                 Key: LUCENE-1999
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1999
>             Project: Lucene - Java
>          Issue Type: New Feature
>    Affects Versions: 2.9
>            Reporter: Mark Harwood
>         Attachments: matchflagger.patch
>
>
> Related to LUCENE-1929 and the current inability to highlight NumericRangeQuery, spatial, cached term filters and other exotica.
> This patch provides the ability to wrap *any* Query objects and record match info as flags encoded in the overall document score.
> Using this approach it would be possible to understand (and therefore highlight) which fields matched clauses in a query.
> The match encoding approach loses some precision in scores as noted here: http://tinyurl.com/ykt8nx7
> Avoiding these precision issues would require a change to Lucene core to record docId, score AND a matchFlag byte in ScoreDoc objects and collector APIs.
> This may be something we should consider.
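
To make the precision trade-off concrete, here is one way flags could be packed into a float
score: overwrite the low-order mantissa bits with the flag bits. This is only a sketch under
stated assumptions and may differ from what matchflagger.patch actually does. The class name,
the choice of 8 flag bits (one byte) and the restriction to finite, non-negative scores are
all assumptions.

```java
// Hypothetical sketch, not the code in matchflagger.patch.
final class ScoreFlagSketch {

    // Assumption: one byte of flags, stored in the 8 lowest mantissa bits.
    static final int FLAG_BITS = 8;
    static final int FLAG_MASK = (1 << FLAG_BITS) - 1;

    // Replaces the low mantissa bits of a finite score with the flag bits.
    static float encode(float score, int flags) {
        int bits = Float.floatToIntBits(score);
        return Float.intBitsToFloat((bits & ~FLAG_MASK) | (flags & FLAG_MASK));
    }

    // Reads the flag bits back out of an encoded score.
    static int decodeFlags(float encoded) {
        return Float.floatToIntBits(encoded) & FLAG_MASK;
    }

    // Recovers the score with the flag bits zeroed; the original low bits are gone.
    static float decodeScore(float encoded) {
        return Float.intBitsToFloat(Float.floatToIntBits(encoded) & ~FLAG_MASK);
    }

    public static void main(String[] args) {
        float original = 1.2345678f;
        float encoded = encode(original, 0b0000_0101);
        System.out.println("flags = " + decodeFlags(encoded)); // prints "flags = 5"
        System.out.println("score = " + decodeScore(encoded) + " (was " + original + ")");
    }
}
```

With 8 of the 23 mantissa bits given over to flags, only 15 remain for the score, which is the
kind of precision loss described above. Carrying a separate matchFlag byte alongside docId and
score would leave the score untouched.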

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


