lucene-dev mailing list archives

From "David Smiley (JIRA)"
Subject [jira] [Commented] (LUCENE-7717) UnifiedHighlighter don't work with russian PrefixQuery
Date Tue, 28 Feb 2017 20:22:45 GMT


David Smiley commented on LUCENE-7717:

Here's my take on it:  The UnifiedHighlighter (and the PostingsHighlighter from which it derives)
processes the MultiTermQueries (e.g. wildcards) in the query and creates multiple {{CharacterRunAutomaton}}
instances intended to match the same things.  {{CharacterRunAutomaton}} takes an {{Automaton}} as input,
and when it does its processing, it matches the character code points (integers from 0 to
0x10FFFF) against the integers in the Automaton.  However, this strategy assumes that the
Automaton was constructed based on character code points.  But the automaton from {{AutomatonQuery.getAutomaton}}
is intended to match byte by byte (integers 0 to 255).  {{PrefixQuery.toAutomaton}} will get
2 bytes for the "я" in BytesRef form, and add 2 states.  This does not line up with the
assumptions of CharacterRunAutomaton.
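
To make the mismatch concrete, here is a minimal sketch in plain Java. It is not Lucene code: the class {{ByteVsCodePointDemo}} and its methods are hypothetical, and the "automaton" is reduced to a linear list of transition labels. It only illustrates that "я" (U+044F) is one code point but two UTF-8 bytes, so labels built from bytes cannot be walked with code points.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical illustration only; none of these names exist in Lucene. */
final class ByteVsCodePointDemo {

    /** Transition labels of a linear prefix automaton built from UTF-8 bytes (values 0..255). */
    static int[] utf8Labels(String prefix) {
        byte[] bytes = prefix.getBytes(StandardCharsets.UTF_8);
        int[] labels = new int[bytes.length];
        for (int i = 0; i < bytes.length; i++) {
            labels[i] = bytes[i] & 0xFF; // treat each byte as unsigned
        }
        return labels;
    }

    /** Walks the labels using the code points of the text (what a character-oriented runner does). */
    static boolean prefixMatchesByCodePoint(int[] labels, String text) {
        return startsWith(labels, text.codePoints().toArray());
    }

    /** Walks the labels using the UTF-8 bytes of the text (what a byte-oriented runner does). */
    static boolean prefixMatchesByUtf8Byte(int[] labels, String text) {
        return startsWith(labels, utf8Labels(text));
    }

    /** True if the input begins with exactly the given sequence of labels. */
    private static boolean startsWith(int[] labels, int[] input) {
        if (input.length < labels.length) {
            return false;
        }
        for (int i = 0; i < labels.length; i++) {
            if (labels[i] != input[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String ya = "\u044F"; // the Cyrillic letter from the issue report
        int[] labels = utf8Labels(ya);
        System.out.println("labels from bytes: " + labels.length);
        System.out.println("code point walk:   " + prefixMatchesByCodePoint(labels, ya));
        System.out.println("UTF-8 byte walk:   " + prefixMatchesByUtf8Byte(labels, ya));
    }
}
```

For ASCII-only prefixes the two walks agree, because each ASCII character is a single byte with the same value as its code point. That would explain why the problem only shows up with non-ASCII terms.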

A short term immediate "fix" is simply to put AutomatonQuery last in the if-else list as Dmitry
indicated.  As such, PrefixQuery will work again.  This was broken by LUCENE-6367 (Lucene
5.1).  TermRangeQuery, which also now extends AutomatonQuery, will likewise work -- broken
by LUCENE-5879 (Lucene 5.2).  Again, back when MultiTermHighlighting was first written, neither
of those queries extended AutomatonQuery.  _But there will be bugs for other types of AutomatonQuery
(namely WildcardQuery and RegexpQuery) that have yet to be reported._

[~rcmuir] or [~mikemccand] I wonder if you have any thoughts on how to fix this.  An idea
I have is to _not_ use a CharacterRunAutomaton in the UnifiedHighlighter; use a ByteRunAutomaton
instead.  Then, add {{[] ...etc)}} that converts each character to
the equivalent UTF8 bytes to match.  Even with that, I wonder if this points to areas to improve
the automata API so that people don't bump into this trap in the future.  For example, maybe
have the Automaton self-report whether it's byte oriented, Unicode code point oriented, or something
custom.  Then, RunAutomaton could throw an exception if there is a mismatch.  However, that
would be a runtime error; maybe the Automata could be typed.
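
A rough sketch of the self-reporting idea, in plain Java rather than against the real Lucene classes. Everything here is hypothetical ({{LabelKind}}, {{LinearAutomaton}}, {{CodePointRunner}}), and the automaton is again simplified to a linear list of labels. It shows only the runtime check: a code point runner refuses an automaton that declares byte labels.

```java
/** Hypothetical design sketch; these types are not part of the Lucene automaton API. */
final class TypedAutomatonSketch {

    /** What the integers on the transitions mean. */
    enum LabelKind { UTF8_BYTE, UNICODE_CODE_POINT }

    /** A simplified automaton that accepts exactly one sequence of labels. */
    static final class LinearAutomaton {
        final LabelKind kind;
        final int[] labels;

        LinearAutomaton(LabelKind kind, int[] labels) {
            this.kind = kind;
            this.labels = labels.clone();
        }
    }

    /** Runs an automaton against the code points of a string. */
    static final class CodePointRunner {
        private final LinearAutomaton automaton;

        CodePointRunner(LinearAutomaton automaton) {
            // Fail fast instead of silently never matching.
            if (automaton.kind != LabelKind.UNICODE_CODE_POINT) {
                throw new IllegalArgumentException(
                    "expected code point labels but automaton reports " + automaton.kind);
            }
            this.automaton = automaton;
        }

        /** True if the text's code points equal the automaton's label sequence. */
        boolean run(String text) {
            int[] input = text.codePoints().toArray();
            if (input.length != automaton.labels.length) {
                return false;
            }
            for (int i = 0; i < input.length; i++) {
                if (input[i] != automaton.labels[i]) {
                    return false;
                }
            }
            return true;
        }
    }
}
```

The typed alternative would move the same distinction into the type system (for example, separate automaton types per label kind), so that the mismatch is rejected at compile time rather than by an exception.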

Anyway, what I'd like to do is a short-term fix that addresses many common cases and the
title of this issue.  And then do a more thorough fix in a follow-on issue.  [~ichattopadhyaya]
do you think this could go into 6.4.2 or are you only looking for "critical" issues?  It's
debatable what's critical and not.  This bug has been around since 5.1 so perhaps it isn't.

(a patch will follow shortly)

> UnifiedHighlighter don't work with russian PrefixQuery
> ------------------------------------------------------
>                 Key: LUCENE-7717
>                 URL:
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: modules/highlighter
>    Affects Versions: 6.3, 6.4.1
>            Reporter: Dmitry Malinin
>            Assignee: David Smiley
>         Attachments: LUCENE-7717.patch
> UnifiedHighlighter highlighter = new UnifiedHighlighter(null, new StandardAnalyzer());
> Query query = new PrefixQuery(new Term("title", "я"));
> String testData = "я";
> Object snippet = highlighter.highlightWithoutSearcher("title", query, testData, 1);
> System.out.printf("testData=[%s] Query=%s snippet=[%s]\n", testData, query, snippet==null?null:snippet.toString());
