lucene-solr-user mailing list archives

From Steve Rowe <sar...@gmail.com>
Subject Re: SynonymGraphFilterFactory with WordDelimiterGraphFilterFactory usage
Date Mon, 05 Feb 2018 16:27:02 GMT
Hi Александр,

> On Feb 5, 2018, at 11:19 AM, Shawn Heisey <apache@elyograg.org> wrote:
> 
> There should be no problem with using them together.

I believe Shawn is wrong.

From <http://lucene.apache.org/core/7_2_0/analyzers-common/org/apache/lucene/analysis/synonym/SynonymGraphFilter.html>:

> NOTE: this cannot consume an incoming graph; results will be undefined.

Unfortunately, the ref guide entry for Synonym Graph Filter <https://lucene.apache.org/solr/guide/7_2/filter-descriptions.html#synonym-graph-filter>
doesn’t include a warning about this, but it should, like the warning on Word Delimiter
Graph Filter <https://lucene.apache.org/solr/guide/7_2/filter-descriptions.html#word-delimiter-graph-filter>:

> Note: although this filter produces correct token graphs, it cannot consume an input
> token graph correctly.

(I’ve just committed a change to the ref guide source to add this also on the Synonym Graph
Filter and Managed Synonym Graph Filter entries, to be included in the ref guide for Solr
7.3.)

In short, the combination of the two filters is not supported, because WDGF produces a token
graph, which SGF cannot correctly interpret.
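
To make "token graph" concrete, here is a minimal sketch in plain Python (not Lucene code) of the token stream Shawn describes below for the input "b2". The `Token` class and the helper names are hypothetical and exist only for illustration; the attribute values are taken from Shawn's reading of the analysis screenshot, and treating any position length greater than 1 as the sign of a graph is a simplifying assumption.

```python
from dataclasses import dataclass


@dataclass
class Token:
    term: str
    pos_inc: int      # position increment relative to the previous token
    pos_len: int = 1  # number of positions the token spans


def to_arcs(tokens):
    """Turn (increment, length) attributes into (term, start, end) arcs."""
    arcs = []
    pos = 0  # with a first increment of 1, positions are 1-based as in the Solr analysis screen
    for tok in tokens:
        pos += tok.pos_inc
        arcs.append((tok.term, pos, pos + tok.pos_len))
    return arcs


def is_graph(tokens):
    """Simplification: call the stream a graph if any token spans more than one position."""
    return any(tok.pos_len > 1 for tok in tokens)


# Assumed WDGF output for "b2" with preserveOriginal="1", per Shawn's description:
# the original token spans two positions, the split parts one position each.
wdgf_output = [
    Token("b2", pos_inc=1, pos_len=2),
    Token("b", pos_inc=0),
    Token("2", pos_inc=1),
]

for term, start, end in to_arcs(wdgf_output):
    print(f"{term}: {start} -> {end}")

print("graph input for the next filter:", is_graph(wdgf_output))
```

In this model the "b2" arc runs in parallel with the path "b" then "2". A filter that ignores position length on its input sees only a flat sequence of positions, which is the situation the SynonymGraphFilter javadoc warns about.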

Other filters also have this issue; see e.g. <https://issues.apache.org/jira/browse/LUCENE-3475>
for ShingleFilter. That issue has gotten some attention recently, and hopefully it will inspire
fixes elsewhere.

Patches welcome!

--
Steve
www.lucidworks.com


> On Feb 5, 2018, at 11:19 AM, Shawn Heisey <apache@elyograg.org> wrote:
> 
> On 2/5/2018 3:55 AM, Александр Шестак wrote:
>> 
>> Hi, I am unsure about the usage of SynonymGraphFilterFactory
>> and WordDelimiterGraphFilterFactory. Can they be used together?
>> 
> 
> There should be no problem with using them together.  But it is always
> possible that the behavior will surprise you, while working 100% as
> designed.
> 
>> I have a Solr field type configured as follows:
>> 
>> <fieldtype name="fulltext_en" class="solr.TextField"
>>            autoGeneratePhraseQueries="true">
>>   <analyzer type="index">
>>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>     <filter class="solr.WordDelimiterGraphFilterFactory"
>>             generateWordParts="1" generateNumberParts="1" splitOnNumerics="1"
>>             catenateWords="1" catenateNumbers="1" catenateAll="0"
>>             preserveOriginal="1" protected="protwords_en.txt"/>
>>     <filter class="solr.FlattenGraphFilterFactory"/>
>>   </analyzer>
>>   <analyzer type="query">
>>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>     <filter class="solr.WordDelimiterGraphFilterFactory"
>>             generateWordParts="1" generateNumberParts="1" splitOnNumerics="1"
>>             catenateWords="0" catenateNumbers="0" catenateAll="0"
>>             preserveOriginal="1" protected="protwords_en.txt"/>
>>     <filter class="solr.LowerCaseFilterFactory"/>
>>     <filter class="solr.SynonymGraphFilterFactory"
>>             synonyms="synonyms_en.txt" ignoreCase="true" expand="true"/>
>>   </analyzer>
>> </fieldtype>
>> 
>> So at query time it uses SynonymGraphFilterFactory after
>> WordDelimiterGraphFilterFactory.
>> Synonyms are configured as follows:
>> b=>b,boron
>> 2=>ii,2
>> 
>> I checked the query in the Solr analysis tool. It shows that the terms
>> after SGF have positions 3 and 4. Is that correct? I thought they should
>> have positions 1 and 2.
>> 
> 
> What matters is the *relative* positions.  The exact position number
> doesn't matter much.  Something new that the Graph implementations use
> is the position length.  That feature is necessary for multi-term
> synonyms to function correctly in phrase queries.
> 
> In your analysis screenshot, WDGF creates three tokens.  The two tokens
> created by splitting the input are at positions 1 and 2, which I think
> is 100% as expected.  It also sets the positionLength of the first term
> to 2, probably because it has split that term into 2 additional terms.
> 
> Then the SGF takes those last two terms and expands them.  Each of the
> synonyms is at the same position as the original term, and the relative
> positions of the two synonym pairs have not changed -- the second one is
> still one higher than the first.  I think the reason that SGF moves the
> positions two higher is that the positionLength on the "b2" term is
> 2, previously set by WDGF.  Someone with more knowledge about the Graph
> implementations may have to speak up as to whether this behavior is correct.
> 
> Because the relative positions of the split terms don't change when SGF
> runs, I think this is probably working as designed.
> 
> Thanks,
> Shawn

