lucene-solr-user mailing list archives

From Jay Potharaju <jspothar...@gmail.com>
Subject Re: SynonymGraphFilterFactory with WordDelimiterGraphFilterFactory usage
Date Tue, 13 Mar 2018 20:37:33 GMT
I am upgrading to Solr 6.6.3, and one of my fields uses text_en_splitting.
Are there any recommendations on how to adjust the fieldtype definition for
such fields?

Thanks
Jay Potharaju


On Wed, Feb 7, 2018 at 5:09 AM, Steve Rowe <sarowe@gmail.com> wrote:

> Thanks Webster,
>
> I created https://issues.apache.org/jira/browse/SOLR-11955 to work on
> this.
>
> --
> Steve
> www.lucidworks.com
>
> > On Feb 6, 2018, at 2:47 PM, Webster Homer <webster.homer@sial.com> wrote:
> >
> > I noticed that in some of the current example schemas that are shipped with
> > Solr, there is a fieldtype, text_en_splitting, that feeds the output
> > of SynonymGraphFilterFactory into WordDelimiterGraphFilterFactory. So if
> > this isn't supported, the example should probably be updated or removed.
> >
> > On Mon, Feb 5, 2018 at 10:27 AM, Steve Rowe <sarowe@gmail.com> wrote:
> >
> >> Hi Александр,
> >>
> >>> On Feb 5, 2018, at 11:19 AM, Shawn Heisey <apache@elyograg.org> wrote:
> >>>
> >>> There should be no problem with using them together.
> >>
> >> I believe Shawn is wrong.
> >>
> >> From <http://lucene.apache.org/core/7_2_0/analyzers-common/org/apache/lucene/analysis/synonym/SynonymGraphFilter.html>:
> >>
> >>> NOTE: this cannot consume an incoming graph; results will be undefined.
> >>
> >> Unfortunately, the ref guide entry for Synonym Graph Filter
> >> <https://lucene.apache.org/solr/guide/7_2/filter-descriptions.html#synonym-graph-filter>
> >> doesn’t include a warning about this, but it should, like the warning on
> >> Word Delimiter Graph Filter
> >> <https://lucene.apache.org/solr/guide/7_2/filter-descriptions.html#word-delimiter-graph-filter>:
> >>
> >>> Note: although this filter produces correct token graphs, it cannot
> >>> consume an input token graph correctly.
> >>
> >> (I’ve just committed a change to the ref guide source to add this also on
> >> the Synonym Graph Filter and Managed Synonym Graph Filter entries, to be
> >> included in the ref guide for Solr 7.3.)
> >>
> >> In short, the combination of the two filters is not supported, because
> >> WDGF produces a token graph, which SGF cannot correctly interpret.
> >>
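
As a minimal sketch of one way to avoid chaining the two graph filters, the
word-delimiter and synonym handling could be split across two field types, so
that each analyzer contains at most one graph-producing filter. The type names
fulltext_en_wd and fulltext_en_syn are hypothetical, the file names are taken
from the configuration quoted further down in this thread, and whether this
split gives acceptable relevance for a given index is an assumption that would
need testing:

```xml
<!-- Hypothetical type: word-delimiter splitting only, no synonyms. -->
<fieldType name="fulltext_en_wd" class="solr.TextField"
           autoGeneratePhraseQueries="true">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterGraphFilterFactory"
            generateWordParts="1" generateNumberParts="1"
            splitOnNumerics="1"
            catenateWords="1" catenateNumbers="1" catenateAll="0"
            preserveOriginal="1" protected="protwords_en.txt"/>
    <!-- Lowercasing on both sides so index and query terms match
         (an addition relative to the quoted configuration). -->
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- The index-time token graph is flattened before it is written. -->
    <filter class="solr.FlattenGraphFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterGraphFilterFactory"
            generateWordParts="1" generateNumberParts="1"
            splitOnNumerics="1"
            catenateWords="0" catenateNumbers="0" catenateAll="0"
            preserveOriginal="1" protected="protwords_en.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- Hypothetical type: query-time synonyms only, no word-delimiter filter,
     so SynonymGraphFilter never receives an incoming graph. -->
<fieldType name="fulltext_en_syn" class="solr.TextField"
           autoGeneratePhraseQueries="true">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SynonymGraphFilterFactory"
            synonyms="synonyms_en.txt" ignoreCase="true" expand="true"/>
  </analyzer>
</fieldType>
```

Both fields could then be populated from the same source with copyField and
searched together, for example through an edismax qf list. Note the trade-off:
with this split, synonyms are matched against whole whitespace-delimited
tokens, so rules such as b=>b,boron would no longer apply to the parts that
the word-delimiter filter splits out of a token like b2.
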
> >> Other filters also have this issue, see e.g.
> >> <https://issues.apache.org/jira/browse/LUCENE-3475> for ShingleFilter; this
> >> issue has gotten some attention recently, and hopefully it will inspire
> >> fixes elsewhere.
> >>
> >> Patches welcome!
> >>
> >> --
> >> Steve
> >> www.lucidworks.com
> >>
> >>
> >>> On Feb 5, 2018, at 11:19 AM, Shawn Heisey <apache@elyograg.org> wrote:
> >>>
> >>> On 2/5/2018 3:55 AM, Александр Шестак wrote:
> >>>>
> >>>> Hi, I have a question about the usage of SynonymGraphFilterFactory
> >>>> and WordDelimiterGraphFilterFactory. Can they be used together?
> >>>>
> >>>
> >>> There should be no problem with using them together.  But it is always
> >>> possible that the behavior will surprise you, while working 100% as
> >>> designed.
> >>>
> >>>> I have a Solr field type configured as follows:
> >>>>
> >>>> <fieldtype name="fulltext_en" class="solr.TextField"
> >>>>            autoGeneratePhraseQueries="true">
> >>>>  <analyzer type="index">
> >>>>    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> >>>>    <filter class="solr.WordDelimiterGraphFilterFactory"
> >>>>            generateWordParts="1" generateNumberParts="1"
> >>>>            splitOnNumerics="1"
> >>>>            catenateWords="1" catenateNumbers="1" catenateAll="0"
> >>>>            preserveOriginal="1" protected="protwords_en.txt"/>
> >>>>    <filter class="solr.FlattenGraphFilterFactory"/>
> >>>>  </analyzer>
> >>>>  <analyzer type="query">
> >>>>    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> >>>>    <filter class="solr.WordDelimiterGraphFilterFactory"
> >>>>            generateWordParts="1" generateNumberParts="1"
> >>>>            splitOnNumerics="1"
> >>>>            catenateWords="0" catenateNumbers="0" catenateAll="0"
> >>>>            preserveOriginal="1" protected="protwords_en.txt"/>
> >>>>    <filter class="solr.LowerCaseFilterFactory"/>
> >>>>    <filter class="solr.SynonymGraphFilterFactory"
> >>>>            synonyms="synonyms_en.txt" ignoreCase="true"
> >>>>            expand="true"/>
> >>>>  </analyzer>
> >>>> </fieldtype>
> >>>>
> >>>> So at query time it uses SynonymGraphFilterFactory after
> >>>> WordDelimiterGraphFilterFactory.
> >>>> Synonyms are configured as follows:
> >>>> b=>b,boron
> >>>> 2=>ii,2
> >>>>
> >>>> I ran the query through the Solr analysis tool. It shows that the
> >>>> terms after SGF have positions 3 and 4. Is that correct? I thought
> >>>> they should have positions 1 and 2.
> >>>>
> >>>
> >>> What matters is the *relative* positions.  The exact position number
> >>> doesn't matter much.  Something new that the Graph implementations use
> >>> is the position length.  That feature is necessary for multi-term
> >>> synonyms to function correctly in phrase queries.
> >>>
> >>> In your analysis screenshot, WDGF creates three tokens.  The two tokens
> >>> created by splitting the input are at positions 1 and 2, which I think
> >>> is 100% as expected.  It also sets the positionLength of the first term
> >>> to 2, probably because it has split that term into 2 additional terms.
> >>>
> >>> Then the SGF takes those last two terms and expands them.  Each of the
> >>> synonyms is at the same position as the original term, and the relative
> >>> positions of the two synonym pairs have not changed -- the second one is
> >>> still one higher than the first.  I think the reason that SGF moves the
> >>> positions two higher is because the positionLength on the "b2" term is
> >>> 2, previously set by WDGF.  Someone with more knowledge about the Graph
> >>> implementations may have to speak up as to whether this behavior is
> >>> correct.
> >>>
> >>> Because the relative positions of the split terms don't change when SGF
> >>> runs, I think this is probably working as designed.
> >>>
> >>> Thanks,
> >>> Shawn
> >>
> >>
> >
>
>
