lucene-java-user mailing list archives

From Michael McCandless <luc...@mikemccandless.com>
Subject Re: SynonymFilterFactory deprecated since 6.4.0
Date Tue, 14 Feb 2017 16:18:20 GMT
Here's the new blog post I mentioned earlier in the thread, trying to
explain the recent changes to make multi-token synonyms work ... it
just went out today:
https://www.elastic.co/blog/multitoken-synonyms-and-graph-queries-in-elasticsearch

Mike McCandless

http://blog.mikemccandless.com


On Tue, Feb 14, 2017 at 6:19 AM, Michael McCandless
<lucene@mikemccandless.com> wrote:
> Hi Bernd,
>
> Actually, pos (which is just the accumulation of
> PositionIncrementAttribute, starting with -1) is the *start* node.
>
> The end node is then pos + PositionLengthAttribute.
>
> As far as I know, ShingleFilter is not yet graph friendly: it does not
> set PositionLengthAttribute.  But you could visualize how it should be
> setting it ...
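
To make the arithmetic concrete, here is a small Python sketch (not Lucene code; the function name `token_arcs` and the tuple layout are made up for illustration). It accumulates position increments starting from -1 to get each token's start node, then adds the position length to get the end node. The sample input is the "natural forest" case as a graph-aware ShingleFilter might emit it, assuming the shingle is given a position length of 2:

```python
def token_arcs(tokens):
    """Turn (term, position_increment, position_length) tuples into arcs.

    Returns (term, start_node, end_node) tuples, where start_node is the
    running sum of position increments (starting from -1) and end_node is
    start_node + position_length.
    """
    pos = -1
    arcs = []
    for term, pos_inc, pos_len in tokens:
        pos += pos_inc
        arcs.append((term, pos, pos + pos_len))
    return arcs


# Hypothetical graph-aware shingle output for the query "natural forest":
# the shingle leaves from the same position as "natural" (increment 0)
# and spans two positions.
shingle_tokens = [
    ("natural", 1, 1),
    ("natural forest", 0, 2),
    ("forest", 1, 1),
]

for term, start, end in token_arcs(shingle_tokens):
    print(f"({start})--- {term} --->({end})")
```

Under these assumptions the shingle becomes a single arc from node 0 to node 2, running alongside the two single-word arcs 0 to 1 and 1 to 2.
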
>
> Also, note that synonym filter cannot handle an incoming graph
> properly, so if you run ShingleFilter before it, it's not going to do
> the right thing.  For that we need something like
> https://issues.apache.org/jira/browse/LUCENE-5012 ... the branch for
> that issue already has a SynonymFilter that accepts incoming graphs,
> but it's a biggish change.
>
> The docs are indeed outdated; I'll repair them.  Thank you!
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> On Tue, Feb 14, 2017 at 2:41 AM, Bernd Fehling
> <bernd.fehling@uni-bielefeld.de> wrote:
>> Now it is getting clearer.
>>
>> "pos" (aka position) starts at "-1" and its highest number is the
>> last "node id" of the graph.
>>
>> "pos" minus "positionLength" is the starting "node id" of the arc.
>>
>> Is the tokenStream after each filter always a valid graph?
>>
>> E.g. ShingleFilter with query "natural forest":
>> SF        text start  end  positionLength  type    position
>>        natural 0      7    1               word    1
>> natural forest 0      14   2               shingle 1
>>         forest 8      14   1               word    2
>>
>> (0)--- natural --->(1)--- forest --->(2)
>> But how to insert the shingle into this graph?
>>
>> This is why I added a SynonymPreFilter to correct the graph between
>> ShingleFilter and SynonymGraphFilter. But I had the wrong understanding
>> of pos, positionIncrement, positionLength,...
>>
>>
>> Another question, the API docs say "...Injecting synonyms – here,
>> synonyms of a token should be added after that token..."
>> But as I already mentioned the synonyms are added before the token.
>> Are the docs outdated?
>>
>>
>> Regards
>> Bernd
>>
>>
>> Am 13.02.2017 um 17:31 schrieb Michael McCandless:
>>> On Mon, Feb 13, 2017 at 9:04 AM, Bernd Fehling
>>> <bernd.fehling@uni-bielefeld.de> wrote:
>>>
>>>> Am I confused by the naming of pos, positionIncrement, offset, positionLength,
>>>> start and end between Lucene and Solr?
>>>
>>> "pos" is just accumulating the positionIncrement values, starting from
>>> -1.  I don't think Solr's analysis UI would change the meaning of
>>> these attributes.
>>>
>>>> OK, the SynonymGraphFilter is ONLY for Lucene, right?
>>>
>>> No, it's also for Solr and Elasticsearch and any other search servers
>>> on top of Lucene as well.
>>>
>>>> But how are you going to build the multi-word synonym query
>>>> "natürlicher wald" from "natural forest"?
>>>
>>> Lucene's and Elasticsearch's query parsers have already been fixed to
>>> correctly handle token graphs by default; Solr has a fork of Lucene's
>>> query parser I think ... I'm not sure if it's been fixed yet to
>>> interpret graphs.
>>>
>>> See e.g. https://issues.apache.org/jira/browse/LUCENE-7603 and
>>> https://issues.apache.org/jira/browse/LUCENE-7638
>>>
>>>> And how are you going to highlight a synonym hit for "natürlicher wald"
>>>> when start and end are set to 0-14 and not to 0-18?
>>>> Or are start and end not used for highlighting?
>>>
>>> This start/end offset, at query time, is not normally used.  If you
>>> have a document in the index that has "natürlicher wald" then it would
>>> have offsets X to X+18, stored in the index ideally as postings
>>> offsets, and should highlight correctly?
>>>
>>> Mike McCandless
>>>
>>> http://blog.mikemccandless.com
>>>
>>>> Am 13.02.2017 um 14:24 schrieb Michael McCandless:
>>>>> Unfortunately, I cannot reproduce the problem with a straight Lucene
>>>>> test case.  I added this test case to TestSynonymGraphFilter.java:
>>>>>
>>>>>     https://gist.github.com/mikemccand/318459ca507742052688e2fe800a10dd
>>>>>
>>>>> And when I run it, it produces the correct token graph:
>>>>>
>>>>> TOKEN: naturwald
>>>>>   offset: 0-14
>>>>>   pos: 0-4
>>>>>   type: SYNONYM
>>>>>
>>>>> TOKEN: forêt
>>>>>   offset: 0-14
>>>>>   pos: 0-1
>>>>>   type: SYNONYM
>>>>>
>>>>> TOKEN: natürlicher
>>>>>   offset: 0-14
>>>>>   pos: 0-2
>>>>>   type: SYNONYM
>>>>>
>>>>> TOKEN: natural
>>>>>   offset: 0-7
>>>>>   pos: 0-3
>>>>>   type: word
>>>>>
>>>>> TOKEN: naturelle
>>>>>   offset: 0-14
>>>>>   pos: 1-4
>>>>>   type: SYNONYM
>>>>>
>>>>> TOKEN: wald
>>>>>   offset: 0-14
>>>>>   pos: 2-4
>>>>>   type: SYNONYM
>>>>>
>>>>> TOKEN: forest
>>>>>   offset: 8-14
>>>>>   pos: 3-4
>>>>>   type: word
>>>>>
>>>>> Remember that the "pos: " output above is really "node IDs" and you
>>>>> can see the inserted side paths are correct.  The offsets are
>>>>> necessarily always 0-14 for inserted tokens because that is the span
>>>>> of the two original tokens.
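
One way to check the side paths is to treat each token as an arc between two node IDs and enumerate every route from the first node to the last. This is a plain Python sketch, not anything in Lucene; `all_paths` and the arc tuples are illustrative names, with the node IDs copied from the "pos:" lines above:

```python
from collections import defaultdict


def all_paths(arcs, start, end):
    """Enumerate every term sequence leading from node `start` to node `end`.

    `arcs` is a list of (term, from_node, to_node) tuples with
    from_node < to_node, so the graph is acyclic and recursion terminates.
    """
    outgoing = defaultdict(list)
    for term, src, dst in arcs:
        outgoing[src].append((term, dst))

    def walk(node):
        if node == end:
            return [[]]
        paths = []
        for term, dst in outgoing[node]:
            for rest in walk(dst):
                paths.append([term] + rest)
        return paths

    return walk(start)


# Arcs taken from the "pos:" ranges in the test output.
arcs = [
    ("naturwald", 0, 4),
    ("forêt", 0, 1),
    ("natürlicher", 0, 2),
    ("natural", 0, 3),
    ("naturelle", 1, 4),
    ("wald", 2, 4),
    ("forest", 3, 4),
]

for path in all_paths(arcs, 0, 4):
    print(" ".join(path))
```

Each of the four paths spells out exactly one entry from the synonym rule (naturwald, forêt naturelle, natürlicher wald, natural forest), which is what a correct token graph should give.
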
>>>>>
>>>>> Can you try removing the SPF filters in your test?  Or otherwise
>>>>> simplify your test so it's closer to what my test case is doing?
>>>>>
>>>>> Mike McCandless
>>>>>
>>>>> http://blog.mikemccandless.com
>>>>>
>>>>> On Mon, Feb 13, 2017 at 7:52 AM, Michael McCandless
>>>>> <lucene@mikemccandless.com> wrote:
>>>>>> Thanks Bernd; I'll see if I can make a test case from this.
>>>>>>
>>>>>> Mike McCandless
>>>>>>
>>>>>> http://blog.mikemccandless.com
>>>>>>
>>>>>>
>>>>>> On Mon, Feb 13, 2017 at 5:00 AM, Bernd Fehling
>>>>>> <bernd.fehling@uni-bielefeld.de> wrote:
>>>>>>> My very simple and small sysonym_test.txt has only one line:
>>>>>>> naturwald, natural\ forest, forêt\ naturelle, natürlicher\ wald
>>>>>>>
>>>>>>> If I only use WT (WhitespaceTokenizer) and SGF (with WhitespaceTokenizer)
>>>>>>> the result is:
>>>>>>>
>>>>>>> WT      text     start  end  positionLength  type  position
>>>>>>>      natural     0      7    1               word  1
>>>>>>>       forest     8      14   1               word  2
>>>>>>>
>>>>>>> SGF     text     start  end  positionLength  type     position
>>>>>>>      natural     0      7    3               word     1
>>>>>>>    naturelle     0      14   3               SYNONYM  2
>>>>>>>         wald     0      14   2               SYNONYM  3
>>>>>>>    naturwald     0      14   4               SYNONYM  1
>>>>>>>        forêt     0      14   1               SYNONYM  1
>>>>>>>  natürlicher     0      14   2               SYNONYM  1
>>>>>>>
>>>>>>>       forest     8      14   1               word     4
>>>>>>>
>>>>>>> The result is some kind of rubbish.
>>>>>>> Also note the empty line between "natürlicher" and "forest".
>>>>>>>
>>>>>>> Anything else I should try, maybe with KeywordTokenizer?
>>>>>>>
>>>>>>> p.s. You might have noticed the SPF filters in my setup.
>>>>>>>      The first is SynonymPreFilter, which sets all attributes to the
>>>>>>>      right values; the second is SynonymPostFilter, which fixes the
>>>>>>>      attribute settings again, marks multi-word synonyms as phrases,
>>>>>>>      and cleans up the result of SGF.
>>>>>>>
>>>>>>> Regards
>>>>>>> Bernd
>>>>>>>
>>>>>>> Am 11.02.2017 um 00:45 schrieb Michael McCandless:
>>>>>>>> Yeah, those tokens should have position length 2.
>>>>>>>>
>>>>>>>> Can you reduce to a small set of synonyms and text?  If you use
>>>>>>>> only whitespace tokenizer and SGF, does the issue reproduce?
>>>>>>>>
>>>>>>>> Mike McCandless
>>>>>>>>
>>>>>>>> http://blog.mikemccandless.com
>>>>>>>>
>>>>>>>>
>>>>>>>> On Fri, Feb 10, 2017 at 10:07 AM, Bernd Fehling
>>>>>>>> <bernd.fehling@uni-bielefeld.de> wrote:
>>>>>>>>> Example for position end and positionLength of SGF.
>>>>>>>>>
>>>>>>>>> query: natural forest
>>>>>>>>>
>>>>>>>>> WT      text     start  end  positionLength  type  position
>>>>>>>>>         natural  0      7    1               word  1
>>>>>>>>>         forest   8      14   1               word  2
>>>>>>>>> ...
>>>>>>>>>
>>>>>>>>> SPF     text                start  end  positionLength  type     position
>>>>>>>>>         natural             0      7    1               word     1
>>>>>>>>>         natural forest      0      14   2               shingle  2
>>>>>>>>>         forest              8      14   1               word     3
>>>>>>>>>
>>>>>>>>> SGF     text                start  end  positionLength  type     position
>>>>>>>>>         natural             0      7    1               word     1
>>>>>>>>>         naturwald           0      14   1               SYNONYM  2
>>>>>>>>>         forêt naturelle     0      14   1               SYNONYM  2
>>>>>>>>>         natürlicher wald    0      14   1               SYNONYM  2
>>>>>>>>>         natural forest      0      14   1               shingle  2
>>>>>>>>>         forest              8      14   1               word     3
>>>>>>>>>
>>>>>>>>> SPF     text                start  end  positionLength  type     position
>>>>>>>>>         natural             0      7    1               word     1
>>>>>>>>>         naturwald           0      9    1               SYNONYM  2
>>>>>>>>>         "forêt naturelle"   0      17   2               SYNONYM  2
>>>>>>>>>         "natürlicher wald"  0      18   2               SYNONYM  2
>>>>>>>>>         "natural forest"    0      16   2               shingle  2
>>>>>>>>>         forest              8      14   1               word     3
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> SGF (SynonymGraphFilter) gives all SYNONYM tokens the same end
>>>>>>>>> position and positionLength.
>>>>>>>>> I suppose that is not correct?
>>>>>>>>>
>>>>>>>>> Regards
>>>>>>>>> Bernd
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Am 09.02.2017 um 18:39 schrieb Michael McCandless:
>>>>>>>>>> On Thu, Feb 9, 2017 at 2:40 AM, Bernd Fehling
>>>>>>>>>> <bernd.fehling@uni-bielefeld.de> wrote:
>>>>>>>>>>> I tried SynonymGraphFilter with my setup and it works right away.
>>>>>>>>>>> It paid off that I did some modifications on my filters while
>>>>>>>>>>> testing 6.3 with my setup.
>>>>>>>>>>
>>>>>>>>>> Good!
>>>>>>>>>>
>>>>>>>>>>> I only replaced SynonymFilter with SynonymGraphFilter and did not
>>>>>>>>>>> use FlattenGraphFilter, pretty simple. So I can confirm that, up
>>>>>>>>>>> to this point, SynonymGraphFilter is a full replacement for
>>>>>>>>>>> SynonymFilter, at least for search-time synonym handling.
>>>>>>>>>>>
>>>>>>>>>>> But this also means there is still some work with the attributes, right?
>>>>>>>>>>> Position looks good, type and start are no problem anyway, but
>>>>>>>>>>> the end position is still wrong, as is the positionLength for
>>>>>>>>>>> multi-word synonyms.
>>>>>>>>>>
>>>>>>>>>> Can you give an example or make a small test case?
>>>>>>>>>> PositionLengthAttribute is supposed to be correct coming out of
>>>>>>>>>> SynonymGraphFilter.
>>>>>>>>>>
>>>>>>>>>>> One thing I noticed was that the originating token which "produces"
>>>>>>>>>>> synonyms comes out last from SynonymGraphFilter, after the
>>>>>>>>>>> "produced" synonyms.
>>>>>>>>>>> I will have a look inside with debugger but I guess this is due
>>>>>>>>>>> to output buffering of SynonymGraphFilter?
>>>>>>>>>>
>>>>>>>>>> Yeah they do come out in a different order, which token filters are
>>>>>>>>>> allowed to do in general for all tokens leaving from the same position
>>>>>>>>>> ...
>>>>>>>>>>
>>>>>>>>>> Mike McCandless
>>>>>>>>>>
>>>>>>>>>> http://blog.mikemccandless.com
>>>>>>>>>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>