lucene-java-user mailing list archives

From Michael McCandless <luc...@mikemccandless.com>
Subject Re: SynonymFilterFactory deprecated since 6.4.0
Date Tue, 14 Feb 2017 11:19:19 GMT
Hi Bernd,

Actually, pos (which is just the accumulation of
PositionIncrementAttribute, starting with -1) is the *start* node.

The end node is then pos + PositionLengthAttribute.
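
A minimal sketch of that arithmetic, in plain Java with no Lucene classes involved. The Token and Arc types and the method names are hypothetical, made up for illustration. The position increments are an assumption, inferred from the "pos:" ranges in the test output quoted further down in this thread; the position lengths match the SGF table Bernd posted.

```java
import java.util.ArrayList;
import java.util.List;

public class TokenGraphNodes {

    /** One token as a filter emits it (hypothetical holder, not a Lucene class). */
    public static final class Token {
        public final String term;
        public final int posInc; // PositionIncrementAttribute value
        public final int posLen; // PositionLengthAttribute value

        public Token(String term, int posInc, int posLen) {
            this.term = term;
            this.posInc = posInc;
            this.posLen = posLen;
        }
    }

    /** One arc of the token graph, from start node to end node. */
    public static final class Arc {
        public final String term;
        public final int from;
        public final int to;

        public Arc(String term, int from, int to) {
            this.term = term;
            this.from = from;
            this.to = to;
        }

        @Override
        public String toString() {
            return term + ": " + from + "-" + to;
        }
    }

    /** pos accumulates the increments starting from -1; end node = pos + posLen. */
    public static List<Arc> toArcs(List<Token> tokens) {
        List<Arc> arcs = new ArrayList<>();
        int pos = -1;
        for (Token t : tokens) {
            pos += t.posInc;
            arcs.add(new Arc(t.term, pos, pos + t.posLen));
        }
        return arcs;
    }

    /** Tokens for "natural forest"; the increments are inferred, not taken from Lucene. */
    public static List<Token> sampleTokens() {
        List<Token> tokens = new ArrayList<>();
        tokens.add(new Token("naturwald", 1, 4));
        tokens.add(new Token("for\u00eat", 0, 1));        // forêt
        tokens.add(new Token("nat\u00fcrlicher", 0, 2));  // natürlicher
        tokens.add(new Token("natural", 0, 3));
        tokens.add(new Token("naturelle", 1, 3));
        tokens.add(new Token("wald", 1, 2));
        tokens.add(new Token("forest", 1, 1));
        return tokens;
    }

    public static void main(String[] args) {
        for (Arc arc : toArcs(sampleTokens())) {
            System.out.println(arc);
        }
    }
}
```

With these inputs the arcs come out as naturwald 0-4, forêt 0-1, natürlicher 0-2, natural 0-3, naturelle 1-4, wald 2-4 and forest 3-4, which is the same set of node IDs as in the test output.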

As far as I know, ShingleFilter is not yet graph friendly: it does not
set PositionLengthAttribute.  But you could visualize how it should be
setting it ...
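
One way to visualize it, as a rough sketch: the plain-Java snippet here (no Lucene classes; the class and method names are hypothetical) prints the "natural forest" tokens as a Graphviz DOT graph. It assumes a graph-aware ShingleFilter would emit the shingle with position increment 0 and position length 2, which is not what ShingleFilter does today.

```java
public class ShingleGraphSketch {

    /** Renders tokens as DOT arcs; pos accumulates increments starting from -1. */
    public static String toDot(String[] terms, int[] posIncs, int[] posLens) {
        StringBuilder sb = new StringBuilder("digraph tokens {\n");
        int pos = -1;
        for (int i = 0; i < terms.length; i++) {
            pos += posIncs[i];            // start node of this token
            int end = pos + posLens[i];   // end node of this token
            sb.append("  ").append(pos).append(" -> ").append(end)
              .append(" [label=\"").append(terms[i]).append("\"];\n");
        }
        return sb.append("}\n").toString();
    }

    /** "natural forest" with the shingle as a side path (assumed attribute values). */
    public static String sample() {
        String[] terms = {"natural", "natural forest", "forest"};
        int[] posIncs = {1, 0, 1};
        int[] posLens = {1, 2, 1}; // 2 for the shingle is the assumed graph-aware value
        return toDot(terms, posIncs, posLens);
    }

    public static void main(String[] args) {
        System.out.print(sample());
    }
}
```

Under those assumptions the shingle becomes a single arc from node 0 to node 2, running alongside natural (0 to 1) and forest (1 to 2).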

Also, note that synonym filter cannot handle an incoming graph
properly, so if you run ShingleFilter before it, it's not going to do
the right thing.  For that we need something like
https://issues.apache.org/jira/browse/LUCENE-5012 ... the branch for
that issue already has a SynonymFilter that accepts incoming graphs,
but it's a biggish change.

The docs are indeed outdated; I'll repair them.  Thank you!

Mike McCandless

http://blog.mikemccandless.com


On Tue, Feb 14, 2017 at 2:41 AM, Bernd Fehling
<bernd.fehling@uni-bielefeld.de> wrote:
> Now it is getting more clear.
>
> "pos" (aka position) starts at "-1" and its highest number is the
> last "node id" of the graph.
>
> "pos" minus "positionLength" is the starting "node id" of the arc.
>
> Is the tokenStream after each filter always a valid graph?
>
> E.g. ShingleFilter with query "natural forest":
> SF        text start  end  positionLength  type    position
>        natural 0      7    1               word    1
> natural forest 0      14   2               shingle 1
>         forest 8      14   1               word    2
>
> (0)--- natural --->(1)--- forest --->(2)
> But how to insert the shingle into this graph?
>
> This is why I added a SynonymPreFilter to correct the graph between
> ShingleFilter and SynonymGraphFilter. But I had the wrong understanding
> of pos, positionIncrement, positionLength,...
>
>
> Another question, the API docs say "...Injecting synonyms – here,
> synonyms of a token should be added after that token..."
> But as I already mentioned the synonyms are added before the token.
> Are the docs outdated?
>
>
> Regards
> Bernd
>
>
> On 13.02.2017 at 17:31, Michael McCandless wrote:
>> On Mon, Feb 13, 2017 at 9:04 AM, Bernd Fehling
>> <bernd.fehling@uni-bielefeld.de> wrote:
>>
>>> Am I confused by the naming of pos, positionIncrement, offset, positionLength,
>>> start and end between Lucene and Solr?
>>
>> "pos" is just accumulating the positionIncrement values, starting from
>> -1.  I don't think Solr's analysis UI would change the meaning of
>> these attributes.
>>
>>> OK, the SynonymGraphFilter is ONLY for Lucene, right?
>>
>> No, it's also for Solr and Elasticsearch and any other search servers
>> on top of Lucene as well.
>>
>>> But how are you going to build the multi-word synonym query "natürlicher wald"
>>> from "natural forest"?
>>
>> Lucene's and Elasticsearch's query parsers have already been fixed to
>> correctly handle token graphs by default; Solr has a fork of Lucene's
>> query parser I think ... I'm not sure if it's been fixed yet to
>> interpret graphs.
>>
>> See e.g. https://issues.apache.org/jira/browse/LUCENE-7603 and
>> https://issues.apache.org/jira/browse/LUCENE-7638
>>
>>> And how are you going to highlight a synonym hit for "natürlicher wald"
>>> when start and end is set to 0-14 and not to 0-18?
>>> Or is start and end not used for highlighting?
>>
>> This start/end offset, at query time, is not normally used.  If you
>> have a document in the index that has "natürlicher wald" then it would
>> have offsets X to X+18, stored in the index ideally as postings
>> offsets, and should highlight correctly?
>>
>> Mike McCandless
>>
>> http://blog.mikemccandless.com
>>
>>> On 13.02.2017 at 14:24, Michael McCandless wrote:
>>>> Unfortunately, I cannot reproduce the problem with a straight Lucene
>>>> test case.  I added this test case to TestSynonymGraphFilter.java:
>>>>
>>>>     https://gist.github.com/mikemccand/318459ca507742052688e2fe800a10dd
>>>>
>>>> And when I run it, it produces the correct token graph:
>>>>
>>>> TOKEN: naturwald
>>>>   offset: 0-14
>>>>   pos: 0-4
>>>>   type: SYNONYM
>>>>
>>>> TOKEN: forêt
>>>>   offset: 0-14
>>>>   pos: 0-1
>>>>   type: SYNONYM
>>>>
>>>> TOKEN: natürlicher
>>>>   offset: 0-14
>>>>   pos: 0-2
>>>>   type: SYNONYM
>>>>
>>>> TOKEN: natural
>>>>   offset: 0-7
>>>>   pos: 0-3
>>>>   type: word
>>>>
>>>> TOKEN: naturelle
>>>>   offset: 0-14
>>>>   pos: 1-4
>>>>   type: SYNONYM
>>>>
>>>> TOKEN: wald
>>>>   offset: 0-14
>>>>   pos: 2-4
>>>>   type: SYNONYM
>>>>
>>>> TOKEN: forest
>>>>   offset: 8-14
>>>>   pos: 3-4
>>>>   type: word
>>>>
>>>> Remember that the "pos: " output above is really "node IDs" and you
>>>> can see the inserted side paths are correct.  The offsets are
>>>> necessarily always 0-14 for inserted tokens because that is the span
>>>> of the two original tokens.
>>>>
>>>> Can you try removing the SPF filters in your test?  Or otherwise
>>>> simplify your test so it's closer to what my test case is doing?
>>>>
>>>> Mike McCandless
>>>>
>>>> http://blog.mikemccandless.com
>>>>
>>>> On Mon, Feb 13, 2017 at 7:52 AM, Michael McCandless
>>>> <lucene@mikemccandless.com> wrote:
>>>>> Thanks Bernd; I'll see if I can make a test case from this.
>>>>>
>>>>> Mike McCandless
>>>>>
>>>>> http://blog.mikemccandless.com
>>>>>
>>>>>
>>>>> On Mon, Feb 13, 2017 at 5:00 AM, Bernd Fehling
>>>>> <bernd.fehling@uni-bielefeld.de> wrote:
>>>>>> My very simple and small sysonym_test.txt has only one line:
>>>>>> naturwald, natural\ forest, forêt\ naturelle, natürlicher\ wald
>>>>>>
>>>>>> If I only use WT (WhitespaceTokenizer) and SGF (with WhitespaceTokenizer)
>>>>>> the result is:
>>>>>>
>>>>>> WT      text     start  end  positionLength  type  position
>>>>>>      natural     0      7    1               word  1
>>>>>>       forest     8      14   1               word  2
>>>>>>
>>>>>> SGF     text     start  end  positionLength  type     position
>>>>>>      natural     0      7    3               word     1
>>>>>>    naturelle     0      14   3               SYNONYM  2
>>>>>>         wald     0      14   2               SYNONYM  3
>>>>>>    naturwald     0      14   4               SYNONYM  1
>>>>>>        forêt     0      14   1               SYNONYM  1
>>>>>>  natürlicher     0      14   2               SYNONYM  1
>>>>>>
>>>>>>       forest     8      14   1               word     4
>>>>>>
>>>>>> The result is some kind of rubbish.
>>>>>> Also note the empty line between "natürlicher" and "forest".
>>>>>>
>>>>>> Anything else I should try, maybe with KeywordTokenizer?
>>>>>>
>>>>>> p.s. You might have noticed the SPF filters in my setup.
>>>>>>      First is SynonymPreFilter to set all attributes to the right value,
>>>>>>      second is SynonymPostFilter to again fix all attribute settings but
>>>>>>      also set multi-word synonyms as phrase and also clean up the result
>>>>>>      of SGF.
>>>>>>
>>>>>> Regards
>>>>>> Bernd
>>>>>>
>>>>>> On 11.02.2017 at 00:45, Michael McCandless wrote:
>>>>>>> Yeah, those tokens should have position length 2.
>>>>>>>
>>>>>>> Can you reduce to a small set of synonyms and text?  If you use only
>>>>>>> whitespace tokenizer and SGF does the issue reproduce?
>>>>>>>
>>>>>>> Mike McCandless
>>>>>>>
>>>>>>> http://blog.mikemccandless.com
>>>>>>>
>>>>>>>
>>>>>>> On Fri, Feb 10, 2017 at 10:07 AM, Bernd Fehling
>>>>>>> <bernd.fehling@uni-bielefeld.de> wrote:
>>>>>>>> Example for position end and positionLength of SGF.
>>>>>>>>
>>>>>>>> query: natural forest
>>>>>>>>
>>>>>>>> WT      text     start  end  positionLength  type  position
>>>>>>>>         natural  0      7    1               word  1
>>>>>>>>         forest   8      14   1               word  2
>>>>>>>> ...
>>>>>>>>
>>>>>>>> SPF     text     start  end  positionLength  type     position
>>>>>>>>         natural  0      7    1               word     1
>>>>>>>>  natural forest  0      14   2               shingle  2
>>>>>>>>         forest   8      14   1               word     3
>>>>>>>>
>>>>>>>> SGF     text     start  end  positionLength  type     position
>>>>>>>>         natural  0      7    1               word     1
>>>>>>>>       naturwald  0      14   1               SYNONYM  2
>>>>>>>> forêt naturelle  0      14   1               SYNONYM  2
>>>>>>>> natürlicher wald 0      14   1               SYNONYM  2
>>>>>>>>  natural forest  0      14   1               shingle  2
>>>>>>>>          forest  8      14   1               word     3
>>>>>>>>
>>>>>>>> SPF     text                start  end  positionLength  type     position
>>>>>>>>         natural             0      7    1               word     1
>>>>>>>>         naturwald           0      9    1               SYNONYM  2
>>>>>>>>         "forêt naturelle"   0      17   2               SYNONYM  2
>>>>>>>>         "natürlicher wald"  0      18   2               SYNONYM  2
>>>>>>>>         "natural forest"    0      16   2               shingle  2
>>>>>>>>         forest              8      14   1               word     3
>>>>>>>>
>>>>>>>>
>>>>>>>> SGF (SynonymGraphFilter) has the same position end and positionLength
>>>>>>>> for all SYNONYMs.
>>>>>>>> I suppose that it is not correct?
>>>>>>>>
>>>>>>>> Regards
>>>>>>>> Bernd
>>>>>>>>
>>>>>>>>
>>>>>>>> On 09.02.2017 at 18:39, Michael McCandless wrote:
>>>>>>>>> On Thu, Feb 9, 2017 at 2:40 AM, Bernd Fehling
>>>>>>>>> <bernd.fehling@uni-bielefeld.de> wrote:
>>>>>>>>>> I tried SynonymGraphFilter with my setup and it works right away.
>>>>>>>>>> It paid off that I did some modifications on my filters while
>>>>>>>>>> testing 6.3 with my setup.
>>>>>>>>>
>>>>>>>>> Good!
>>>>>>>>>
>>>>>>>>>> I only replaced SynonymFilter with SynonymGraphFilter and did not
>>>>>>>>>> use FlattenGraphFilter, pretty simple. So I can confirm that, up
>>>>>>>>>> to this point, SynonymGraphFilter is a full replacement for
>>>>>>>>>> SynonymFilter. At least for search-time synonym handling.
>>>>>>>>>>
>>>>>>>>>> But this also means there is still some work with the attributes, right?
>>>>>>>>>> Position looks good, type and start are no problem anyway, but
>>>>>>>>>> the end position is still wrong and the positionLength for multi-word
>>>>>>>>>> synonyms.
>>>>>>>>>
>>>>>>>>> Can you give an example or make a small test case?
>>>>>>>>> PositionLengthAttribute is supposed to be correct coming out of
>>>>>>>>> SynonymGraphFilter.
>>>>>>>>>
>>>>>>>>>> One thing I noticed was that the originating token which "produces"
>>>>>>>>>> synonyms comes out last from SynonymGraphFilter, after the
>>>>>>>>>> "produced" synonyms.
>>>>>>>>>> I will have a look inside with debugger but I guess this is due
>>>>>>>>>> to output buffering of SynonymGraphFilter?
>>>>>>>>>
>>>>>>>>> Yeah they do come out in a different order, which token filters are
>>>>>>>>> allowed to do in general for all tokens leaving from the same position
>>>>>>>>> ...
>>>>>>>>>
>>>>>>>>> Mike McCandless
>>>>>>>>>
>>>>>>>>> http://blog.mikemccandless.com
>>>>>>>>>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>


