lucene-java-user mailing list archives

From Bernd Fehling <bernd.fehl...@uni-bielefeld.de>
Subject Re: SynonymFilterFactory deprecated since 6.4.0
Date Tue, 14 Feb 2017 07:41:26 GMT
Now it is getting clearer.

"pos" (aka position) starts at "-1" and its highest number is the
last "node id" of the graph.

"pos" minus "positionLength" is the starting "node id" of the arc.

Is the tokenStream after each filter always a valid graph?

E.g. ShingleFilter with query "natural forest":
SF        text start  end  positionLength  type    position
       natural 0      7    1               word    1
natural forest 0      14   2               shingle 1	
        forest 8      14   1               word    2

(0)--- natural --->(1)--- forest --->(2)
But how to insert the shingle into this graph?
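
A minimal sketch of one way to read the ShingleFilter table above as a graph. It follows the rule quoted further down in this thread that "pos" accumulates the positionIncrement values starting from -1. It also assumes that each token is an arc leaving node pos and arriving at node pos + positionLength. The class `TokenGraphArcs` is a hypothetical helper, not part of Lucene, and the increments 1, 0, 1 are inferred from the displayed positions 1, 1, 2.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of Lucene): turns token attributes into graph arcs. */
class TokenGraphArcs {

    /**
     * "pos" accumulates positionIncrement starting from -1. Each token is then
     * drawn as an arc from node pos to node pos + positionLength.
     */
    static List<String> arcs(String[] terms, int[] posIncs, int[] posLens) {
        List<String> result = new ArrayList<>();
        int pos = -1;
        for (int i = 0; i < terms.length; i++) {
            pos += posIncs[i];
            int from = pos;
            int to = pos + posLens[i];
            result.add("(" + from + ")--- " + terms[i] + " --->(" + to + ")");
        }
        return result;
    }

    public static void main(String[] args) {
        // Values read off the ShingleFilter table; the increments are an
        // assumption inferred from the displayed positions 1, 1, 2.
        String[] terms = {"natural", "natural forest", "forest"};
        int[] posIncs = {1, 0, 1};
        int[] posLens = {1, 2, 1};
        for (String arc : arcs(terms, posIncs, posLens)) {
            System.out.println(arc);
        }
    }
}
```

Under these assumptions the shingle becomes a side arc (0)--- natural forest --->(2) that runs parallel to the two single-word arcs, so no extra node is needed.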

This is why I added a SynonymPreFilter to correct the graph between
ShingleFilter and SynonymGraphFilter. But I had the wrong understanding
of pos, positionIncrement, positionLength,...


Another question, the API docs say "...Injecting synonyms – here,
synonyms of a token should be added after that token..."
But as I already mentioned, the synonyms are added before the token.
Are the docs outdated?


Regards
Bernd


Am 13.02.2017 um 17:31 schrieb Michael McCandless:
> On Mon, Feb 13, 2017 at 9:04 AM, Bernd Fehling
> <bernd.fehling@uni-bielefeld.de> wrote:
> 
>> Am I confused by the naming of pos, positionIncrement, offset, positionLength,
>> start and end between Lucene and Solr?
> 
> "pos" is just accumulating the positionIncrement values, starting from
> -1.  I don't think Solr's analysis UI would change the meaning of
> these attributes.
> 
>> OK, the SynonymGraphFilter is ONLY for Lucene, right?
> 
> No, it's also for Solr and Elasticsearch and any other search servers
> on top of Lucene as well.
> 
>> But how are you going to build the multi-word synonym query "natürlicher wald"
>> from "natural forest"?
> 
> Lucene's and Elasticsearch's query parsers have already been fixed to
> correctly handle token graphs by default; Solr has a fork of Lucene's
> query parser I think ... I'm not sure if it's been fixed yet to
> interpret graphs.
> 
> See e.g. https://issues.apache.org/jira/browse/LUCENE-7603 and
> https://issues.apache.org/jira/browse/LUCENE-7638
> 
>> And how are you going to highlight a synonym hit for "natürlicher wald"
>> when start and end is set to 0-14 and not to 0-18?
>> Or is start and end not used for highlighting?
> 
> This start/end offset, at query time, is not normally used.  If you
> have a document in the index that has "natürlicher wald" then it would
> have offsets X to X+18, stored in the index ideally as postings
> offsets, and should highlight correctly?
> 
> Mike McCandless
> 
> http://blog.mikemccandless.com
> 
>> Am 13.02.2017 um 14:24 schrieb Michael McCandless:
>>> Unfortunately, I cannot reproduce the problem with a straight Lucene
>>> test case.  I added this test case to TestSynonymGraphFilter.java:
>>>
>>>     https://gist.github.com/mikemccand/318459ca507742052688e2fe800a10dd
>>>
>>> And when I run it, it produces the correct token graph:
>>>
>>> TOKEN: naturwald
>>>   offset: 0-14
>>>   pos: 0-4
>>>   type: SYNONYM
>>>
>>> TOKEN: forêt
>>>   offset: 0-14
>>>   pos: 0-1
>>>   type: SYNONYM
>>>
>>> TOKEN: natürlicher
>>>   offset: 0-14
>>>   pos: 0-2
>>>   type: SYNONYM
>>>
>>> TOKEN: natural
>>>   offset: 0-7
>>>   pos: 0-3
>>>   type: word
>>>
>>> TOKEN: naturelle
>>>   offset: 0-14
>>>   pos: 1-4
>>>   type: SYNONYM
>>>
>>> TOKEN: wald
>>>   offset: 0-14
>>>   pos: 2-4
>>>   type: SYNONYM
>>>
>>> TOKEN: forest
>>>   offset: 8-14
>>>   pos: 3-4
>>>   type: word
>>>
>>> Remember that the "pos: " output above is really "node IDs" and you
>>> can see the inserted side paths are correct.  The offsets are
>>> necessarily always 0-14 for inserted tokens because that is the span
>>> of the two original tokens.
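
To illustrate why the side paths in the output above are correct, here is a minimal sketch that treats each "pos: from-to" pair as an arc between node IDs and lists every path from node 0 to node 4. The class `TokenGraphPaths` and its methods are hypothetical helpers, not part of Lucene, and the sketch assumes every arc points forward (to > from), so the walk always terminates.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper (not part of Lucene): lists every path through a token graph. */
class TokenGraphPaths {

    /** One token, seen as an arc between two node IDs. */
    static final class Arc {
        final String term;
        final int from;
        final int to;

        Arc(String term, int from, int to) {
            this.term = term;
            this.from = from;
            this.to = to;
        }
    }

    /** The seven tokens and their "pos" ranges, copied from the test output. */
    static List<Arc> exampleGraph() {
        return Arrays.asList(
            new Arc("naturwald", 0, 4),
            new Arc("forêt", 0, 1),
            new Arc("natürlicher", 0, 2),
            new Arc("natural", 0, 3),
            new Arc("naturelle", 1, 4),
            new Arc("wald", 2, 4),
            new Arc("forest", 3, 4));
    }

    /** Returns the terms along every path from start to end, one string per path. */
    static List<String> allPaths(List<Arc> arcs, int start, int end) {
        List<String> paths = new ArrayList<>();
        walk(arcs, start, end, "", paths);
        return paths;
    }

    // Depth-first walk; assumes arcs only point forward, so there are no cycles.
    private static void walk(List<Arc> arcs, int node, int end,
                             String prefix, List<String> paths) {
        if (node == end) {
            paths.add(prefix.trim());
            return;
        }
        for (Arc arc : arcs) {
            if (arc.from == node) {
                walk(arcs, arc.to, end, prefix + " " + arc.term, paths);
            }
        }
    }

    public static void main(String[] args) {
        for (String path : allPaths(exampleGraph(), 0, 4)) {
            System.out.println(path);
        }
    }
}
```

The four paths it prints are "naturwald", "forêt naturelle", "natürlicher wald" and "natural forest", which are exactly the four entries of the one-line synonym file quoted below.
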
>>>
>>> Can you try removing the SPF filters in your test?  Or otherwise
>>> simplify your test so it's closer to what my test case is doing?
>>>
>>> Mike McCandless
>>>
>>> http://blog.mikemccandless.com
>>>
>>> On Mon, Feb 13, 2017 at 7:52 AM, Michael McCandless
>>> <lucene@mikemccandless.com> wrote:
>>>> Thanks Bernd; I'll see if I can make a test case from this.
>>>>
>>>> Mike McCandless
>>>>
>>>> http://blog.mikemccandless.com
>>>>
>>>>
>>>> On Mon, Feb 13, 2017 at 5:00 AM, Bernd Fehling
>>>> <bernd.fehling@uni-bielefeld.de> wrote:
>>>>> My very simple and small sysonym_test.txt has only one line:
>>>>> naturwald, natural\ forest, forêt\ naturelle, natürlicher\ wald
>>>>>
>>>>> If I only use WT (WhitespaceTokenizer) and SGF (with WhitespaceTokenizer)
>>>>> the result is:
>>>>>
>>>>> WT      text     start  end  positionLength  type  position
>>>>>      natural     0      7    1               word  1
>>>>>       forest     8      14   1               word  2
>>>>>
>>>>> SGF     text     start  end  positionLength  type     position
>>>>>      natural     0      7    3               word     1
>>>>>    naturelle     0      14   3               SYNONYM  2
>>>>>         wald     0      14   2               SYNONYM  3
>>>>>    naturwald     0      14   4               SYNONYM  1
>>>>>        forêt     0      14   1               SYNONYM  1
>>>>>  natürlicher     0      14   2               SYNONYM  1
>>>>>
>>>>>       forest     8      14   1               word     4
>>>>>
>>>>> The result is some kind of rubbish.
>>>>> Also note the empty line between "natürlicher" and "forest".
>>>>>
>>>>> Anything else I should try, maybe with KeywordTokenizer?
>>>>>
>>>>> p.s. You might have noticed the SPF filters in my setup.
>>>>>      The first is SynonymPreFilter, which sets all attributes to the right
>>>>>      values; the second is SynonymPostFilter, which fixes the attribute
>>>>>      settings again, sets multi-word synonyms as phrases and also cleans up
>>>>>      the result of SGF.
>>>>>
>>>>> Regards
>>>>> Bernd
>>>>>
>>>>> Am 11.02.2017 um 00:45 schrieb Michael McCandless:
>>>>>> Yeah, those tokens should have position length 2.
>>>>>>
>>>>>> Can you reduce to a small set of synonyms and text?  If you use only
>>>>>> whitespace tokenizer and SGF does the issue reproduce?
>>>>>>
>>>>>> Mike McCandless
>>>>>>
>>>>>> http://blog.mikemccandless.com
>>>>>>
>>>>>>
>>>>>> On Fri, Feb 10, 2017 at 10:07 AM, Bernd Fehling
>>>>>> <bernd.fehling@uni-bielefeld.de> wrote:
>>>>>>> Example for position end and positionLength of SGF.
>>>>>>>
>>>>>>> query: natural forest
>>>>>>>
>>>>>>> WT      text     start  end  positionLength  type  position
>>>>>>>         natural  0      7    1               word  1
>>>>>>>         forest   8      14   1               word  2
>>>>>>> ...
>>>>>>>
>>>>>>> SPF     text     start  end  positionLength  type     position
>>>>>>>         natural  0      7    1               word     1
>>>>>>>  natural forest  0      14   2               shingle  2
>>>>>>>         forest   8      14   1               word     3
>>>>>>>
>>>>>>> SGF     text     start  end  positionLength  type     position
>>>>>>>         natural  0      7    1               word     1
>>>>>>>       naturwald  0      14   1               SYNONYM  2
>>>>>>> forêt naturelle  0      14   1               SYNONYM  2
>>>>>>> natürlicher wald 0      14   1               SYNONYM  2
>>>>>>>  natural forest  0      14   1               shingle  2
>>>>>>>          forest  8      14   1               word     3
>>>>>>>
>>>>>>> SPF     text     start  end  positionLength  type     position
>>>>>>>         natural  0      7    1               word     1
>>>>>>>       naturwald  0      9    1               SYNONYM  2
>>>>>>> "forêt naturelle"  0    17   2               SYNONYM  2
>>>>>>> "natürlicher wald" 0    18   2               SYNONYM  2
>>>>>>> "natural forest" 0      16   2               shingle  2
>>>>>>>          forest  8      14   1               word     3
>>>>>>>
>>>>>>>
>>>>>>> SGF (SynonymGraphFilter) gives all SYNONYM tokens the same end
>>>>>>> and positionLength.
>>>>>>> I suppose that is not correct?
>>>>>>>
>>>>>>> Regards
>>>>>>> Bernd
>>>>>>>
>>>>>>>
>>>>>>> Am 09.02.2017 um 18:39 schrieb Michael McCandless:
>>>>>>>> On Thu, Feb 9, 2017 at 2:40 AM, Bernd Fehling
>>>>>>>> <bernd.fehling@uni-bielefeld.de> wrote:
>>>>>>>>> I tried SynonymGraphFilter with my setup and it worked right away.
>>>>>>>>> It paid off that I made some modifications to my filters while
>>>>>>>>> testing 6.3 with my setup.
>>>>>>>>
>>>>>>>> Good!
>>>>>>>>
>>>>>>>>> I only replaced SynonymFilter with SynonymGraphFilter and did not
>>>>>>>>> use FlattenGraphFilter, pretty simple. So I can confirm that, up
>>>>>>>>> to this point, SynonymGraphFilter is a full replacement for
>>>>>>>>> SynonymFilter. At least for search-time synonym handling.
>>>>>>>>>
>>>>>>>>> But this also means there is still some work to do on the attributes, right?
>>>>>>>>> Position looks good, type and start are no problem anyway, but
>>>>>>>>> the end position and the positionLength are still wrong for multi-word
>>>>>>>>> synonyms.
>>>>>>>>
>>>>>>>> Can you give an example or make a small test case?
>>>>>>>> PositionLengthAttribute is supposed to be correct coming out of
>>>>>>>> SynonymGraphFilter.
>>>>>>>>
>>>>>>>>> One thing I noticed was that the originating token which "produces"
>>>>>>>>> synonyms comes out last from SynonymGraphFilter, after the
>>>>>>>>> "produced" synonyms.
>>>>>>>>> I will have a look inside with debugger but I guess this is due
>>>>>>>>> to output buffering of SynonymGraphFilter?
>>>>>>>>
>>>>>>>> Yeah they do come out in a different order, which token filters are
>>>>>>>> allowed to do in general for all tokens leaving from the same position
>>>>>>>> ...
>>>>>>>>
>>>>>>>> Mike McCandless
>>>>>>>>
>>>>>>>> http://blog.mikemccandless.com
>>>>>>>>
