lucene-java-user mailing list archives

From Bernd Fehling <bernd.fehl...@uni-bielefeld.de>
Subject Re: SynonymFilterFactory deprecated since 6.4.0
Date Mon, 13 Feb 2017 14:04:43 GMT
After drawing the graph I must admit it looks correct, including all values.

Am I confused by the naming of pos, positionIncrement, offset, positionLength,
start and end between Lucene and Solr?
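
One way to line the two notations up, as a minimal sketch only: assuming the
"position" column of the Solr analysis table is 1-based and the "pos: from-to"
lines of the Lucene test are node IDs, each table row becomes an arc from
position - 1 to position - 1 + positionLength. The class and method names are
made up for illustration; no Lucene classes are involved.

```java
public class TokenGraphNodes {

    /**
     * Converts one row of a Solr-analysis-style table into "text: from-to".
     * Assumption: position is 1-based, so node IDs start at 0.
     */
    public static String toArc(String text, int position, int positionLength) {
        int from = position - 1;          // node the token leaves from
        int to = from + positionLength;   // node the token arrives at
        return text + ": " + from + "-" + to;
    }

    public static void main(String[] args) {
        // ASCII rows of the SGF table from the whitespace-only test;
        // the two rows with non-ASCII text convert the same way.
        System.out.println(toArc("natural", 1, 3));
        System.out.println(toArc("naturelle", 2, 3));
        System.out.println(toArc("wald", 3, 2));
        System.out.println(toArc("naturwald", 1, 4));
        System.out.println(toArc("forest", 4, 1));
    }
}
```

For those rows this gives natural 0-3, naturelle 1-4, wald 2-4, naturwald 0-4
and forest 3-4, which matches the Lucene test output quoted below.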

OK, the SynonymGraphFilter is ONLY for Lucene, right?

But how are you going to build the multi-word synonym query "natürlicher wald"
from "natural forest"?

And how are you going to highlight a synonym hit for "natürlicher wald"
when start and end are set to 0-14 and not to 0-18?
Or are start and end not used for highlighting?
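
To make the offset part of the question concrete, here is a small sketch in
plain Java. It assumes that start and end are character offsets into the
original input "natural forest", as described in the quoted reply below. The
class and helper names are hypothetical and nothing here calls a Lucene or
Solr highlighter.

```java
public class OffsetSpans {

    /** Returns the slice of the original text that start/end point at. */
    public static String covered(String original, int start, int end) {
        return original.substring(start, end);
    }

    /** True if the offsets stay inside the original text. */
    public static boolean fitsInput(String original, int start, int end) {
        return start >= 0 && start <= end && end <= original.length();
    }

    public static void main(String[] args) {
        String query = "natural forest";            // 14 characters
        System.out.println(covered(query, 0, 7));   // the first original token
        System.out.println(covered(query, 8, 14));  // the second original token
        System.out.println(covered(query, 0, 14));  // span shared by inserted synonyms
        System.out.println(fitsInput(query, 0, 18)); // 18 is past the end of the input
    }
}
```

Under that assumption 0-14 selects the whole of "natural forest", while 0-18
is the length of the quoted synonym text and points past the end of the input.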

Regards
Bernd

Am 13.02.2017 um 14:24 schrieb Michael McCandless:
> Unfortunately, I cannot reproduce the problem with a straight Lucene
> test case.  I added this test case to TestSynonymGraphFilter.java:
> 
>     https://gist.github.com/mikemccand/318459ca507742052688e2fe800a10dd
> 
> And when I run it, it produces the correct token graph:
> 
> TOKEN: naturwald
>   offset: 0-14
>   pos: 0-4
>   type: SYNONYM
> 
> TOKEN: forêt
>   offset: 0-14
>   pos: 0-1
>   type: SYNONYM
> 
> TOKEN: natürlicher
>   offset: 0-14
>   pos: 0-2
>   type: SYNONYM
> 
> TOKEN: natural
>   offset: 0-7
>   pos: 0-3
>   type: word
> 
> TOKEN: naturelle
>   offset: 0-14
>   pos: 1-4
>   type: SYNONYM
> 
> TOKEN: wald
>   offset: 0-14
>   pos: 2-4
>   type: SYNONYM
> 
> TOKEN: forest
>   offset: 8-14
>   pos: 3-4
>   type: word
> 
> Remember that the "pos: " output above is really "node IDs" and you
> can see the inserted side paths are correct.  The offsets are
> necessarily always 0-14 for inserted tokens because that is the span
> of the two original tokens.
> 
> Can you try removing the SPF filters in your test?  Or otherwise
> simplify your test so it's closer to what my test case is doing?
> 
> Mike McCandless
> 
> http://blog.mikemccandless.com
> 
> On Mon, Feb 13, 2017 at 7:52 AM, Michael McCandless
> <lucene@mikemccandless.com> wrote:
>> Thanks Bernd; I'll see if I can make a test case from this.
>>
>> Mike McCandless
>>
>> http://blog.mikemccandless.com
>>
>>
>> On Mon, Feb 13, 2017 at 5:00 AM, Bernd Fehling
>> <bernd.fehling@uni-bielefeld.de> wrote:
>>> My very simple and small sysonym_test.txt has only one line:
>>> naturwald, natural\ forest, forêt\ naturelle, natürlicher\ wald
>>>
>>> If I only use WT (WhitespaceTokenizer) and SGF (with WhitespaceTokenizer)
>>> the result is:
>>>
>>> WT      text     start  end  positionLength  type  position
>>>      natural     0      7    1               word  1
>>>       forest     8      14   1               word  2
>>>
>>> SGF     text     start  end  positionLength  type     position
>>>      natural     0      7    3               word     1
>>>    naturelle     0      14   3               SYNONYM  2
>>>         wald     0      14   2               SYNONYM  3
>>>    naturwald     0      14   4               SYNONYM  1
>>>        forêt     0      14   1               SYNONYM  1
>>>  natürlicher     0      14   2               SYNONYM  1
>>>
>>>       forest     8      14   1               word     4
>>>
>>> The result is some kind of rubbish.
>>> Also note the empty line between "natürlicher" and "forest".
>>>
>>> Anything else I should try, maybe with KeywordTokenizer?
>>>
>>> p.s. You might have noticed the SPF filters in my setup.
>>>      First is SynonymPreFilter to set all attributes to the right value,
>>>      second is SynonymPostFilter to again fix all attribute settings but
>>>      also set multi-word synonyms as phrase and also cleanup the result
>>>      of SGF.
>>>
>>> Regards
>>> Bernd
>>>
>>> Am 11.02.2017 um 00:45 schrieb Michael McCandless:
>>>> Yeah, those tokens should have position length 2.
>>>>
>>>> Can you reduce to a small set of synonyms and text?  If you use only
>>>> whitespace tokenizer and SGF does the issue reproduce?
>>>>
>>>> Mike McCandless
>>>>
>>>> http://blog.mikemccandless.com
>>>>
>>>>
>>>> On Fri, Feb 10, 2017 at 10:07 AM, Bernd Fehling
>>>> <bernd.fehling@uni-bielefeld.de> wrote:
>>>>> Example for position end and positionLength of SGF.
>>>>>
>>>>> query: natural forest
>>>>>
>>>>> WT      text     start  end  positionLength  type  position
>>>>>         natural  0      7    1               word  1
>>>>>         forest   8      14   1               word  2
>>>>> ...
>>>>>
>>>>> SPF     text     start  end  positionLength  type     position
>>>>>         natural  0      7    1               word     1
>>>>>  natural forest  0      14   2               shingle  2
>>>>>         forest   8      14   1               word     3
>>>>>
>>>>> SGF     text     start  end  positionLength  type     position
>>>>>         natural  0      7    1               word     1
>>>>>       naturwald  0      14   1               SYNONYM  2
>>>>> forêt naturelle  0      14   1               SYNONYM  2
>>>>> natürlicher wald 0      14   1               SYNONYM  2
>>>>>  natural forest  0      14   1               shingle  2
>>>>>          forest  8      14   1               word     3
>>>>>
>>>>> SPF     text               start  end  positionLength  type     position
>>>>>         natural            0      7    1               word     1
>>>>>         naturwald          0      9    1               SYNONYM  2
>>>>>         "forêt naturelle"  0      17   2               SYNONYM  2
>>>>>         "natürlicher wald" 0      18   2               SYNONYM  2
>>>>>         "natural forest"   0      16   2               shingle  2
>>>>>         forest             8      14   1               word     3
>>>>>
>>>>>
>>>>> SGF (SynonymGraphFilter) has the same end position and positionLength
>>>>> for all SYNONYMs.
>>>>> I suppose that is not correct?
>>>>>
>>>>> Regards
>>>>> Bernd
>>>>>
>>>>>
>>>>> Am 09.02.2017 um 18:39 schrieb Michael McCandless:
>>>>>> On Thu, Feb 9, 2017 at 2:40 AM, Bernd Fehling
>>>>>> <bernd.fehling@uni-bielefeld.de> wrote:
>>>>>>> I tried SynonymGraphFilter with my setup and it works right away.
>>>>>>> It paid off that I did some modifications on my filters while
>>>>>>> testing 6.3 with my setup.
>>>>>>
>>>>>> Good!
>>>>>>
>>>>>>> I only replaced SynonymFilter with SynonymGraphFilter and did not
>>>>>>> use FlattenGraphFilter, pretty simple. So I can confirm that, up
>>>>>>> to this point, SynonymGraphFilter is a full replacement for
>>>>>>> SynonymFilter. At least for search-time synonym handling.
>>>>>>>
>>>>>>> But this also means there is still some work with the attributes,
>>>>>>> right? Position looks good, type and start are no problem anyway,
>>>>>>> but the end position and the positionLength for multi-word
>>>>>>> synonyms are still wrong.
>>>>>>
>>>>>> Can you give an example or make a small test case?
>>>>>> PositionLengthAttribute is supposed to be correct coming out of
>>>>>> SynonymGraphFilter.
>>>>>>
>>>>>>> One thing I noticed was that the originating token which "produces"
>>>>>>> synonyms comes out last from SynonymGraphFilter, after the
>>>>>>> "produced" synonyms.
>>>>>>> I will have a look inside with debugger but I guess this is due
>>>>>>> to output buffering of SynonymGraphFilter?
>>>>>>
>>>>>> Yeah they do come out in a different order, which token filters are
>>>>>> allowed to do in general for all tokens leaving from the same position
>>>>>> ...
>>>>>>
>>>>>> Mike McCandless
>>>>>>
>>>>>> http://blog.mikemccandless.com
>>>>>>


