lucene-java-user mailing list archives

From Michael McCandless <luc...@mikemccandless.com>
Subject Re: SynonymFilterFactory deprecated since 6.4.0
Date Mon, 13 Feb 2017 12:52:46 GMT
Thanks Bernd; I'll see if I can make a test case from this.

Mike McCandless

http://blog.mikemccandless.com


On Mon, Feb 13, 2017 at 5:00 AM, Bernd Fehling
<bernd.fehling@uni-bielefeld.de> wrote:
> My very simple and small sysonym_test.txt has only one line:
> naturwald, natural\ forest, forêt\ naturelle, natürlicher\ wald
>
> If I only use WT (WhitespaceTokenizer) and SGF (with WhitespaceTokenizer)
> the result is:
>
> WT      text     start  end  positionLength  type  position
>      natural     0      7    1               word  1
>       forest     8      14   1               word  2
>
> SGF     text     start  end  positionLength  type     position
>      natural     0      7    3               word     1
>    naturelle     0      14   3               SYNONYM  2
>         wald     0      14   2               SYNONYM  3
>    naturwald     0      14   4               SYNONYM  1
>        forêt     0      14   1               SYNONYM  1
>  natürlicher     0      14   2               SYNONYM  1
>
>       forest     8      14   1               word     4
>
> The result is some kind of rubbish.
> Also note the empty line between "natürlicher" and "forest".
>
> Anything else I should try, maybe with KeywordTokenizer?
>
> p.s. You might have noticed the SPF filters in my setup.
>      The first is SynonymPreFilter, which sets all attributes to the right values.
>      The second is SynonymPostFilter, which fixes the attribute settings again,
>      marks multi-word synonyms as phrases, and cleans up the output of SGF.
>
> Regards
> Bernd
>
> On 11.02.2017 at 00:45, Michael McCandless wrote:
>> Yeah, those tokens should have position length 2.
>>
>> Can you reduce to a small set of synonyms and text?  If you use only
>> whitespace tokenizer and SGF does the issue reproduce?
>>
>> Mike McCandless
>>
>> http://blog.mikemccandless.com
>>
>>
>> On Fri, Feb 10, 2017 at 10:07 AM, Bernd Fehling
>> <bernd.fehling@uni-bielefeld.de> wrote:
>>> Example for position end and positionLength of SGF.
>>>
>>> query: natural forest
>>>
>>> WT      text     start  end  positionLength  type  position
>>>         natural  0      7    1               word  1
>>>         forest   8      14   1               word  2
>>> ...
>>>
>>> SPF     text     start  end  positionLength  type     position
>>>         natural  0      7    1               word     1
>>>  natural forest  0      14   2               shingle  2
>>>         forest   8      14   1               word     3
>>>
>>> SGF     text     start  end  positionLength  type     position
>>>         natural  0      7    1               word     1
>>>       naturwald  0      14   1               SYNONYM  2
>>> forêt naturelle  0      14   1               SYNONYM  2
>>> natürlicher wald 0      14   1               SYNONYM  2
>>>  natural forest  0      14   1               shingle  2
>>>          forest  8      14   1               word     3
>>>
>>> SPF     text                start  end  positionLength  type     position
>>>         natural             0      7    1               word     1
>>>         naturwald           0      9    1               SYNONYM  2
>>>         "forêt naturelle"   0      17   2               SYNONYM  2
>>>         "natürlicher wald"  0      18   2               SYNONYM  2
>>>         "natural forest"    0      16   2               shingle  2
>>>         forest              8      14   1               word     3
>>>
>>>
>>> SGF (SynonymGraphFilter) gives every SYNONYM token the same end and positionLength.
>>> I suppose that is not correct?
>>>
>>> Regards
>>> Bernd
>>>
>>>
>>> On 09.02.2017 at 18:39, Michael McCandless wrote:
>>>> On Thu, Feb 9, 2017 at 2:40 AM, Bernd Fehling
>>>> <bernd.fehling@uni-bielefeld.de> wrote:
>>>>> I tried SynonymGraphFilter with my setup and it worked right away.
>>>>> It paid off that I made some modifications to my filters while
>>>>> testing 6.3 with my setup.
>>>>
>>>> Good!
>>>>
>>>>> I only replaced SynonymFilter with SynonymGraphFilter and did not
>>>>> use FlattenGraphFilter, pretty simple. So I can confirm that, up
>>>>> to this point, SynonymGraphFilter is a full replacement for
>>>>> SynonymFilter. At least for search-time synonym handling.
>>>>>
>>>>> But this also means there is still some work to do on the attributes, right?
>>>>> Position looks good, type and start are no problem anyway, but
>>>>> the end position and the positionLength are still wrong for
>>>>> multi-word synonyms.
>>>>
>>>> Can you give an example or make a small test case?
>>>> PositionLengthAttribute is supposed to be correct coming out of
>>>> SynonymGraphFilter.
>>>>
>>>>> One thing I noticed was that the originating token which "produces"
>>>>> synonyms comes out last from SynonymGraphFilter, after the
>>>>> "produced" synonyms.
>>>>> I will have a look inside with debugger but I guess this is due
>>>>> to output buffering of SynonymGraphFilter?
>>>>
>>>> Yeah they do come out in a different order, which token filters are
>>>> allowed to do in general for all tokens leaving from the same position
>>>> ...
>>>>
>>>> Mike McCandless
>>>>
>>>> http://blog.mikemccandless.com
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>>
>>>
>>> --
>>> *************************************************************
>>> Bernd Fehling                    Bielefeld University Library
>>> Dipl.-Inform. (FH)                LibTec - Library Technology
>>> Universitätsstr. 25                  and Knowledge Management
>>> 33615 Bielefeld
>>> Tel. +49 521 106-4060       bernd.fehling(at)uni-bielefeld.de
>>>
>>> BASE - Bielefeld Academic Search Engine - www.base-search.net
>>> *************************************************************

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
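
One way to read the SGF table for the one-line synonym file quoted above is
as a token graph. The following is only a sketch, assuming each row is an arc
that starts at its "position" and ends at position + positionLength (the usual
meaning of PositionLengthAttribute). The class and method names are made up
for illustration and nothing here calls Lucene; the rows are copied from the
table, with the non-ASCII letters written as Unicode escapes.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: enumerates the paths through a table of analysed tokens. */
class TokenGraphPaths {

    /** One table row: term text, start position, and positionLength. */
    static final class Token {
        final String text;
        final int position;
        final int positionLength;

        Token(String text, int position, int positionLength) {
            if (positionLength < 1) {
                throw new IllegalArgumentException("positionLength must be >= 1");
            }
            this.text = text;
            this.position = position;
            this.positionLength = positionLength;
        }

        /** Assumption: the arc ends at position + positionLength. */
        int end() {
            return position + positionLength;
        }
    }

    /** The SGF rows reported for the query "natural forest", in table order. */
    static List<Token> reportedTokens() {
        List<Token> rows = new ArrayList<>();
        rows.add(new Token("natural", 1, 3));
        rows.add(new Token("naturelle", 2, 3));
        rows.add(new Token("wald", 3, 2));
        rows.add(new Token("naturwald", 1, 4));
        rows.add(new Token("for\u00eat", 1, 1));
        rows.add(new Token("nat\u00fcrlicher", 1, 2));
        rows.add(new Token("forest", 4, 1));
        return rows;
    }

    /** Returns every sequence of terms that leads from position 'from' to position 'to'. */
    static List<String> paths(List<Token> tokens, int from, int to) {
        List<String> out = new ArrayList<>();
        walk(tokens, from, to, new ArrayList<>(), out);
        return out;
    }

    private static void walk(List<Token> tokens, int at, int to,
                             List<String> soFar, List<String> out) {
        if (at == to) {
            out.add(String.join(" ", soFar));
            return;
        }
        for (Token t : tokens) {
            // Follow only arcs that start here and do not overshoot the target.
            if (t.position == at && t.end() <= to) {
                soFar.add(t.text);
                walk(tokens, t.end(), to, soFar, out);
                soFar.remove(soFar.size() - 1);
            }
        }
    }

    public static void main(String[] args) {
        for (String path : paths(reportedTokens(), 1, 5)) {
            System.out.println(path);
        }
    }
}
```

Under that assumption, walking from position 1 to position 5 yields four
paths: natural forest, naturwald, forêt naturelle and natürlicher wald. The
start and end offsets are not part of this check.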