lucene-java-user mailing list archives

From Bernd Fehling <bernd.fehl...@uni-bielefeld.de>
Subject Re: SynonymFilterFactory deprecated since 6.4.0
Date Mon, 13 Feb 2017 10:00:13 GMT
My very simple and small sysonym_test.txt has only one line:
naturwald, natural\ forest, forêt\ naturelle, natürlicher\ wald
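
To make explicit which entries that one line defines, here is a minimal sketch that splits it the way a Solr-style synonyms line is usually read. It assumes that commas separate the equivalent entries and that a backslash escapes the following character. `SynonymLineSplitter` is a hypothetical helper written for this illustration, not part of Lucene or Solr.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: splits one synonyms line into its entries. */
final class SynonymLineSplitter {

    /** Splits on unescaped commas, drops escaping backslashes, trims each entry. */
    static List<String> split(String line) {
        List<String> entries = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c == '\\' && i + 1 < line.length()) {
                // Keep the escaped character (here: a space) literally.
                current.append(line.charAt(++i));
            } else if (c == ',') {
                entries.add(current.toString().trim());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        entries.add(current.toString().trim());
        return entries;
    }

    public static void main(String[] args) {
        String line = "naturwald, natural\\ forest, forêt\\ naturelle, natürlicher\\ wald";
        for (String entry : split(line)) {
            System.out.println(entry);
        }
    }
}
```

Under those assumptions the line yields four entries, three of which are two-word terms: naturwald, natural forest, forêt naturelle and natürlicher wald.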

If I use only WT (WhitespaceTokenizer) and SGF (SynonymGraphFilter, with
WhitespaceTokenizer), the result is:

WT      text     start  end  positionLength  type  position
     natural     0      7    1               word  1
      forest     8      14   1               word  2

SGF     text     start  end  positionLength  type     position
     natural     0      7    3               word     1	
   naturelle     0      14   3               SYNONYM  2	
        wald     0      14   2               SYNONYM  3	
   naturwald     0      14   4               SYNONYM  1	
       forêt     0      14   1               SYNONYM  1	
 natürlicher     0      14   2               SYNONYM  1	

      forest     8      14   1               word     4

The result is some kind of rubbish.
Also note the empty line between "natürlicher" and "forest".
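
One way to inspect the SGF table above is to treat each row as an edge in a token graph that runs from `position` to `position + positionLength`, and then list the token sequences those edges allow. The sketch below does only that. `TokenGraphPaths` and its methods are hypothetical helpers, not Lucene API. The rows are copied by hand from the table, and the start/end offsets are ignored.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper: reads (text, position, positionLength) rows as graph edges. */
final class TokenGraphPaths {

    /** One table row: a token leaving node `position` and arriving at `position + positionLength`. */
    static final class Token {
        final String text;
        final int position;
        final int positionLength;

        Token(String text, int position, int positionLength) {
            this.text = text;
            this.position = position;
            this.positionLength = positionLength;
        }

        int endNode() {
            return position + positionLength;
        }
    }

    /** Returns every token sequence that leads from node `from` to node `to`. */
    static List<String> paths(List<Token> tokens, int from, int to) {
        List<String> out = new ArrayList<>();
        walk(tokens, from, to, "", out);
        return out;
    }

    private static void walk(List<Token> tokens, int node, int to, String prefix, List<String> out) {
        if (node == to) {
            out.add(prefix.trim());
            return;
        }
        // positionLength is at least 1, so every step moves forward and the walk terminates.
        for (Token t : tokens) {
            if (t.position == node) {
                walk(tokens, t.endNode(), to, prefix + " " + t.text, out);
            }
        }
    }

    /** The SGF rows from the table, in the order they were printed. */
    static List<Token> sgfRows() {
        return Arrays.asList(
            new Token("natural", 1, 3),
            new Token("naturelle", 2, 3),
            new Token("wald", 3, 2),
            new Token("naturwald", 1, 4),
            new Token("forêt", 1, 1),
            new Token("natürlicher", 1, 2),
            new Token("forest", 4, 1));
    }

    public static void main(String[] args) {
        // Node 1 is the start; node 5 is the furthest node any row reaches.
        for (String path : paths(sgfRows(), 1, 5)) {
            System.out.println(path);
        }
    }
}
```

With the rows as copied, the walk from node 1 to node 5 finds four sequences: natural forest, naturwald, forêt naturelle and natürlicher wald. This only describes the position and positionLength columns; it says nothing about whether the end offsets are what they should be.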

Is there anything else I should try, maybe with KeywordTokenizer?

P.S. You might have noticed the SPF filters in my setup.
     The first is SynonymPreFilter, which sets all attributes to the right values.
     The second is SynonymPostFilter, which fixes the attribute settings again,
     turns multi-word synonyms into phrases, and cleans up the output
     of SGF.

Regards
Bernd

On 11.02.2017 at 00:45, Michael McCandless wrote:
> Yeah, those tokens should have position length 2.
> 
> Can you reduce to a small set of synonyms and text?  If you use only
> whitespace tokenizer and SGF does the issue reproduce?
> 
> Mike McCandless
> 
> http://blog.mikemccandless.com
> 
> 
> On Fri, Feb 10, 2017 at 10:07 AM, Bernd Fehling
> <bernd.fehling@uni-bielefeld.de> wrote:
>> Example for position end and positionLength of SGF.
>>
>> query: natural forest
>>
>> WT      text     start  end  positionLength  type  position
>>         natural  0      7    1               word  1
>>         forest   8      14   1               word  2
>> ...
>>
>> SPF     text     start  end  positionLength  type     position
>>         natural  0      7    1               word     1
>>  natural forest  0      14   2               shingle  2
>>         forest   8      14   1               word     3
>>
>> SGF     text     start  end  positionLength  type     position
>>         natural  0      7    1               word     1
>>       naturwald  0      14   1               SYNONYM  2
>> forêt naturelle  0      14   1               SYNONYM  2
>> natürlicher wald 0      14   1               SYNONYM  2
>>  natural forest  0      14   1               shingle  2
>>          forest  8      14   1               word     3
>>
>> SPF     text                start  end  positionLength  type     position
>>         natural             0      7    1               word     1
>>         naturwald           0      9    1               SYNONYM  2
>>         "forêt naturelle"   0      17   2               SYNONYM  2
>>         "natürlicher wald"  0      18   2               SYNONYM  2
>>         "natural forest"    0      16   2               shingle  2
>>         forest              8      14   1               word     3
>>
>>
>> SGF (SynonymGraphFilter) gives all SYNONYM tokens the same end position and positionLength.
>> I suppose that is not correct?
>>
>> Regards
>> Bernd
>>
>>
>> On 09.02.2017 at 18:39, Michael McCandless wrote:
>>> On Thu, Feb 9, 2017 at 2:40 AM, Bernd Fehling
>>> <bernd.fehling@uni-bielefeld.de> wrote:
>>>> I tried SynonymGraphFilter with my setup and it works right away.
>>>> It paid off that I did some modifications on my filters while
>>>> testing 6.3 with my setup.
>>>
>>> Good!
>>>
>>>> I only replaced SynonymFilter with SynonymGraphFilter and did not
>>>> use FlattenGraphFilter, pretty simple. So I can confirm that, up
>>>> to this point, SynonymGraphFilter is a full replacement for
>>>> SynonymFilter. At least for search-time synonym handling.
>>>>
>>>> But this also means there is still some work with the attributes, right?
>>>> Position looks good, type and start are no problem anyway, but
>>>> the end position and the positionLength are still wrong for
>>>> multi-word synonyms.
>>>
>>> Can you give an example or make a small test case?
>>> PositionLengthAttribute is supposed to be correct coming out of
>>> SynonymGraphFilter.
>>>
>>>> One thing I noticed was that the originating token which "produces"
>>>> synonyms comes out last from SynonymGraphFilter, after the
>>>> "produced" synonyms.
>>>> I will have a look inside with a debugger, but I guess this is due
>>>> to output buffering of SynonymGraphFilter?
>>>
>>> Yeah they do come out in a different order, which token filters are
>>> allowed to do in general for all tokens leaving from the same position
>>> ...
>>>
>>> Mike McCandless
>>>
>>> http://blog.mikemccandless.com
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>
>>
>> --
>> *************************************************************
>> Bernd Fehling                    Bielefeld University Library
>> Dipl.-Inform. (FH)                LibTec - Library Technology
>> Universitätsstr. 25                  and Knowledge Management
>> 33615 Bielefeld
>> Tel. +49 521 106-4060       bernd.fehling(at)uni-bielefeld.de
>>
>> BASE - Bielefeld Academic Search Engine - www.base-search.net
>> *************************************************************
>>
>>
> 
> 

-- 
*************************************************************
Bernd Fehling                    Bielefeld University Library
Dipl.-Inform. (FH)                LibTec - Library Technology
Universitätsstr. 25                  and Knowledge Management
33615 Bielefeld
Tel. +49 521 106-4060       bernd.fehling(at)uni-bielefeld.de

BASE - Bielefeld Academic Search Engine - www.base-search.net
*************************************************************


