lucene-java-user mailing list archives

From Michael McCandless <luc...@mikemccandless.com>
Subject Re: SynonymFilterFactory deprecated since 6.4.0
Date Mon, 13 Feb 2017 16:31:50 GMT
On Mon, Feb 13, 2017 at 9:04 AM, Bernd Fehling
<bernd.fehling@uni-bielefeld.de> wrote:

> Am I confused by the naming of pos, positionIncrement, offset, positionLength,
> start and end between Lucene and Solr?

"pos" is just accumulating the positionIncrement values, starting from
-1.  I don't think Solr's analysis UI would change the meaning of
these attributes.
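
To make the accumulation concrete, here is a minimal sketch in plain Java with no Lucene dependency. The class TokenGraphNodes, its Token type, and the sample() helper are hypothetical names invented for this illustration, and the positionIncrement values are an assumption inferred from the pos ranges in the test-case output quoted further down. It only shows how a "from-to" node pair can be derived from positionIncrement and positionLength.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration only; not part of Lucene.
final class TokenGraphNodes {

    // Minimal stand-in for the values a token carries in its
    // PositionIncrementAttribute and PositionLengthAttribute.
    static final class Token {
        final String term;
        final int positionIncrement;
        final int positionLength;

        Token(String term, int positionIncrement, int positionLength) {
            this.term = term;
            this.positionIncrement = positionIncrement;
            this.positionLength = positionLength;
        }
    }

    // Sample tokens; increments are assumed (inferred from the quoted pos ranges).
    static List<Token> sample() {
        return Arrays.asList(
            new Token("naturwald", 1, 4),
            new Token("forêt", 0, 1),
            new Token("natürlicher", 0, 2),
            new Token("natural", 0, 3),
            new Token("naturelle", 1, 3),
            new Token("wald", 1, 2),
            new Token("forest", 1, 1));
    }

    // Returns one "term: from-to" line per token.
    static List<String> toArcs(List<Token> tokens) {
        List<String> arcs = new ArrayList<>();
        int pos = -1; // starts at -1, so a first increment of 1 lands on node 0
        for (Token t : tokens) {
            pos += t.positionIncrement;        // "pos" is the running sum of increments
            int to = pos + t.positionLength;   // node where this token's arc ends
            arcs.add(t.term + ": " + pos + "-" + to);
        }
        return arcs;
    }

    public static void main(String[] args) {
        for (String arc : toArcs(sample())) {
            System.out.println(arc);
        }
    }
}
```

With these assumed inputs, the from-to pairs agree with the pos values in the test-case output quoted below, for example 0-4 for naturwald and 3-4 for forest.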

> OK, the SynonymGraphFilter is ONLY for Lucene, right?

No, it's also for Solr and Elasticsearch and any other search servers
on top of Lucene as well.

> But how are you going to build the multi-word synonym query "natürlicher wald"
> from "natural forest"?

Lucene's and Elasticsearch's query parsers have already been fixed to
correctly handle token graphs by default; Solr has a fork of Lucene's
query parser I think ... I'm not sure if it's been fixed yet to
interpret graphs.

See e.g. https://issues.apache.org/jira/browse/LUCENE-7603 and
https://issues.apache.org/jira/browse/LUCENE-7638
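
As a rough illustration of what "interpreting the graph" can mean, the sketch below enumerates every path through a small token graph, so each multi-word synonym comes out as its own term sequence. It is plain Java and is not the code used by any of the query parsers mentioned above; GraphPathsSketch, Arc, and sample() are hypothetical names, and the node numbers are taken from the test-case output quoted further down.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; not Lucene's query parser logic.
final class GraphPathsSketch {

    // One token viewed as an arc between two graph nodes.
    static final class Arc {
        final String term;
        final int from;
        final int to;

        Arc(String term, int from, int to) {
            this.term = term;
            this.from = from;
            this.to = to;
        }
    }

    // Arcs for "natural forest" plus its synonyms (node IDs as quoted in this thread).
    static List<Arc> sample() {
        return Arrays.asList(
            new Arc("naturwald", 0, 4),
            new Arc("forêt", 0, 1),
            new Arc("natürlicher", 0, 2),
            new Arc("natural", 0, 3),
            new Arc("naturelle", 1, 4),
            new Arc("wald", 2, 4),
            new Arc("forest", 3, 4));
    }

    // Collects every term sequence that leads from node 'start' to node 'end'.
    static List<String> paths(List<Arc> arcs, int start, int end) {
        List<String> out = new ArrayList<>();
        walk(arcs, start, end, "", out);
        return out;
    }

    // Depth-first walk; it terminates because every arc moves to a higher node.
    private static void walk(List<Arc> arcs, int node, int end,
                             String prefix, List<String> out) {
        if (node == end) {
            out.add(prefix);
            return;
        }
        for (Arc a : arcs) {
            if (a.from == node) {
                String next = prefix.isEmpty() ? a.term : prefix + " " + a.term;
                walk(arcs, a.to, end, next, out);
            }
        }
    }

    public static void main(String[] args) {
        for (String p : paths(sample(), 0, 4)) {
            System.out.println(p);
        }
    }
}
```

Each resulting path could then become one phrase alternative in the final query, which is how "natürlicher wald" can be matched as a unit instead of as two unrelated terms.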

> And how are you going to highlight a synonym hit for "natürlicher wald"
> when start and end is set to 0-14 and not to 0-18?
> Or is start and end not used for highlighting?

This start/end offset, at query time, is not normally used.  If you
have a document in the index that has "natürlicher wald" then it would
have offsets X to X+18, stored in the index ideally as postings
offsets, and should highlight correctly?

Mike McCandless

http://blog.mikemccandless.com

> On 13.02.2017 at 14:24, Michael McCandless wrote:
>> Unfortunately, I cannot reproduce the problem with a straight Lucene
>> test case.  I added this test case to TestSynonymGraphFilter.java:
>>
>>     https://gist.github.com/mikemccand/318459ca507742052688e2fe800a10dd
>>
>> And when I run it, it produces the correct token graph:
>>
>> TOKEN: naturwald
>>   offset: 0-14
>>   pos: 0-4
>>   type: SYNONYM
>>
>> TOKEN: forêt
>>   offset: 0-14
>>   pos: 0-1
>>   type: SYNONYM
>>
>> TOKEN: natürlicher
>>   offset: 0-14
>>   pos: 0-2
>>   type: SYNONYM
>>
>> TOKEN: natural
>>   offset: 0-7
>>   pos: 0-3
>>   type: word
>>
>> TOKEN: naturelle
>>   offset: 0-14
>>   pos: 1-4
>>   type: SYNONYM
>>
>> TOKEN: wald
>>   offset: 0-14
>>   pos: 2-4
>>   type: SYNONYM
>>
>> TOKEN: forest
>>   offset: 8-14
>>   pos: 3-4
>>   type: word
>>
>> Remember that the "pos: " output above is really "node IDs" and you
>> can see the inserted side paths are correct.  The offsets are
>> necessarily always 0-14 for inserted tokens because that is the span
>> of the two original tokens.
>>
>> Can you try removing the SPF filters in your test?  Or otherwise
>> simplify your test so it's closer to what my test case is doing?
>>
>> Mike McCandless
>>
>> http://blog.mikemccandless.com
>>
>> On Mon, Feb 13, 2017 at 7:52 AM, Michael McCandless
>> <lucene@mikemccandless.com> wrote:
>>> Thanks Bernd; I'll see if I can make a test case from this.
>>>
>>> Mike McCandless
>>>
>>> http://blog.mikemccandless.com
>>>
>>>
>>> On Mon, Feb 13, 2017 at 5:00 AM, Bernd Fehling
>>> <bernd.fehling@uni-bielefeld.de> wrote:
>>>> My very simple and small sysonym_test.txt has only one line:
>>>> naturwald, natural\ forest, forêt\ naturelle, natürlicher\ wald
>>>>
>>>> If I only use WT (WhitespaceTokenizer) and SGF (with WhitespaceTokenizer)
>>>> the result is:
>>>>
>>>> WT      text     start  end  positionLength  type  position
>>>>      natural     0      7    1               word  1
>>>>       forest     8      14   1               word  2
>>>>
>>>> SGF     text     start  end  positionLength  type     position
>>>>      natural     0      7    3               word     1
>>>>    naturelle     0      14   3               SYNONYM  2
>>>>         wald     0      14   2               SYNONYM  3
>>>>    naturwald     0      14   4               SYNONYM  1
>>>>        forêt     0      14   1               SYNONYM  1
>>>>  natürlicher     0      14   2               SYNONYM  1
>>>>
>>>>       forest     8      14   1               word     4
>>>>
>>>> The result is some kind of rubbish.
>>>> Also note the empty line between "natürlicher" and "forest".
>>>>
>>>> Anything else I should try, maybe with KeywordTokenizer?
>>>>
>>>> p.s. You might have noticed the SPF filters in my setup.
>>>>      The first is SynonymPreFilter, which sets all attributes to the right values;
>>>>      the second is SynonymPostFilter, which fixes the attribute settings again,
>>>>      turns multi-word synonyms into phrases, and cleans up the output
>>>>      of SGF.
>>>>
>>>> Regards
>>>> Bernd
>>>>
>>>> On 11.02.2017 at 00:45, Michael McCandless wrote:
>>>>> Yeah, those tokens should have position length 2.
>>>>>
>>>>> Can you reduce to a small set of synonyms and text?  If you use only
>>>>> whitespace tokenizer and SGF does the issue reproduce?
>>>>>
>>>>> Mike McCandless
>>>>>
>>>>> http://blog.mikemccandless.com
>>>>>
>>>>>
>>>>> On Fri, Feb 10, 2017 at 10:07 AM, Bernd Fehling
>>>>> <bernd.fehling@uni-bielefeld.de> wrote:
>>>>>> Example for position end and positionLength of SGF.
>>>>>>
>>>>>> query: natural forest
>>>>>>
>>>>>> WT      text     start  end  positionLength  type  position
>>>>>>         natural  0      7    1               word  1
>>>>>>         forest   8      14   1               word  2
>>>>>> ...
>>>>>>
>>>>>> SPF     text     start  end  positionLength  type     position
>>>>>>         natural  0      7    1               word     1
>>>>>>  natural forest  0      14   2               shingle  2
>>>>>>         forest   8      14   1               word     3
>>>>>>
>>>>>> SGF     text     start  end  positionLength  type     position
>>>>>>         natural  0      7    1               word     1
>>>>>>       naturwald  0      14   1               SYNONYM  2
>>>>>> forêt naturelle  0      14   1               SYNONYM  2
>>>>>> natürlicher wald 0      14   1               SYNONYM  2
>>>>>>  natural forest  0      14   1               shingle  2
>>>>>>          forest  8      14   1               word     3
>>>>>>
>>>>>> SPF     text         start  end  positionLength  type     position
>>>>>>             natural  0      7    1               word     1
>>>>>>           naturwald  0      9    1               SYNONYM  2
>>>>>>   "forêt naturelle"  0      17   2               SYNONYM  2
>>>>>>  "natürlicher wald"  0      18   2               SYNONYM  2
>>>>>>    "natural forest"  0      16   2               shingle  2
>>>>>>              forest  8      14   1               word     3
>>>>>>
>>>>>>
>>>>>> SGF (SynonymGraphFilter) has the same end position and positionLength
>>>>>> for all SYNONYMs.
>>>>>> I suppose that is not correct?
>>>>>>
>>>>>> Regards
>>>>>> Bernd
>>>>>>
>>>>>>
>>>>>> On 09.02.2017 at 18:39, Michael McCandless wrote:
>>>>>>> On Thu, Feb 9, 2017 at 2:40 AM, Bernd Fehling
>>>>>>> <bernd.fehling@uni-bielefeld.de> wrote:
>>>>>>>> I tried SynonymGraphFilter with my setup and it works right away.
>>>>>>>> It paid off that I did some modifications on my filters while
>>>>>>>> testing 6.3 with my setup.
>>>>>>>
>>>>>>> Good!
>>>>>>>
>>>>>>>> I only replaced SynonymFilter with SynonymGraphFilter and did not
>>>>>>>> use FlattenGraphFilter, pretty simple. So I can confirm that, up
>>>>>>>> to this point, SynonymGraphFilter is a full replacement for
>>>>>>>> SynonymFilter. At least for search-time synonym handling.
>>>>>>>>
>>>>>>>> But this also means there is still some work with the attributes, right?
>>>>>>>> Position looks good, type and start are no problem anyway, but
>>>>>>>> the end position is still wrong and the positionLength for multi-word
>>>>>>>> synonyms.
>>>>>>>
>>>>>>> Can you give an example or make a small test case?
>>>>>>> PositionLengthAttribute is supposed to be correct coming out of
>>>>>>> SynonymGraphFilter.
>>>>>>>
>>>>>>>> One thing I noticed was that the originating token which "produces"
>>>>>>>> synonyms comes out last from SynonymGraphFilter, after the
>>>>>>>> "produced" synonyms.
>>>>>>>> I will have a look inside with debugger but I guess this is due
>>>>>>>> to output buffering of SynonymGraphFilter?
>>>>>>>
>>>>>>> Yeah they do come out in a different order, which token filters are
>>>>>>> allowed to do in general for all tokens leaving from the same position
>>>>>>> ...
>>>>>>>
>>>>>>> Mike McCandless
>>>>>>>
>>>>>>> http://blog.mikemccandless.com
>>>>>>>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
