lucene-java-user mailing list archives

From "baris.kazar" <baris.ka...@oracle.com>
Subject Re: Ignoring “de la” at index or search time
Date Sun, 24 Feb 2019 17:25:20 GMT
I guess so.
What is a phrase search?
If "c b" is searched, do you expect "a de la b" to match?
Thanks

> On Feb 24, 2019, at 10:49 AM, Erick Erickson <erickerickson@gmail.com> wrote:
> 
> Not sure we’re talking about the same thing. I was talking specifically about _phrase_
> searches. If all you want is the clause you just described, phrases are not involved at all and
> the presence or absence of intervening words is irrelevant. This assumes your field
> type tokenizes the input similarly to the text_general field in the examples, and specifically
> _not_ “string” fields or fields that use KeywordTokenizer.
> 
> q=name:(a AND b) OR name:b
> 
> for instance. With a query like that it doesn’t matter in the least whether there are,
> or are not, any words between “a” and “b”.
> 
> All that may be obvious to you, but when I read your latest e-mail it occurred to me
> that we might not be talking about the same thing.
> 
> Best,
> Erick
> 
>> On Feb 23, 2019, at 7:33 PM, baris.kazar <baris.kazar@oracle.com> wrote:
>> 
>> In this case the search string is c b,
>> and the search query then has 8 combos,
>> including two cases with c b ~, which means find all containing c And b and c Or b
>> (two separate queries having ~),
>> and then I can find a b but not a de la b without French stopwords.
>> Thanks
>> 
>>> On Feb 23, 2019, at 6:52 PM, Erick Erickson <erickerickson@gmail.com> wrote:
>>> 
>>> Lucene won’t ignore these unless you tell it to via stopwords.
>>> 
>>> This is a problem no matter how you look at it. If you do put in stopwords, the
>>> word _positions_ are retained. In your example,
>>> word     position
>>> a        1
>>> de       2
>>> la       3
>>> b        4
>>> 
>>> If you remove “de” and “la” via stopwords, the positions are still:
>>> 
>>> word     position
>>> a        1
>>> b        4
>>> 
>>> So searching for “a b” would fail in the second case unless you included “slop”, as in
>>> “a b”~2
>>> 
>>> But let’s say you _do not_ have input with these stopwords, just “a b”. The positions
>>> will be 1 and 2 respectively. Here the user would expect “a b” to match this doc, but
>>> not a doc with “a de la b” (unless they knew a lot about search!).
>>> 
>>> So maybe the right thing to do is let phrases have slop as a matter of course.
>>> 
>>> Best,
>>> Erick
>>> 
>>> 
>>>> On Feb 23, 2019, at 11:07 AM, baris.kazar <baris.kazar@oracle.com> wrote:
>>>> 
>>>> Thanks Erick, there is a pattern I can't catch in my results, such as:
>>>> a de la b
>>>> I catch “a b” though.
>>>> I thought Lucene might ignore those automatically while creating the index.
>>>> 
>>>> 
>>>>> On Feb 23, 2019, at 12:29 PM, Erick Erickson <erickerickson@gmail.com> wrote:
>>>>> 
>>>>> Use stopwords, although that's becoming less of a concern. Why do you think
>>>>> you need to?
>>>>> 
>>>>>> On Sat, Feb 23, 2019, 08:42 baris.kazar <baris.kazar@oracle.com> wrote:
>>>>>> 
>>>>>> Hi,-
>>>>>> What is the (most efficient) way to
>>>>>> ignore “de la” kinda connectors
>>>>>> in a string at index or search time?
>>>>>> Thanks
>>>>>> 
>>>>>> ---------------------------------------------------------------------
>>>>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>>>>> For additional commands, e-mail: java-user-help@lucene.apache.org
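
The position behaviour described in the thread can be sketched in plain Java. This is a simplified model and not Lucene's API: the class StopwordPositionSketch and its methods index, phraseMatches and allTermsPresent are hypothetical names, tokenization is assumed to be a lowercase whitespace split, positions are 1-based as in the tables above, and only in-order two-term phrases are handled. It shows why removing "de" and "la" still leaves "b" at position 4, why the phrase "a b" then needs a slop of 2, and why a boolean query on the two terms does not care about the gap.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Simplified model of stopword removal with retained positions (not Lucene code).
class StopwordPositionSketch {

    // Split on whitespace and drop stopwords, but keep each surviving
    // token's original 1-based position, as in the tables in the thread.
    static Map<String, List<Integer>> index(String text, Set<String> stopwords) {
        Map<String, List<Integer>> positions = new LinkedHashMap<>();
        String[] tokens = text.toLowerCase(Locale.ROOT).trim().split("\\s+");
        for (int i = 0; i < tokens.length; i++) {
            if (stopwords.contains(tokens[i])) {
                continue; // token is removed, yet the position counter still advances
            }
            positions.computeIfAbsent(tokens[i], k -> new ArrayList<>()).add(i + 1);
        }
        return positions;
    }

    // In-order two-term phrase match: the number of skipped positions
    // between the two terms must not exceed the slop.
    static boolean phraseMatches(Map<String, List<Integer>> doc,
                                 String first, String second, int slop) {
        for (int p1 : doc.getOrDefault(first, List.of())) {
            for (int p2 : doc.getOrDefault(second, List.of())) {
                int gap = p2 - p1 - 1;
                if (gap >= 0 && gap <= slop) {
                    return true;
                }
            }
        }
        return false;
    }

    // Boolean AND: only the presence of each term matters, positions are ignored.
    static boolean allTermsPresent(Map<String, List<Integer>> doc, String... terms) {
        for (String term : terms) {
            if (!doc.containsKey(term)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Set<String> stop = Set.of("de", "la");
        Map<String, List<Integer>> withGap = index("a de la b", stop);
        Map<String, List<Integer>> adjacent = index("a b", stop);

        System.out.println(withGap);                               // {a=[1], b=[4]}
        System.out.println(phraseMatches(withGap, "a", "b", 0));   // false
        System.out.println(phraseMatches(withGap, "a", "b", 2));   // true
        System.out.println(phraseMatches(adjacent, "a", "b", 0));  // true
        System.out.println(allTermsPresent(withGap, "a", "b"));    // true
    }
}
```

In a real index the corresponding pieces would be an analyzer with a stop filter at index time and a phrase query with slop at search time. The model only illustrates why "a b" needs ~2 against "a de la b" while name:(a AND b) matches regardless of what sits between the terms.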