lucene-java-user mailing list archives

From "baris.kazar" <baris.ka...@oracle.com>
Subject Re: Ignoring “de la” at index or search time
Date Mon, 25 Feb 2019 01:02:02 GMT
There is PhraseQuery, too, but let's consider two cases:

Case 1: PhraseQuery is not being used.
Should I then add the French stopwords to the standard filter’s stopwords at both index
and search time? Or can I just add them at search time and keep the old index as it is?

Case 2: PhraseQuery is being used.
I guess I need to play with the “slop”, and stopwords will not help in this case, right?

Thanks

> On Feb 24, 2019, at 2:25 PM, baris.kazar <baris.kazar@oracle.com> wrote:
> 
> That is not what i am looking for. Thanks.
> 
> The search string c b finds
> a b
> but somehow can't find
> a de la b
> so I will try French stopwords.
> Doing that, I am using 8 queries like the ones I mentioned.
> Best
> 
>> On Feb 24, 2019, at 1:19 PM, Erick Erickson <erickerickson@gmail.com> wrote:
>> 
>> Phrase search is looking for words next to each other. A phrase search on the text
>> “my dog has fleas” would succeed for “my dog” or “has fleas” but not “my fleas”
>> since the words are not right next to each other. “my fleas”~3 would succeed because the
>> “~3” indicates that the words can have intervening terms.
>> 
>> Searching (dog AND fleas) would match no matter how many words were between the two.
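
The phrase, slop, and AND behavior described above can be illustrated with a minimal sketch. This is a toy positional matcher in Python, not Lucene's API: `phrase_matches` and `all_terms_present` are hypothetical helpers, tokenization is plain whitespace splitting, and the slop measure (spread of positions after subtracting each term's offset in the phrase) is a simplifying assumption about how sloppy phrases behave.

```python
from itertools import product

def phrase_matches(tokens, phrase, slop=0):
    """Toy positional phrase match (an illustration, not Lucene's implementation).

    Each phrase term is assigned a position in the document. Subtracting the
    term's offset within the phrase gives a "shifted" position: an exact phrase
    has all shifted positions equal, and slop bounds how far they may spread.
    """
    if not phrase:
        return False
    # Map each token to the list of positions where it occurs.
    index = {}
    for pos, tok in enumerate(tokens):
        index.setdefault(tok, []).append(pos)
    candidates = [index.get(term, []) for term in phrase]
    if not all(candidates):  # some phrase term never occurs in the document
        return False
    for combo in product(*candidates):
        shifted = [pos - offset for offset, pos in enumerate(combo)]
        if max(shifted) - min(shifted) <= slop:
            return True
    return False

def all_terms_present(tokens, terms):
    """Boolean AND: every term occurs somewhere; positions are ignored."""
    return all(term in tokens for term in terms)

doc = "my dog has fleas".split()
print(phrase_matches(doc, ["my", "dog"]))            # True
print(phrase_matches(doc, ["has", "fleas"]))         # True
print(phrase_matches(doc, ["my", "fleas"]))          # False
print(phrase_matches(doc, ["my", "fleas"], slop=3))  # True
print(all_terms_present(doc, ["dog", "fleas"]))      # True
```

In this model “my fleas” needs a slop of at least 2 on this text, so the ~3 in the example leaves room to spare.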
>> 
>> If you’re unclear about what phrase search vs. non-phrase search means, some background
>> research and self-education is strongly recommended; such basic understanding of search is
>> pretty much assumed.
>> 
>> Best,
>> Erick
>> 
>>> On Feb 24, 2019, at 9:25 AM, baris.kazar <baris.kazar@oracle.com> wrote:
>>> 
>>> I guess so.
>>> What is phrase search?
>>> If c b is searched, do you expect a de la b to be found?
>>> Thanks
>>> 
>>>> On Feb 24, 2019, at 10:49 AM, Erick Erickson <erickerickson@gmail.com> wrote:
>>>> 
>>>> Not sure we’re talking about the same thing. I was talking specifically
>>>> about _phrase_ searches. If all you want is the clause you just said, phrases are not involved
>>>> at all and the presence or absence of intervening words is totally irrelevant. This assumes
>>>> your field type tokenizes the input similarly to the text_general field in the examples. Specifically
>>>> _not_ “string” fields or fields that use KeywordTokenizer.
>>>> 
>>>> q=name:(a AND b) OR name:b
>>>> 
>>>> for instance. With a query like that it doesn’t matter in the least whether
>>>> or not there are any words between “a” and “b”.
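
As a rough illustration of why intervening words do not matter for such a boolean query, here is a small sketch that evaluates the clause against the set of tokens in a document. It is a toy model, not Lucene or Solr code: `matches` is a hypothetical helper, the tuple form of the query is made up for this example, and whitespace splitting stands in for a text_general-style tokenizer.

```python
def matches(doc_text, query):
    """Evaluate a nested boolean query against the set of tokens in doc_text.

    A query is either a term (str) or a tuple ("AND" | "OR", sub, sub, ...).
    Positions are never consulted, so intervening words cannot matter.
    """
    tokens = set(doc_text.lower().split())

    def evaluate(node):
        if isinstance(node, str):
            return node in tokens
        op, *children = node
        results = [evaluate(child) for child in children]
        return all(results) if op == "AND" else any(results)

    return evaluate(query)

# Toy encoding of: name:(a AND b) OR name:b
query = ("OR", ("AND", "a", "b"), "b")

for text in ["a b", "a de la b", "b", "a c"]:
    print(text, "->", matches(text, query))
# a b -> True
# a de la b -> True
# b -> True
# a c -> False
```

In this model, “a b” and “a de la b” are indistinguishable. As far as matching alone goes (scoring aside), any document containing “b” satisfies this particular clause.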
>>>> 
>>>> All that may be obvious to you, but when I read your latest e-mail it occurred
>>>> to me that we might not be talking about the same thing.
>>>> 
>>>> Best,
>>>> Erick
>>>> 
>>>>> On Feb 23, 2019, at 7:33 PM, baris.kazar <baris.kazar@oracle.com> wrote:
>>>>> 
>>>>> In this case the search string is c b
>>>>> and the search query then has 8 combos,
>>>>> including two cases with c b ~, which means find everything containing c AND b, and
>>>>> c OR b (two separate queries having ~),
>>>>> and then I can find a b but not a de la b without French stopwords.
>>>>> Thanks
>>>>> 
>>>>>> On Feb 23, 2019, at 6:52 PM, Erick Erickson <erickerickson@gmail.com> wrote:
>>>>>> 
>>>>>> Lucene won’t ignore these unless you tell it to via stopwords.
>>>>>> 
>>>>>> This is a problem no matter how you look at it. If you do put in
>>>>>> stopwords, the word _positions_ are retained. In your example,
>>>>>> word     position
>>>>>> a           1
>>>>>> de         2
>>>>>> la         3
>>>>>> b           4
>>>>>> 
>>>>>> If you remove “de” and “la” via stopwords, the positions are still:
>>>>>> 
>>>>>> word     position
>>>>>> a           1
>>>>>> b           4
>>>>>> 
>>>>>> So searching for “a b” would fail in the second case unless you included “slop” as
>>>>>> “a b”~2
>>>>>> 
>>>>>> But let’s say you _do not_ have input with these stopwords, just “a b”. The positions
>>>>>> will be 1 and 2 respectively. Here the user would expect “a b” to match this doc, but
>>>>>> not a doc with “a de la b” (unless they knew a lot about search!).
>>>>>> 
>>>>>> So maybe the right thing to do is let phrases have slop as a matter of course.
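
The position bookkeeping above can be sketched in a few lines. This is a toy analyzer in Python, not Lucene's analysis chain: `analyze` and `min_slop` are hypothetical helpers, the stopword set is a two-word illustrative subset rather than a real French stopword list, and positions are 1-based to match the tables above.

```python
# Tiny illustrative subset; a real French stopword list is much longer.
FRENCH_STOPWORDS = frozenset({"de", "la"})

def analyze(text, stopwords=frozenset()):
    """Return (token, position) pairs, with 1-based positions.

    Removed stopwords still consume a position, so the surviving tokens
    keep the positions they had in the original text.
    """
    analyzed = []
    for position, token in enumerate(text.lower().split(), start=1):
        if token not in stopwords:
            analyzed.append((token, position))
    return analyzed

def min_slop(analyzed, first, second):
    """Smallest slop letting the in-order phrase "first second" match.

    Assumes each term occurs exactly once and `first` precedes `second`.
    """
    positions = dict(analyzed)
    return positions[second] - positions[first] - 1

with_connectors = analyze("a de la b", FRENCH_STOPWORDS)
print(with_connectors)                       # [('a', 1), ('b', 4)]
print(min_slop(with_connectors, "a", "b"))   # 2

plain = analyze("a b", FRENCH_STOPWORDS)
print(plain)                                 # [('a', 1), ('b', 2)]
print(min_slop(plain, "a", "b"))             # 0
```

Under these assumptions, a query of “a b”~2 covers both documents, while an exact “a b” phrase matches only the second.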
>>>>>> 
>>>>>> Best,
>>>>>> Erick
>>>>>> 
>>>>>> 
>>>>>>> On Feb 23, 2019, at 11:07 AM, baris.kazar <baris.kazar@oracle.com> wrote:
>>>>>>> 
>>>>>>> Thanks Erick. There is a pattern I can't catch in my results, such as:
>>>>>>> a de la b
>>>>>>> I catch “a b” though.
>>>>>>> I thought Lucene might ignore those automatically while creating the index.
>>>>>>> 
>>>>>>> 
>>>>>>>> On Feb 23, 2019, at 12:29 PM, Erick Erickson <erickerickson@gmail.com> wrote:
>>>>>>>> 
>>>>>>>> Use stopwords, although it's becoming less of a concern.
>>>>>>>> Why do you think you need to?
>>>>>>>> 
>>>>>>>>> On Sat, Feb 23, 2019, 08:42 baris.kazar <baris.kazar@oracle.com> wrote:
>>>>>>>>> 
>>>>>>>>> Hi,-
>>>>>>>>> What is the (most efficient) way to
>>>>>>>>> ignore “de la” kinda connectors
>>>>>>>>> in a string at index or search time?
>>>>>>>>> Thanks
>>>>>>>>> 
>>>>>>>>> ---------------------------------------------------------------------
>>>>>>>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>>>>>>>> For additional commands, e-mail: java-user-help@lucene.apache.org