lucene-java-user mailing list archives

From Erick Erickson <erickerick...@gmail.com>
Subject Re: Ignoring “de la” at index or search time
Date Mon, 25 Feb 2019 02:21:36 GMT
Case 1. Stopwords are irrelevant. If you search field:(a AND b) you're
asking if both appear in the field, and that's the only question. It
doesn't matter what other words are in the field. It doesn't matter whether
they're close to each other.
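
A toy sketch of that point, in plain Python rather than Lucene code. The
names tokenize and matches_all are made up for illustration, and whitespace
splitting is only a crude stand-in for a real analyzer. It shows that an
AND of two terms is a pure membership test, so intervening words like
"de la" change nothing:

```python
def tokenize(text):
    """Lowercase and split on whitespace; a crude stand-in for an analyzer."""
    return text.lower().split()


def matches_all(doc_text, required_terms):
    """True if every required term occurs somewhere in the document."""
    terms_in_doc = set(tokenize(doc_text))
    return all(term in terms_in_doc for term in required_terms)


docs = ["a b", "a de la b", "c b"]

# Rough equivalent of field:(a AND b): positions and neighbours are ignored.
hits = [d for d in docs if matches_all(d, ["a", "b"])]
print(hits)  # ['a b', 'a de la b']
```

Both "a b" and "a de la b" come back, with or without stopwords.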

Case 2. Yep.
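
Again as a toy sketch only (plain Python, not Lucene's implementation): the
helpers analyze and phrase_matches are hypothetical, positions start at 1 as
in the table quoted below, and only the in-order, two-term case is modelled.
It shows why removing "de" and "la" does not help a phrase query, while
slop does:

```python
def analyze(text, stopwords=frozenset()):
    """Return (term, position) pairs; removed stopwords still use up a position."""
    return [
        (tok, pos)
        for pos, tok in enumerate(text.lower().split(), start=1)
        if tok not in stopwords
    ]


def phrase_matches(tokens, first, second, slop=0):
    """In-order two-term phrase: allow up to `slop` positions between the terms."""
    firsts = [p for t, p in tokens if t == first]
    seconds = [p for t, p in tokens if t == second]
    return any(0 <= p2 - p1 - 1 <= slop for p1 in firsts for p2 in seconds)


french = frozenset({"de", "la"})  # assumed stopword set for this example
tokens = analyze("a de la b", french)

print(tokens)                                    # [('a', 1), ('b', 4)]
print(phrase_matches(tokens, "a", "b"))          # False: a gap of 2 positions remains
print(phrase_matches(tokens, "a", "b", slop=2))  # True
```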

On Sun, Feb 24, 2019, 17:02 baris.kazar <baris.kazar@oracle.com> wrote:

> There is PhraseQuery, too, but let's consider two cases:
>
> Case 1: PhraseQuery is not being used.
> Should I then add the French stopwords to the standard filter’s stopwords
> at both index and search time? Or can I just add them at search time and keep
> old friends index as it is?
>
> Case 2: PhraseQuery is being used.
> I guess I need to play with the “slops”, and stopwords will not help in
> this case, right?
>
> Thanks
>
> > On Feb 24, 2019, at 2:25 PM, baris.kazar <baris.kazar@oracle.com> wrote:
> >
> > That is not what I am looking for. Thanks.
> >
> > The search string c b finds
> > a b
> > but somehow can't find
> > a de la b
> > so I will try French stopwords.
> > Doing that, I am using 8 queries like the ones I mentioned.
> > Best
> >
> >> On Feb 24, 2019, at 1:19 PM, Erick Erickson <erickerickson@gmail.com> wrote:
> >>
> >> Phrase search is looking for words next to each other. A phrase search
> >> on the text “my dog has fleas” would succeed for “my dog” or “has fleas”
> >> but not “my fleas” since the words are not right next to each other. “my
> >> fleas”~3 would succeed because the “~3” indicates that the words can have
> >> intervening terms.
> >>
> >> Searching (dog AND fleas) would match no matter how many words were
> >> between the two.
> >>
> >> If you’re unclear about what phrase search vs. non-phrase search
> >> means, some background research/self-education is strongly recommended;
> >> such basic understanding of search is pretty much assumed.
> >>
> >> Best,
> >> Erick
> >>
> >>> On Feb 24, 2019, at 9:25 AM, baris.kazar <baris.kazar@oracle.com> wrote:
> >>>
> >>> I guess so.
> >>> What is phrase search?
> >>> If c b is searched, do you expect a de la b?
> >>> Thanks
> >>>
> >>>> On Feb 24, 2019, at 10:49 AM, Erick Erickson <erickerickson@gmail.com> wrote:
> >>>>
> >>>> Not sure we’re talking about the same thing. I was talking
> >>>> specifically about _phrase_ searches. If all you want is the clause you
> >>>> just said, phrases are not involved at all and the presence or absence of
> >>>> intervening words is totally irrelevant. This assumes your field type
> >>>> tokenizes the input similar to the text_general field in the examples.
> >>>> Specifically _not_ “string” fields or fields that use KeywordTokenizer.
> >>>>
> >>>> q=name:(a AND b) OR name:b
> >>>>
> >>>> for instance. With a query like that it doesn’t matter in the least
> >>>> whether there are, or are not, any words between “a” and “b”.
> >>>>
> >>>> All that may be obvious to you, but when I read your latest e-mail it
> >>>> occurred to me that we might not be talking about the same thing.
> >>>>
> >>>> Best,
> >>>> Erick
> >>>>
> >>>>> On Feb 23, 2019, at 7:33 PM, baris.kazar <baris.kazar@oracle.com> wrote:
> >>>>>
> >>>>> In this case the search string is c b
> >>>>> and then the search query has 8 combos,
> >>>>> including two cases with c b ~, which means find all containing c AND
> >>>>> b, and c OR b (two separate queries having ~),
> >>>>> and then I can find a b but not a de la b without French stopwords.
> >>>>> Thanks
> >>>>>
> >>>>>> On Feb 23, 2019, at 6:52 PM, Erick Erickson <erickerickson@gmail.com> wrote:
> >>>>>>
> >>>>>> Lucene won’t ignore these unless you tell it to via stopwords.
> >>>>>>
> >>>>>> This is a problem no matter how you look at it. If you do put in
> >>>>>> stopwords, the word _positions_ are retained. In your example,
> >>>>>>
> >>>>>> word     position
> >>>>>> a        1
> >>>>>> de       2
> >>>>>> la       3
> >>>>>> b        4
> >>>>>>
> >>>>>> If you remove “de” and “la” via stopwords, the positions are still:
> >>>>>>
> >>>>>> word     position
> >>>>>> a        1
> >>>>>> b        4
> >>>>>>
> >>>>>> So searching for “a b” would fail in the second case unless you
> >>>>>> included “slop” as
> >>>>>> “a b”~2
> >>>>>>
> >>>>>> But let’s say you _do not_ have input with these stopwords, just “a
> >>>>>> b”. The positions will be 1 and 2 respectively. Here the user would
> >>>>>> expect “a b” to match this doc, but not a doc with “a de la b”
> >>>>>> (unless they knew a lot about search!).
> >>>>>>
> >>>>>> So maybe the right thing to do is let phrases have slop as a matter
> >>>>>> of course.
> >>>>>>
> >>>>>> Best,
> >>>>>> Erick
> >>>>>>
> >>>>>>
> >>>>>>> On Feb 23, 2019, at 11:07 AM, baris.kazar <baris.kazar@oracle.com> wrote:
> >>>>>>>
> >>>>>>> Thanks Erick, there is a pattern I can't catch in my results, such as:
> >>>>>>> a de la b
> >>>>>>> I catch “a b” though.
> >>>>>>> I thought Lucene might ignore those automatically while creating the
> >>>>>>> index.
> >>>>>>>
> >>>>>>>
> >>>>>>>> On Feb 23, 2019, at 12:29 PM, Erick Erickson <erickerickson@gmail.com> wrote:
> >>>>>>>>
> >>>>>>>> Use stopwords, although it's becoming less of a concern. Why do you
> >>>>>>>> think you need to?
> >>>>>>>>
> >>>>>>>>> On Sat, Feb 23, 2019, 08:42 baris.kazar <baris.kazar@oracle.com> wrote:
> >>>>>>>>>
> >>>>>>>>> Hi,-
> >>>>>>>>> What is the (most efficient) way to
> >>>>>>>>> ignore “de la” kinda connectors
> >>>>>>>>> in a string at index or search time?
> >>>>>>>>> Thanks
> >>>>>>>>>
> >>>>>>>>> ---------------------------------------------------------------------
> >>>>>>>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> >>>>>>>>> For additional commands, e-mail: java-user-help@lucene.apache.org
