lucene-java-user mailing list archives

From "Shifflett, David [USA]" <Shifflett_Da...@bah.com>
Subject Re: How to ignore certain words based on query specifics
Date Tue, 09 Jul 2019 15:29:24 GMT
Sorry for the weird reply path, but I couldn’t find an easy reply method via the list archive.

Anyway …

The use case is as follows:
Allow the user to specify queries such as ‘free*’
and also list similar words to be ignored, such as ‘freedom’.
Another example would be ‘secret*’ with ‘secretary’ ignored.

I wanted to keep the ignore words in a separate list so that they apply to all queries,
but then realized they should only apply to the queries that would actually match them.

I don’t want the users to be required to add ‘and not WORD’ many times to each of the listed queries.

David Shifflett

From: Diego Ceccarelli

Could you please describe the use case? Maybe there is an easier solution.



From: "Shifflett, David [USA]" <Shifflett_David@bah.com>
Date: Tuesday, July 9, 2019 at 8:02 AM
To: "java-user@lucene.apache.org" <java-user@lucene.apache.org>
Subject: How to ignore certain words based on query specifics

Hi all,
I have a configuration file that lists multiple queries, of all different types,
and that lists words to be ignored.

Each of these lists is user configured, variable in length and content.

I know that, in general, unless the ignore word is in the query it won’t match,
but I need to be able to handle wildcard, fuzzy, and regex queries, which might match.

What I need to be able to do is ignore the words in the ignore list,
but only when they match terms the query would match.

For example: if the query is ‘free*’ and ‘freedom’ should be ignored,
I could modify the query to be ‘free*’ and not freedom.

But if ‘liberty’ is also to be ignored, I don’t want to add ‘and not liberty’ to that query,
because that could produce false negatives for documents containing free and liberty.

I think what I need to do is:
for each query
  for each ignore word
    if the query would match the ignore word,
      add ‘and not ignore word’ to the query
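
The loop above could be sketched roughly as follows, without building an index. This is only a minimal sketch under several assumptions: each query is a single wildcard term using Lucene-style ‘*’ and ‘?’, terms and ignore words are already lowercased, and the rewritten query is emitted as classic query parser text. The class and method names (IgnoreWordExpander, wouldMatch, expand) are made up for illustration and are not part of Lucene; fuzzy and regex queries would need their own matchers.

```java
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not a Lucene class: tests ignore words against a wildcard term.
final class IgnoreWordExpander {

    // Translate a wildcard term ('*' = any run of characters, '?' = exactly one) into a regex.
    static Pattern wildcardToRegex(String wildcard) {
        StringBuilder sb = new StringBuilder();
        for (char c : wildcard.toCharArray()) {
            if (c == '*') {
                sb.append(".*");
            } else if (c == '?') {
                sb.append('.');
            } else {
                sb.append(Pattern.quote(String.valueOf(c))); // treat everything else literally
            }
        }
        return Pattern.compile(sb.toString());
    }

    // True if the wildcard term would match the whole ignore word.
    static boolean wouldMatch(String wildcard, String word) {
        return wildcardToRegex(wildcard).matcher(word).matches();
    }

    // Append an exclusion only for the ignore words this particular query would match.
    static String expand(String wildcard, List<String> ignoreWords) {
        StringBuilder sb = new StringBuilder(wildcard);
        for (String word : ignoreWords) {
            if (wouldMatch(wildcard, word)) {
                sb.append(" AND NOT ").append(word);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<String> ignore = List.of("freedom", "liberty", "secretary");
        System.out.println(expand("free*", ignore));   // free* AND NOT freedom
        System.out.println(expand("secret*", ignore)); // secret* AND NOT secretary
    }
}
```

With this approach ‘liberty’ is never added to the ‘free*’ query, so documents containing both free and liberty are not excluded.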

How can I test whether a query would match an ignore word without putting the ignore words into an index
and searching that index?
That seems like overkill.

To make matters worse, a query like A and B and C
won’t match an index of ignore words that contains C but not A or B.
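
One way around the A and B and C problem might be to test each clause of the query on its own instead of testing the whole conjunction. The sketch below assumes a deliberately simplified query grammar (terms joined by the word AND, with optional ‘*’ and ‘?’ wildcards); the PerClauseIgnoreCheck class and its methods are hypothetical names, and a real implementation would walk the parsed query’s clauses rather than split a string.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not a Lucene class: checks ignore words clause by clause.
final class PerClauseIgnoreCheck {

    // True if a single clause (a literal or wildcard term) matches the whole word.
    static boolean clauseMatches(String clause, String word) {
        StringBuilder regex = new StringBuilder();
        for (char c : clause.toCharArray()) {
            if (c == '*') {
                regex.append(".*");
            } else if (c == '?') {
                regex.append('.');
            } else {
                regex.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.matches(regex.toString(), word);
    }

    // Collect every ignore word that at least one clause would match,
    // regardless of whether the other clauses match it.
    static List<String> relevantIgnoreWords(String query, List<String> ignoreWords) {
        List<String> hits = new ArrayList<>();
        for (String clause : query.split("\\s+AND\\s+")) {
            for (String word : ignoreWords) {
                if (clauseMatches(clause.trim(), word) && !hits.contains(word)) {
                    hits.add(word);
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Only the third clause matches an ignore word, yet it is still found.
        System.out.println(relevantIgnoreWords("alpha AND bravo AND char*",
                List.of("charlie", "delta"))); // [charlie]
    }
}
```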

Thanks in advance, for any suggestions or advice,
David Shifflett
