lucene-java-user mailing list archives

From Morus Walter <morus.wal...@gmx.de>
Subject Re: Query Parser AND / OR
Date Tue, 30 Dec 2003 22:19:38 GMT
Hi Dror,

> > I was thinking about this issue, and currently I think that the only way to 
> > define this type of query formally is to give the default operator its own
> > precedence relative to the precedence of 'OR' and 'AND'.
> > So there are two possibilities:
> > either the default operator has higher precedence than 'AND' or lower than 
> > 'OR'.
> > For default OR in the first case
> > `a OR b c +d' would be equal to `(a OR b) c +d' == (a b) c +d
> > in the second to `a OR (b c +d)' == a (b c +d) 
> > For default AND one has `+(a b) +c +d' and `a (+b +c +d)'
> > 
> > (a b) c +d searches all documents containing d; occurrences of a, b and c 
> > influence scoring
> > a (b c +d) searches documents containing `a' joined with documents 
> > containing `d' (where b and c influence scoring)
> > Now, what's closer to what one might have meant by `a OR b c +d'?
> > 
> > +(a b) +c +d searches documents containing c, d and either a or b.
> > a (+b +c +d) searches documents containing a or each of b, c and d.
> 
> I don't think this is a good idea. Mostly because it would be hard to
> explain/document, and you don't want end users to have to think and read
> a lot of documentation when doing a search.
> 
> For one thing, I would advocate for using the '+' notation as the
> underlying syntax and migrating to boolean operators, since many
> more people are used to that syntax, and I believe it's better
> understood.
> 
I'm not sure if I understand what you mean here.

> > 
> > The other alternative would be to forbid queries mixing default operators and
> > explicit and/or. This is what I'd probably vote for at the moment.
> 
> At first I was inclined to agree but as a rule I think we should adopt
> the WWGD (What Would Google Do) philosophy, since that's the syntax and
> behavior that most people are used to.
> 
> It looks like it basically adds an "AND" between any two terms that
> don't have an operator between them. We could do the same for both the
> default AND and the default OR. Once you've done that, you just use the
> standard boolean logic precedence rule.
> 
Hmm. Then you lose the possibility to create BooleanQuery objects where
some of the terms are required, some forbidden and some have neither flag.
Having this possibility is the reason why I say that implicit AND/OR and 
explicit AND/OR need to be different things.
If an implicit OR equals an explicit OR, you would have '+a +b' = '+a OR +b' 
= '(+a) OR (+b)' = 'a OR b', which is probably not what was intended.
So either the '+' operator is removed, or it is used as an alternative to AND,
in which case it could not be a prefix. So instead of '+a +b' one would use 
'a + b'.
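
To make the collapse concrete, here is a minimal sketch in Python. It is not
Lucene code: the Term/Bool classes and the matches() function are hypothetical
stand-ins that only model the required/prohibited/neither flags described
above, under the assumption that a query with required clauses matches on
those alone and a query without them needs at least one optional clause.

```python
from dataclasses import dataclass

# Flags modelled on the three clause states discussed above (assumed names).
MUST, SHOULD, MUST_NOT = "+", "", "-"

@dataclass
class Term:
    text: str

@dataclass
class Bool:
    clauses: list  # list of (flag, query) pairs

def matches(query, doc):
    """Return True if the set of words `doc` satisfies `query`."""
    if isinstance(query, Term):
        return query.text in doc
    required = [q for f, q in query.clauses if f == MUST]
    optional = [q for f, q in query.clauses if f == SHOULD]
    prohibited = [q for f, q in query.clauses if f == MUST_NOT]
    if any(matches(q, doc) for q in prohibited):
        return False
    if required:
        # Optional clauses would only influence scoring here.
        return all(matches(q, doc) for q in required)
    return any(matches(q, doc) for q in optional)

a, b = Term("a"), Term("b")
both_required = Bool([(MUST, a), (MUST, b)])        # '+a +b'
either_optional = Bool([(SHOULD, a), (SHOULD, b)])  # 'a OR b'
nested = Bool([(SHOULD, Bool([(MUST, a)])),         # '(+a) OR (+b)'
               (SHOULD, Bool([(MUST, b)]))])

docs = [{"a"}, {"a", "b"}, {"b"}, {"c"}]
for name, q in [("+a +b", both_required),
                ("a OR b", either_optional),
                ("(+a) OR (+b)", nested)]:
    print(name, [matches(q, d) for d in docs])
```

In this model '(+a) OR (+b)' matches the same documents as 'a OR b', while
'+a +b' matches only the document containing both terms.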

A consequence of pure boolean operators is that there won't be a way of 
serializing an arbitrary query to a parsable string in standard query parser 
syntax.
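
As a rough illustration of what would be lost, the following Python sketch
round-trips a flat query with mixed flags through the prefix notation. The
parse_flat() and serialize_flat() helpers are hypothetical and far simpler
than the real query parser (no nesting, no fields, no phrases); they only
show that each of the three flag states has its own spelling.

```python
def parse_flat(text):
    """Split a flat query into (flag, term) pairs; flag is '+', '-' or ''."""
    clauses = []
    for token in text.split():
        if token[0] in "+-":
            clauses.append((token[0], token[1:]))
        else:
            clauses.append(("", token))
    return clauses

def serialize_flat(clauses):
    """Write (flag, term) pairs back out in prefix notation."""
    return " ".join(flag + term for flag, term in clauses)

# One required, one unflagged and one forbidden term.
query = [("+", "a"), ("", "b"), ("-", "c")]
text = serialize_flat(query)
print(text)
print(parse_flat(text) == query)
```

With only AND/OR/NOT there is no obvious spelling for the unflagged 'b' next
to a required 'a': 'a AND b' would make b required, and 'a OR b' would drop
the requirement on a.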

So for completeness and compatibility with the current query parser, I would 
keep the current behaviour of queries without explicit boolean operators.

The problem for users isn't that big IMHO.
Unless a user decides to make use of the '+' operator, things are pretty clear:
a b c searches for documents containing at least one or all of these terms
(depending on the default operator). Using terms with the '-' operator also
does what one expects. Only if the user starts to use the '+' operator
explicitly do things get more complicated. So he just shouldn't do that unless
he knows what he is doing.
The same thing applies to queries using AND/OR, as long as you don't mix them
with implicit operators. IMO whoever does the latter gets what he deserves
if he has to deal with the difficulties of such queries. One just should
not do that, and it should be pretty clear that the meaning of such a query
is unclear (unless parentheses are used, in which case there is no mixing
any longer).
That is why I think my patch is good enough, even if it leaves the evaluation
of such queries without a clear definition.

> Now the good news on all of this is that it seems (I did a small test),
> that if you use parentheses the parser does the right thing. In my mind,
> it's a good idea to use parentheses whenever you're creating complex
> expressions.
> 
Sure. All we are talking about is what happens if there are no explicit
parentheses. If you use parentheses you break the query into simple parts 
(e.g. (a AND b) OR (c AND d) is two queries of type 'x AND y' and one
query of type 'x OR y', where x and y are queries, not just terms), which
are handled correctly even by the current query parser.
That's one of the reasons why this hasn't been a big problem in the past.
If you use (a AND b) OR (c AND d) you will get what you expect.
It's just that I think the query parser should also create a reasonable 
query if the parentheses are removed.
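
To show how much depends on where the missing parentheses are assumed, here
is a small Python sketch comparing the four groupings of `a OR b c +d' quoted
at the top of this message. It is a toy model, not Lucene code: the matches()
function, the document names and the hits() helper are all made up for the
example, and scoring is ignored.

```python
def matches(query, doc):
    """query is a term (str) or a list of (flag, subquery) pairs,
    where flag is '+' (required), '-' (forbidden) or '' (neither)."""
    if isinstance(query, str):
        return query in doc
    if any(flag == "-" and matches(q, doc) for flag, q in query):
        return False
    required = [q for flag, q in query if flag == "+"]
    if required:
        return all(matches(q, doc) for q in required)
    return any(matches(q, doc) for flag, q in query if flag == "")

a_or_b = [("", "a"), ("", "b")]
groupings = {
    # default OR
    "(a b) c +d": [("", a_or_b), ("", "c"), ("+", "d")],
    "a (b c +d)": [("", "a"), ("", [("", "b"), ("", "c"), ("+", "d")])],
    # default AND
    "+(a b) +c +d": [("+", a_or_b), ("+", "c"), ("+", "d")],
    "a (+b +c +d)": [("", "a"), ("", [("+", "b"), ("+", "c"), ("+", "d")])],
}

# Hypothetical documents, each reduced to the set of terms it contains.
docs = {
    "only_a": {"a"},
    "only_d": {"d"},
    "a_c_d": {"a", "c", "d"},
    "b_c_d": {"b", "c", "d"},
    "b_c": {"b", "c"},
}

def hits(query):
    """Names of the sample documents matched by the query, sorted."""
    return sorted(name for name, doc in docs.items() if matches(query, doc))

for text, query in groupings.items():
    print(text, "->", hits(query))
```

Each grouping selects a different set of the sample documents, which is why
the unparenthesized form needs either a defined precedence or a rule against
mixing.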

Morus

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org

