lucene-dev mailing list archives

From Doug Cutting <DCutt...@grandcentral.com>
Subject RE: searching words starting with accent characters using UTF-8
Date Tue, 11 Dec 2001 05:38:25 GMT
> From: Brian Goetz [mailto:brian@quiotix.com]
>
> It shouldn't be hard to fix this problem once and
> for all, and I'm happy to do so [ ... ]

Thanks!  That would be great!

> The query parser was later extended to support fuzzy and wildcard
> queries.  A few people here, including Doug and myself, questioned the
> wisdom of putting chainsaws in the hands of the untrained; innocent,
> inadvertent misuse of these features effectively constitutes a
> denial-of-service attack.  I'm of the opinion that these advanced
> features should be left to the programmatic query classes, or perhaps
> to a separate "power-user query parser", and not part of the basic
> query language.  This discussion was started about a month and a half
> ago, but never came to any agreement or even any real discussion.  

I have mixed feelings about this.

On one hand, I agree.  As it stands, with these features, it is very easy
for folks to construct queries that will be extremely slow and chew up gobs
of memory.  That is a bad thing.

On the other hand, there are many cases where prefix and perhaps wildcard
queries are very useful.  Some languages are much more morphologically
productive than English--a word can have hundreds of forms.  If we don't
provide a query parser that supports this, then folks might (a) ask about it
every week, (b) post poorly written hacks to implement it, (c) complain
about bugs in poorly written hacks they downloaded, and (d) all of the
above.

So we should be careful about how we add such features, but also try to add
features that are called for.

In the case of prefixes and wildcards, I think a minimum prefix length would
go a long way towards making these less dangerous.  Four or more
non-wildcard characters should be required at the beginning of terms.  We
might also place a limit on the number of terms that these are permitted to
expand to, throwing an exception when, e.g., more than 100 terms are
matched.
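
A minimal sketch of the two limits described above, in plain Java with no
Lucene dependency.  The class PrefixGuard, its method names, and the use of a
sorted set as a stand-in for the term dictionary are all assumptions made for
illustration, not existing Lucene code.  It handles only the prefix case (a
literal prefix followed by a wildcard).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedSet;

/** Hypothetical helper, not part of Lucene. */
final class PrefixGuard {
    static final int MIN_PREFIX = 4;   // "four or more non-wildcard characters"
    static final int MAX_TERMS = 100;  // matching more terms than this is an error

    /** Counts the leading characters before the first '*' or '?'. */
    static int literalPrefixLength(String pattern) {
        int i = 0;
        while (i < pattern.length()
                && pattern.charAt(i) != '*' && pattern.charAt(i) != '?') {
            i++;
        }
        return i;
    }

    /**
     * Expands a prefix pattern such as "luce*" against a sorted term set.
     * Anything after the first wildcard is ignored in this sketch.
     */
    static List<String> expandPrefix(String pattern, SortedSet<String> terms) {
        int n = literalPrefixLength(pattern);
        if (n < MIN_PREFIX) {
            throw new IllegalArgumentException("prefix too short: " + pattern);
        }
        String prefix = pattern.substring(0, n);
        List<String> matches = new ArrayList<>();
        // tailSet starts at the first term >= prefix; terms sharing the
        // prefix are contiguous in sorted order, so stop at the first miss.
        for (String term : terms.tailSet(prefix)) {
            if (!term.startsWith(prefix)) {
                break;
            }
            if (matches.size() == MAX_TERMS) {
                throw new IllegalStateException(
                        "pattern expands to more than " + MAX_TERMS + " terms");
            }
            matches.add(term);
        }
        return matches;
    }
}
```

Rejecting the pattern before touching the term set keeps the cheap check
first; the term cap then bounds the work for patterns that pass it.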

Another query feature that is commonly asked for is the ability to search
multiple fields with a single clause.  Perhaps the default field could be
specified as an array of field names to handle this?
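
One way this could look, sketched in plain Java: an unqualified term is
rewritten as an OR over every default field.  The class MultiFieldDefault and
the string-based rewrite are hypothetical; a real parser would presumably
build query objects rather than strings.

```java
import java.util.StringJoiner;

/** Hypothetical helper, not part of Lucene. */
final class MultiFieldDefault {
    /** Rewrites an unqualified term as an OR over every default field. */
    static String expand(String[] defaultFields, String term) {
        if (term.indexOf(':') >= 0) {
            return term; // already field-qualified, leave it alone
        }
        StringJoiner or = new StringJoiner(" OR ", "(", ")");
        for (String field : defaultFields) {
            or.add(field + ":" + term);
        }
        return or.toString();
    }
}
```

With default fields {"title", "body"}, the term lucene becomes
(title:lucene OR body:lucene), while author:doug passes through unchanged.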

And folks have also asked for a way to specify phrase slop, i.e. a NEAR or
WITHIN operator.  It would be great to have some support for this too in the
query parser.
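
To make the request concrete, here is a simplified sketch of what a WITHIN
test means at the token level: two terms match if they occur within a given
number of positions of each other, in either order.  The class Near is
hypothetical, and this is deliberately simpler than real phrase slop, which
applies to whole phrases rather than a pair of terms.

```java
import java.util.List;

/** Hypothetical helper, not part of Lucene. */
final class Near {
    /**
     * True if a and b occur within maxDistance token positions of each
     * other, in either order.
     */
    static boolean within(List<String> tokens, String a, String b, int maxDistance) {
        int lastA = -1;
        int lastB = -1;
        for (int i = 0; i < tokens.size(); i++) {
            String t = tokens.get(i);
            if (t.equals(a)) lastA = i;
            if (t.equals(b)) lastB = i;
            // Compare the most recent position of each term seen so far.
            if (lastA >= 0 && lastB >= 0 && Math.abs(lastA - lastB) <= maxDistance) {
                return true;
            }
        }
        return false;
    }
}
```

For the tokens "the quick brown fox jumps", quick and fox are within 2
positions of each other, while the and jumps are not.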

Yes, this is a lot of features, and a lot of syntax.  The query parser is
already complicated.  Perhaps we should instead write a number of example
query parsers that do different things, and encourage folks to write their
own, with these as models.  Unfortunately, I'm not sure many folks would do
that: instead they would ask why one parser doesn't have a feature that
another does.  So I'm having a hard time seeing a non-kitchen-sink
alternative.  Do you?

Doug
