lucene-java-user mailing list archives

From Erick Erickson <erickerick...@gmail.com>
Subject Re: QueryParserUtil, big query with wildcards -> runs endlessly and produces heavy load
Date Thu, 26 Jun 2014 16:30:38 GMT
I suspect you're getting leading wildcard searches as well, which must
do entire term scans unless you're doing the reverse trick.
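
For anyone unfamiliar with the reverse trick, here is a minimal sketch of the idea in plain Java, with no Lucene involved. Terms are stored reversed, so a leading-wildcard pattern such as *um becomes a prefix lookup on "mu" instead of a scan over every term. The class and method names (ReversedTermSketch, endingWith) are made up for illustration, and the TreeSet only stands in for a sorted term dictionary. In Lucene/Solr the same effect is usually obtained with a reversing token filter (ReverseStringFilter, or Solr's ReversedWildcardFilterFactory) rather than hand-written code like this.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

final class ReversedTermSketch {
    // Sorted reversed terms; a stand-in for the term dictionary of a "reversed" field.
    private final TreeSet<String> reversedTerms = new TreeSet<String>();

    static String reverse(String s) {
        return new StringBuilder(s).reverse().toString();
    }

    public void add(String term) {
        reversedTerms.add(reverse(term));
    }

    // Answers the leading-wildcard pattern "*suffix" as a prefix scan over reversed terms.
    public List<String> endingWith(String suffix) {
        String prefix = reverse(suffix);
        List<String> hits = new ArrayList<String>();
        // tailSet starts at the first reversed term >= prefix, so only candidates are visited.
        for (String rev : reversedTerms.tailSet(prefix)) {
            if (!rev.startsWith(prefix)) {
                break; // left the prefix range, nothing further can match
            }
            hits.add(reverse(rev));
        }
        return hits;
    }

    public static void main(String[] args) {
        ReversedTermSketch dict = new ReversedTermSketch();
        for (String t : new String[] { "lorem", "ipsum", "dolor", "amet", "sanctus" }) {
            dict.add(t);
        }
        System.out.println(dict.endingWith("um"));
    }
}
```

Note that this only helps when the wildcard is at the front of the term; it does nothing for a term with wildcards scattered throughout.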

Replacing each run of whitespace with '*' gives you:
Lorem*ipsum*dolor*sit*amet,*consetetur*sadipscing*elitr,*sed*diam*nonumy*eirmod*tempor*invidunt*ut*labore*et*dolore*magna*aliquyam*erat,*sed*diam*voluptua.*At*vero*eos*et*accusam*et*justo*duo*dolores*et*ea*rebum.*Stet*clita*kasd*gubergren,*no*sea*takimata*sanctus*est*Lorem*ipsum*dolor*sit*amet.*Lorem*ipsum*dolor*sit*amet,*consetetur*sadipscing*elitr,*sed*diam*nonumy*eirmod*tempor*invidunt*ut*labore*et*dolore*magna*aliquyam*erat,*sed*diam*voluptua.*At*vero*eos*et*accusam*et*justo*duo*dolores*et*ea*rebum.*Stet*clita*kasd*gubergren,*no*sea*takimata*sanctus*est*Lorem*ipsum*dolor*sit*amet

Note, no spaces. Then you're pushing it through the KeywordTokenizer
which does essentially nothing. What a term!
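
To make that concrete, here is a small standalone sketch on a shortened sample string (the class name SingleTermSketch is made up for illustration). It applies the same replaceAll as the test case and shows that nothing is left to split on afterwards. That is the situation KeywordTokenizer ends up in, since it emits its whole input as one token.

```java
final class SingleTermSketch {
    // Same substitution as the test case: every run of whitespace becomes one '*'.
    static String collapse(String text) {
        return text.replaceAll("\\s+", "*");
    }

    public static void main(String[] args) {
        String original = "Lorem ipsum  dolor\tsit amet";
        String term = collapse(original);
        System.out.println(term);                          // Lorem*ipsum*dolor*sit*amet
        System.out.println(original.split("\\s+").length); // 5 whitespace-separated words
        System.out.println(term.split("\\s+").length);     // 1, the whole text is one term
    }
}
```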

Your point is valid, however: why this takes so long I don't quite
know. But I tend to agree that it's such an edge case that the
hard-core FST guys would look at it for curiosity's sake only....
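
If the practical answer is "keep wildcards brief", one option is to reject such terms in application code before they ever reach the parser. The sketch below is only an assumption about how that could look. The class WildcardGuardSketch and the limit MAX_WILDCARDS are hypothetical and not part of Lucene, and the right threshold would have to be chosen by testing.

```java
final class WildcardGuardSketch {
    // Hypothetical application-level limit; not a Lucene setting.
    static final int MAX_WILDCARDS = 3;

    // Counts unescaped '*' and '?' characters in a single query term.
    static int countWildcards(String term) {
        int count = 0;
        for (int i = 0; i < term.length(); i++) {
            char c = term.charAt(i);
            if (c == '\\') {
                i++; // skip the escaped character that follows the backslash
                continue;
            }
            if (c == '*' || c == '?') {
                count++;
            }
        }
        return count;
    }

    static boolean isBriefEnough(String term) {
        return countWildcards(term) <= MAX_WILDCARDS;
    }

    public static void main(String[] args) {
        System.out.println(isBriefEnough("Lor*"));                       // true
        System.out.println(isBriefEnough("Lorem*ipsum*dolor*sit*amet")); // false, 4 wildcards
    }
}
```

A caller would check isBriefEnough(term) first and report an error to the user instead of handing the term to QueryParserUtil.parse.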

Best,
Erick


On Thu, Jun 26, 2014 at 5:34 AM, Jack Krupansky <jack@basetechnology.com> wrote:
> I'll defer to the hard-core Lucene committers for the technical details,
> but I would suggest that a very large term with dozens of wildcards is a
> "known limitation" (albeit not well-documented.) IOW, to use wildcards in
> Lucene in a performant manner, they need to be "brief".
>
> -- Jack Krupansky
>
> -----Original Message----- From: Clemens Wyss DEV
> Sent: Thursday, June 26, 2014 3:17 AM
> To: java-user@lucene.apache.org
> Subject: QueryParserUtil, big query with wildcards -> runs endlessly and
> produces heavy load
>
>
> The following "testcase" runs endlessly and produces VERY heavy load.
> ...
> String query = "Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut "
> + "labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et "
> + "ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet. "
> + "Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt "
> + "ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores "
> + "et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet";
> // Reassign rather than redeclare: a second "String query" would not compile.
> query = query.replaceAll( "\\s+", "*" );
> try
> {
> QueryParserUtil.parse( query, new String[] { "test" }, new Occur[] { Occur.MUST }, new KeywordAnalyzer() );
> }
> catch ( Exception e )
> {
> Assert.fail( e.getMessage() );
> }
> ...
> I don't say this testcase makes "sense", nevertheless the question remains
> whether this is a bug or a "feature"?
>
> Context: Lucene 4.7.2, Java 6
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org