Date: Thu, 2 Apr 2009 10:46:17 -0400
Subject: Re: Search using MultiSearcher generates OOM on a 1GB total Partitioned indeces
From: Erick Erickson
To: java-user@lucene.apache.org

I didn't code it, so I'm speaking at least second hand....

It's a valid question whether having larger clauses is useful to the
user. Having a 1024-term OR clause isn't narrowing that much. Plus, I
think, it was a number that says, in effect, "you should know that this
is getting to be an expensive query, don't be surprised if it takes a
while". I suspect it's probably a semi-arbitrary way to define "big and
expensive".

Well, yes. If you allowed this pathological query through. Which you did
by bumping the max clauses parameter. In effect, you've been forced to
consider this issue in your design rather than be left wondering, months
from now when the system is in production, "why are some queries
slow?"...

One solution is to reject prefix queries that have fewer than, say,
three leading characters. The same goes for single-character wildcards.
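
As a rough illustration of rejecting short prefixes up front, here is a minimal sketch in plain Java. It is not Lucene code: `WildcardGuard`, `tooBroadTerms` and `MIN_PREFIX_LENGTH` are made-up names for this example, and the tokenization is deliberately naive (it assumes whitespace-separated `field:term` tokens and ignores escapes and quoted phrases).

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Hypothetical pre-processing guard (not part of Lucene): inspects the raw
 * query string and reports wildcard terms whose literal prefix is too short.
 */
final class WildcardGuard {
    /** Assumed threshold, taken from the "say, three" suggestion. */
    static final int MIN_PREFIX_LENGTH = 3;

    /** Returns the tokens that should be rejected; empty means the query may proceed. */
    static List<String> tooBroadTerms(String rawQuery) {
        List<String> rejected = new ArrayList<>();
        // Naive tokenization: split on whitespace and parentheses.
        for (String token : rawQuery.split("[\\s()]+")) {
            if (token.isEmpty()) {
                continue;
            }
            // Drop a leading +/- operator and an optional "field:" qualifier.
            String term = token.replaceFirst("^[+-]", "");
            int colon = term.indexOf(':');
            if (colon >= 0) {
                term = term.substring(colon + 1);
            }
            int firstWildcard = firstWildcardIndex(term);
            if (firstWildcard >= 0 && firstWildcard < MIN_PREFIX_LENGTH) {
                rejected.add(token);
            }
        }
        return rejected;
    }

    /** Index of the first '*' or '?' in the term, or -1 if there is none. */
    private static int firstWildcardIndex(String term) {
        for (int i = 0; i < term.length(); i++) {
            char c = term.charAt(i);
            if (c == '*' || c == '?') {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        String query = "+(+content:g* +content:hel* +content:d?g) +sender:cpuser9";
        List<String> rejected = tooBroadTerms(query);
        if (rejected.isEmpty()) {
            System.out.println("Query accepted");
        } else {
            System.out.println("Rejected, prefix too short in: " + rejected);
        }
    }
}
```

A check like this would run before the string is handed to the query parser, so the user gets a clear "please type more characters" message instead of a slow query or an OOM.
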
The number of terms that have to be ORed together drops drastically
with more characters.

I'd recommend you think first about rejecting pathological queries
rather than wasting a lot of time making them work. I had to get over
thinking that "any query should work, no matter how bizarre", which I
did by thinking about how much use a query that returned 100,000
results would actually be to a user.

You could limit pathology either by dropping the max booleans back to a
"reasonable" number, as defined by you, and responding with an
appropriate message, or by pre-processing the query and rejecting it up
front.

Best
Erick

On Thu, Apr 2, 2009 at 9:55 AM, Lebiram wrote:

> Hi Erick
>
> The query was test data, basically in anticipation of searches on all
> indices (4 indexes) with 12 million docs
> that should yield very small results. Obviously that query does not happen
> in real life but it did break the system.
> If some user thought of just inputting random words then the system will be
> brought to its knees and eventually die.
>
> Essentially, all our lucene index has about 8 fields; 1 field is being used
> as a filter (timestamp)
> the rest are normal fields which can accept wildcards.
>
> You have a point in Filters being useful for a few other fields we do have.
> I'll apply that.
> So that leaves about 5 fields that allow fuzzy search.
>
> Which goes back to the max clause problem. Lucene's default Max Clause is
> 1024, is there any reason behind this max?
>
> Thanks,
>
> M
>
>
> ________________________________
> From: Erick Erickson
> To: java-user@lucene.apache.org
> Sent: Thursday, April 2, 2009 2:34:47 PM
> Subject: Re: Search using MultiSearcher generates OOM on a 1GB total
> Partitioned indeces
>
> Ah, I get it now. Given that you bumped your max clause up, it makes
> sense. I'm pretty sure that the wildcard expansion is the root of your
> memory problems.
> The folks on the list helped me out a lot understanding
> what wildcards were about, see the thread titled "I just don't get
> wildcards at all" in the searchable archives from several years ago...
>
> Why do you want to generate queries of the form you showed? I'm
> wondering if this is an XY problem and if you gave us a higher level
> description of the problem you're trying to solve we'd be able to
> suggest other approaches. I have a really hard time imagining a use
> case where a user is well served by a clause that says
> "any document that has word beginning with g and h and d and s.....",
> so I'm assuming you're trying to solve something specific to your
> domain.....
>
> But if you really, truly do require this form, consider Filters. If your
> problem really requires single-letter starts, consider creating 26
> Filters at start up time and use those (see ConstantScoreQuery)
> That'll chew up about 1.5M each of memory, faaaaar less than
> you're consuming presently and will be blazingly fast. If you're
> not limited to single-characters, *still* consider filters. They'll
> consume little memory and are quite speedy to construct.
>
> Best
> Erick
>
>
> On Thu, Apr 2, 2009 at 5:04 AM, Lebiram wrote:
>
> > Hi Erick,
> >
> > I did a search just as JVM started... so I'm thinking that the JVM is busy
> > with some startup stuff... and that this search required more memory than
> > what is available at that time.
> >
> > Had I done this search a while after the JVM has started, then this query
> > succeeds.
> > I then pump in several similar queries running on a different thread and it
> > takes a long time but still runs to completion until one of them generates
> > OOM. But still, queries like this are just using too much memory.
> >
> > As for clauses, the BooleanQuery was set to max clause of... 9,000,000
> > I'm guessing that might have caused the usage of too much memory?
> >
> > I'll try the explain you've suggested.
> >
> > Thanks,
> >
> > M
> >
> >
> > ________________________________
> > From: Erick Erickson
> > To: java-user@lucene.apache.org
> > Sent: Wednesday, April 1, 2009 6:51:13 PM
> > Subject: Re: Search using MultiSearcher generates OOM on a 1GB total
> > Partitioned indeces
> >
> > Think about putting this query in Luke and doing an "explain" for details,
> > but....
> >
> > I'm surprised this is working at all without throwing TooManyClauses
> > errors.
> > Under the covers, Lucene expands your wildcards to all terms in the field
> > that match. For instance, assume your document field has the following:
> > aa
> > ab
> > ac
> > ad
> > ae
> >
> > Now, searching for a* produces a clause like:
> > (aa OR ab OR ac OR ad OR ae) in place of the a*
> >
> > So your query is generating ginormous OR clauses, one that
> > contains every term in your content field starting with 'g'. Another
> > with every term in your content field starting with 'h' etc. So I suspect
> > that your content field doesn't have very many distinct terms in it....
> >
> > As for why it's returning few entries, what does this part of your
> > query return by itself? Since it's anded with your wildcard query,
> > it might be what's limiting your results.
> >
> > ((+sender:cpuser9 +viewers:cpuser4) (+sender:cpuser4 +viewers:cpuser9)
> > (+viewers:cpuser9 +viewers:cpuser4))
> >
> > But I'm puzzled, because saying that you're getting OOM errors
> > doesn't square very well with getting *any* results at all, so is
> > there something else going on?
> >
> > Best
> > Erick@MoreQuestionsThanAnswers.
> >
> >
> > On Wed, Apr 1, 2009 at 1:31 PM, Lebiram wrote:
> >
> > > Hi All,
> > >
> > > I have the following query on a 1GB index with about 12 million docs:
> > > As you can see the terms consist of wildcards...
> > >
> > > query.toString()=+(+content:g* +content:h* +content:d* +content:s*
> > > +content:a* +content:w* +content:b* +content:c* +content:m* +content:e*)
> > > +((+sender:cpuser9 +viewers:cpuser4) (+sender:cpuser4 +viewers:cpuser9)
> > > (+viewers:cpuser9 +viewers:cpuser4))
> > >
> > > The Searcher is a MultiSearcher with 4 IndexSearchers pointing to 4
> > > different Lucene indexes.
> > > This search returns very few records, several ten thousand hits.
> > >
> > > Java is assigned with 1GB max memory.
> > >
> > > Somehow this search eats the entire java heap space and causes OOM.
> > > Looking at jProfiler, I see the org.apache.lucene package generating a lot of
> > > objects which I believe is taking all this space.
> > >
> > > Can anyone explain the reason why this particular search might take so much
> > > memory?
> > > Is there anything I am doing wrong here?
> > > More importantly, is there anything I can do to improve this?
> > >
> > > -M
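
To make the wildcard expansion described earlier in this thread concrete (a* turning into aa OR ab OR ac OR ad OR ae), here is a toy model in plain Java. It only illustrates the idea and is not how Lucene is implemented: `PrefixExpansionDemo` and `expand` are hypothetical names, the term dictionary is just a sorted set, and an `IllegalStateException` stands in for the TooManyClauses error.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

/**
 * Toy model of prefix expansion (illustration only, not Lucene's code):
 * a prefix such as "a*" becomes one OR clause per matching term.
 */
final class PrefixExpansionDemo {
    /** Mirrors the default limit of 1024 mentioned in the thread. */
    static final int MAX_CLAUSES = 1024;

    /** Returns every term starting with the prefix, or throws once the limit is exceeded. */
    static List<String> expand(SortedSet<String> terms, String prefix, int maxClauses) {
        List<String> clauses = new ArrayList<>();
        // tailSet starts at the first term >= prefix; matching terms are contiguous.
        for (String term : terms.tailSet(prefix)) {
            if (!term.startsWith(prefix)) {
                break;
            }
            if (clauses.size() == maxClauses) {
                // Stand-in for the TooManyClauses error discussed in the thread.
                throw new IllegalStateException("too many clauses for prefix " + prefix);
            }
            clauses.add(term);
        }
        return clauses;
    }

    public static void main(String[] args) {
        SortedSet<String> terms =
                new TreeSet<>(Arrays.asList("aa", "ab", "ac", "ad", "ae", "ba"));
        System.out.println(String.join(" OR ", expand(terms, "a", MAX_CLAUSES)));
    }
}
```

With a field holding many distinct terms, every extra single-letter prefix adds another list of this kind, which is why raising the clause limit to millions trades a clear error for heavy memory use.
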