lucene-java-user mailing list archives

From Lebiram <lebi...@ymail.com>
Subject Re: Search using MultiSearcher generates OOM on a 1GB total Partitioned indeces
Date Thu, 02 Apr 2009 13:55:06 GMT
Hi Erick

The query was test data, written in anticipation of searches across all 4 indices (12 million
docs in total) that should yield very few results. Obviously that query would not occur in
real life, but it did break the system.
If a user simply entered random words, the system would be brought to its knees and
eventually die.

Essentially, each of our Lucene indices has about 8 fields; 1 field (timestamp) is used as a
filter, and the rest are normal fields that can accept wildcards.

You have a point about Filters being useful for a few of our other fields. I'll apply that.
That leaves about 5 fields that allow fuzzy search.

Which brings us back to the max clause problem. Lucene's default max clause count is 1024; is
there any reason behind this limit?

Thanks, 

M


________________________________
From: Erick Erickson <erickerickson@gmail.com>
To: java-user@lucene.apache.org
Sent: Thursday, April 2, 2009 2:34:47 PM
Subject: Re: Search using MultiSearcher generates OOM on a 1GB total  Partitioned indeces

Ah, I get it now. Given that you bumped your max clause up, it makes
sense. I'm pretty sure that the wildcard expansion is the root of your
memory problems. The folks on the list helped me out a lot understanding
what wildcards were about, see the thread titled "I just don't get wildcards
at all" in the searchable archives from several years ago...
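
To make the expansion concrete, here is a minimal plain-Java sketch (no
Lucene dependency) of what a prefix wildcard amounts to: every term in the
field's sorted term dictionary that starts with the prefix becomes one OR
clause. The class and method names and the MAX_CLAUSES constant are
hypothetical stand-ins for illustration, not Lucene's actual API; only the
value 1024 comes from this thread.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

public class WildcardExpansionSketch {
    // Hypothetical stand-in for the clause limit discussed in the thread.
    static final int MAX_CLAUSES = 1024;

    /** Collects every term that starts with the prefix, one clause per term. */
    static List<String> expand(SortedSet<String> terms, String prefix) {
        List<String> clauses = new ArrayList<>();
        // In sorted order, all terms sharing a prefix are contiguous,
        // beginning at the first term >= the prefix itself.
        for (String term : terms.tailSet(prefix)) {
            if (!term.startsWith(prefix)) {
                break; // walked past the last matching term
            }
            clauses.add(term);
            if (clauses.size() > MAX_CLAUSES) {
                throw new IllegalStateException("too many clauses for " + prefix + "*");
            }
        }
        return clauses;
    }

    public static void main(String[] args) {
        SortedSet<String> terms =
                new TreeSet<>(Arrays.asList("aa", "ab", "ac", "ad", "ae", "ba"));
        // Prints the OR clause that stands in for a*
        System.out.println("(" + String.join(" OR ", expand(terms, "a")) + ")");
    }
}
```

The memory cost grows with the number of matching terms, so raising the
limit far above the default removes the guard that would otherwise stop a
query like this early.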

Why do you want to generate queries of the form you showed? I'm
wondering if this is an XY problem and if you gave us a higher level
description of the problem you're trying to solve we'd be able to
suggest other approaches. I have a really hard time imagining a use
case where a user is well served by a clause that says
"any document that has word beginning with g and h and d and s.....",
so I'm assuming you're trying to solve something specific to your
domain.....

But if you really, truly do require this form, consider Filters. If your
problem really requires single-letter starts, consider creating 26
Filters at startup time and use those (see ConstantScoreQuery).
That'll chew up about 1.5M each of memory, faaaaar less than
you're consuming presently and will be blazingly fast. If you're
not limited to single-characters, *still* consider filters. They'll
consume little memory and are quite speedy to construct.
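
The 1.5M figure is presumably one bit per document: 12,000,000 docs / 8
= 1,500,000 bytes per filter, so 26 filters come to roughly 39 MB. The
following plain-Java sketch shows that arithmetic and how two such bit sets
combine with AND. It uses java.util.BitSet as a stand-in for a cached
filter; the class name, the method names and the tiny in-memory "index" are
assumptions for illustration, not Lucene's API.

```java
import java.util.BitSet;

public class PrefixFilterSketch {
    /** Bytes needed to hold one bit per document. */
    static long bytesPerFilter(long docCount) {
        return (docCount + 7) / 8;
    }

    /** Sets bit docId when any word of that document starts with the letter. */
    static BitSet buildPrefixFilter(String[] docs, char letter) {
        BitSet bits = new BitSet(docs.length);
        for (int docId = 0; docId < docs.length; docId++) {
            for (String word : docs[docId].split("\\s+")) {
                if (!word.isEmpty() && word.charAt(0) == letter) {
                    bits.set(docId);
                    break; // one matching word is enough for this doc
                }
            }
        }
        return bits;
    }

    public static void main(String[] args) {
        long perFilter = bytesPerFilter(12_000_000L);
        System.out.println(perFilter + " bytes per filter");
        System.out.println(26 * perFilter + " bytes for 26 filters");

        // Hypothetical three-document index, built once at startup.
        String[] docs = {"good hat", "green door", "hello"};
        BitSet g = buildPrefixFilter(docs, 'g');
        BitSet h = buildPrefixFilter(docs, 'h');

        // Equivalent of +content:g* +content:h*: intersect the two bit sets.
        BitSet both = (BitSet) g.clone();
        both.and(h);
        System.out.println("docs matching g* AND h*: " + both);
    }
}
```

The cost of each filter depends only on the number of documents, not on how
many distinct terms start with the letter, which is why it stays small no
matter how large the expansion would have been.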

Best
Erick


On Thu, Apr 2, 2009 at 5:04 AM, Lebiram <lebiram@ymail.com> wrote:

> Hi Erick,
>
> I did a search just as JVM started... so I'm thinking that the JVM is busy
> with some startup stuff... and that this search required more memory than
> what is available at that time.
>
> Had I done this search a while after the JVM had started, the query
> would have succeeded.
> I then pump in several similar queries running on a different thread, and they
> take a long time but still run to completion until one of them generates
> an OOM. But still, queries like this are just using too much memory.
>
> As for clauses, the BooleanQuery was set to max clause of... 9,000,000
> I'm guessing that might have caused the usage of too much memory?
>
> I'll try the explain you've suggested.
>
> Thanks,
>
> M
>
>
>
>
> ________________________________
> From: Erick Erickson <erickerickson@gmail.com>
> To: java-user@lucene.apache.org
> Sent: Wednesday, April 1, 2009 6:51:13 PM
> Subject: Re: Search using MultiSearcher generates OOM on a 1GB total
>  Partitioned indeces
>
> Think about putting this query in Luke and doing an "explain" for details,
> but....
>
> I'm surprised this is working at all without throwing TooManyClauses
> errors.
> Under the covers, Lucene expands your wildcards to all terms in the field
> that match. For instance, assume your document field has the following:
> aa
> ab
> ac
> ad
> ae
>
> Now, searching for a* produces a clause like:
> (aa OR ab OR ac OR ad OR ae) in place of the a*
>
> So your query is generating ginormous OR clauses, one that
> contains every term in your content field starting with 'g'. Another
> with every term in your content field starting with 'h' etc. So I suspect
> that your content field doesn't have very many distinct terms in it....
>
> As for why it's returning few entries, what does this part of your
> query return by itself? Since it's anded with your wildcard query,
> it might be what's limiting your results.
>
> ((+sender:cpuser9 +viewers:cpuser4) (+sender:cpuser4 +viewers:cpuser9)
> (+viewers:cpuser9 +viewers:cpuser4))
>
> But I'm puzzled, because saying that you're getting OOM errors
> doesn't square very well with getting *any* results at all, so is
> there something else going on?
>
> Best
> Erick@MoreQuestionsThanAnswers.
>
>
> On Wed, Apr 1, 2009 at 1:31 PM, Lebiram <lebiram@ymail.com> wrote:
>
> > Hi All,
> >
> > I have the following query on a 1GB index with about 12 million docs :
> > As you can see the terms consist of wildcards...
> >
> > query.toString()=+(+content:g* +content:h* +content:d* +content:s*
> > +content:a* +content:w* +content:b* +content:c* +content:m* +content:e*)
> > +((+sender:cpuser9 +viewers:cpuser4) (+sender:cpuser4 +viewers:cpuser9)
> > (+viewers:cpuser9 +viewers:cpuser4))
> >
> > The Searcher is a MultiSearcher with 4 IndexSearchers pointing to 4
> > different Lucene indexes.
> > This search returns relatively few records: several tens of thousands of hits.
> >
> > Java is assigned with 1GB max memory.
> >
> > Somehow this search eats the entire java heap space and causes OOM.
> > Looking at JProfiler, I see the org.apache.lucene package generating a lot of
> > objects, which I believe are taking up all this space.
> >
> > Can anyone explain the reason why this particular search might take so
> > much memory?
> > Is there anything I am doing wrong here?
> > More importantly, is there anything I can do to improve this?
> >
> > -M
> >
> >
> >
>
>
>
>
>



      