lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Lebiram <lebi...@ymail.com>
Subject Re: Search using MultiSearcher generates OOM on a 1GB total Partitioned indeces
Date Thu, 02 Apr 2009 16:33:21 GMT
I think I have looked at constant score queries however, the relevance is of value to the users
so we left it as is. :(

Erick's idea of stripping terms with wildcards that has less then an acceptable number of
characters is a good idea and I might try it once I get the time.

Thanks, 

M




________________________________
From: Mark Miller <markrmiller@gmail.com>
To: java-user@lucene.apache.org
Sent: Thursday, April 2, 2009 3:52:24 PM
Subject: Re: Search using MultiSearcher generates OOM on a 1GB total  Partitioned indeces

You might try a constant score wildcard query (similar to a filter) - I 
think you'd have to grab it from solr's codebase until 2.9 comes out 
though. No clause limit, and reportedly *much* faster on large indexes.

-- 
- Mark

http://www.lucidimagination.com



Lebiram wrote:
> Hi Erick
>
> The query was a test data basically in anticipation of searches on all indices (4 index)
with 12 million docs
> that should yield very small results. Obviously that query does not happen in real life
but it did break the system.
> If some user thought of just inputting random words then the system will be brought to
its knees and eventually die.
>
> Essentially, all our lucene index has about 8 fields; 1 field is being used as a filter
(timestamp)
> the rest are normal fields which can accept wildcards.
>
> You have a point in Filters being useful for a few other fields we do have. I'll apply
that.
> So that leaves about 5 fields that allows fuzzy search.
>
> Which goes back to the max clause problem. Lucene's default Max Clause is 1024, is there
any reason behind this max?
>
> Thanks, 
>
> M
>
>
> ________________________________
> From: Erick Erickson <erickerickson@gmail.com>
> To: java-user@lucene.apache.org
> Sent: Thursday, April 2, 2009 2:34:47 PM
> Subject: Re: Search using MultiSearcher generates OOM on a 1GB total  Partitioned indeces
>
> Ah, I get it now. Given that you bumped your max clause up, it makes
> sense. I'm pretty sure that the wildcard expansion is the root or your
> memory problems. The folks on the list helped me out a lot understanding
> what wildcards were about, see the thread titled "I just don't get wildcards
> at all" in the searchable archives from several years ago...
>
> Why do you want to generate queries of the form you showed? I'm
> wondering if this is an XY problem and if you gave us a higher level
> description of the problem you're trying to solve we'd be able to
> suggest other approaches. I have a really hard time imagining a use
> case where a user is well served by a clause that says
> "any document that has word beginning with g and h and d and s.....",
> so I'm assuming you're trying to solve something specific to your
> domain.....
>
> But if you really, truly do require this form, consider Filters. If your
> problem really requires single-letter starts, consider creating 26
> Filters at start up time and use those (see ConstantScoreQuery)
> That'll chew up about 1.5M each of memory, faaaaar less than
> you're consuming presently and will be blazingly fast. If you're
> not limited to single-characters, *still* consider filters. They'll
> consume little memory and are quite speedy to construct.
>
> Best
> Erick
>
>
> On Thu, Apr 2, 2009 at 5:04 AM, Lebiram <lebiram@ymail.com> wrote:
>
>  
>> Hi Erick,
>>
>> I did a search just as JVM started... so I'm thinking that the JVM is busy
>> with some startup stuff... and that this search required more memory than
>> what is available at that time.
>>
>> Had I done this search a while after the JVM has started, then this query
>> succeeds.
>> I then pump in several similar queries running on a different thread and it
>> takes a long time but still runs to completion until one of them generates
>> OOM.But still, queries like this is just using too much memory.
>>
>> As for clauses, the BooleanQuery was set to max clause of... 9,000,000
>> I'm guessing that might have caused the usage of too much memory?
>>
>> I'll try the explain on you've suggested.
>>
>> Thanks,
>>
>> M
>>
>>
>>
>>
>> ________________________________
>> From: Erick Erickson <erickerickson@gmail.com>
>> To: java-user@lucene.apache.org
>> Sent: Wednesday, April 1, 2009 6:51:13 PM
>> Subject: Re: Search using MultiSearcher generates OOM on a 1GB total
>>  Partitioned indeces
>>
>> Think about putting this query in Luke and doing an "explain" for details,
>> but....
>>
>> I'm surprised this is working at all without throwing TooManyClauses
>> errors.
>> Under the covers, Lucene expands your wildcards to all terms in the field
>> that match. For instance, assume your document field has the following:
>> aa
>> ab
>> ac
>> ad
>> ae
>>
>> Now, searching for a* produces a clause like:
>> (aa OR ab OR ac OR ad OR ae) in place of the a*
>>
>> So your query is generating ginormous OR clauses, one that
>> contains every term in your content field starting with 'g'. Another
>> with every term in your content field starting with 'h' etc. So I suspect
>> that your content field doesn't have very many distinct terms in it....
>>
>> As for why it's returning few entries, what does this part of your
>> query return by itself? Since it's anded with your wildcard query,
>> it might be what's limiting your results.
>>
>> ((+sender:cpuser9 +viewers:cpuser4) (+sender:cpuser4 +viewers:cpuser9)
>> (+viewers:cpuser9 +viewers:cpuser4))
>>
>> But I'm puzzled, because saying that you're getting OOM errors
>> doesn't square very well with getting *any* results at all, so is
>> there something else going on?
>>
>> Best
>> Erick@MoreQuestionsThanAnswers.
>>
>>
>> On Wed, Apr 1, 2009 at 1:31 PM, Lebiram <lebiram@ymail.com> wrote:
>>
>>    
>>> Hi All,
>>>
>>> I have the following query on a 1GB index with about 12 million docs :
>>> As you can see the terms consist of wildcards...
>>>
>>> query.toString()=+(+content:g* +content:h* +content:d* +content:s*
>>> +content:a* +content:w* +content:b* +content:c* +content:m* +content:e*)
>>> +((+sender:cpuser9 +viewers:cpuser4) (+sender:cpuser4 +viewers:cpuser9)
>>> (+viewers:cpuser9 +viewers:cpuser4))
>>>
>>> The Searcher is a MultiSearcher with 4 IndexSearchers pointing to 4
>>> different Lucene Index.
>>> This search returns very few records, several ten thousand hits.
>>>
>>> Java is assigned with 1GB max memory.
>>>
>>> Somehow this search eats the entire java heap space and causes OOM.
>>> Looking at jProfiler, i see org.apache.lucene package generating a lot of
>>> objects which I believe is taking all this space.
>>>
>>> Can anyone explain the reason why this particular search might take so
>>>      
>> much
>>    
>>> memory?
>>> Is there anything I am doing wrong here?
>>> More importantly, is there anything I can do to improve this?
>>>
>>> -M
>>>
>>>
>>>
>>>      
>>
>>
>>
>>    
>
>
>
>      
>  





---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


      
Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message