lucene-java-user mailing list archives

From "Erick Erickson" <erickerick...@gmail.com>
Subject Re: WilcardQuery and memory
Date Fri, 09 Mar 2007 13:10:21 GMT
You can also use a filter. The basic idea is to construct a Lucene
Filter, probably by enumerating the matching terms with something like
RegexTermEnum and collecting their documents via TermDocs.
It's faster than you think <G>. This, in combination with
ConstantScoreQuery, should fix you right up. Several things:

1> You lose scoring for the filter part of the query with this technique.
If you search on, say, erick AND h*s, your query would be scored
entirely by the 'erick' clause. But you'd still match only documents
with 'erick' and any of the terms 'hoss' 'hiss' 'his' 'hers' .....

2> The filter's size is capped at (total docs / 8) bytes, since a filter
is really a bitmap with one bit position for each doc in your index.
For your 500,000 docs that is roughly 62,500 bytes.

3> Filters can be cached; see CachingWrapperFilter.
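
To make the bitmap idea concrete, here is a minimal sketch in plain Java.
It does not use the Lucene API: the postings map is a toy stand-in for
what RegexTermEnum/TermDocs would give you, and the names
WildcardFilterSketch, bitsFor and filterBytes are made up for
illustration. Every term matching the pattern sets the bits of its
documents, so memory depends on the number of docs rather than on the
number of matching terms.

```java
import java.util.BitSet;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Pattern;

// Hypothetical illustration only; not Lucene code.
class WildcardFilterSketch {

    // postings: term -> ids of the documents containing that term
    // (a toy replacement for walking a TermEnum and its TermDocs).
    static BitSet bitsFor(Map<String, int[]> postings, String regex, int maxDoc) {
        Pattern pattern = Pattern.compile(regex);
        BitSet bits = new BitSet(maxDoc);
        for (Map.Entry<String, int[]> entry : postings.entrySet()) {
            if (pattern.matcher(entry.getKey()).matches()) {
                for (int doc : entry.getValue()) {
                    bits.set(doc); // one bit per document, however many terms match
                }
            }
        }
        return bits;
    }

    // Upper bound on the bitmap size: one bit per document, rounded up to whole bytes.
    static long filterBytes(int maxDoc) {
        return (maxDoc + 7L) / 8;
    }

    public static void main(String[] args) {
        Map<String, int[]> postings = new TreeMap<>();
        postings.put("erick", new int[] {1, 3});
        postings.put("hoss", new int[] {0, 3});
        postings.put("hiss", new int[] {3});
        postings.put("his", new int[] {3, 5});
        postings.put("hers", new int[] {7});

        // "h.*s" is the regex form of the wildcard h*s
        System.out.println("docs matching h*s: " + bitsFor(postings, "h.*s", 8));
        System.out.println("bytes for 500,000 docs: " + filterBytes(500000));
    }
}
```

Combining this bitmap with the 'erick' clause is then a matter of keeping
only the 'erick' hits whose bit is set, which is why the wildcard part
contributes nothing to the score.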

Hope this helps
Erick
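
On the fragment approach suggested in the quoted message below, a rough
sketch of the tokenising step in plain Java might look like this. The
splitting rule (local part, full domain, then each domain label) is an
assumption generalised from the fischauto333 example, and the class and
method names are hypothetical; in a real index this logic would sit in
whatever analyzer feeds the extra field.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration only; not Lucene code.
class AddressFragments {

    // Splits an e-mail address into whole terms that can be queried
    // without wildcards: local part, full domain, and each domain label.
    static List<String> fragments(String address) {
        List<String> out = new ArrayList<>();
        String lower = address.toLowerCase(Locale.ROOT);
        int at = lower.lastIndexOf('@');
        if (at < 0) {
            out.add(lower); // not an address; keep it as a single term
            return out;
        }
        String local = lower.substring(0, at);
        String domain = lower.substring(at + 1);
        if (!local.isEmpty()) {
            out.add(local);
        }
        if (!domain.isEmpty()) {
            out.add(domain);
            for (String label : domain.split("\\.")) {
                // skip empty labels and avoid repeating a single-label domain
                if (!label.isEmpty() && !label.equals(domain)) {
                    out.add(label);
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(fragments("fischauto333@yahoo.de"));
    }
}
```

A search for the exact term fischauto333 in the fragments field then
needs a single clause, instead of one clause per term that the prefix
fischauto333* expands to.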

On 3/9/07, Rob Staveley (Tom) <rstaveley@seseit.com> wrote:
>
> For indexing e-mail, I recommend that you tokenise the e-mail addresses
> into fragments and query on the fragments as whole terms rather than
> using wildcards.
>
> Rather than looking for fischauto333* in (say) smtp-from, look for the
> whole term fischauto333 in (say) an additional field called
> smtp-from-fragments (which also contains the terms yahoo.de, yahoo
> and de).
>
> Having whole e-mail addresses in terms and using prefix/wildcard queries
> inevitably results in too many clauses.
>
>
> -----Original Message-----
> From: Joe [mailto:fischauto333@yahoo.de]
> Sent: 09 March 2007 12:08
> To: java-user@lucene.apache.org
> Subject: WilcardQuery and memory
>
> Hi,
>
> Here we use Lucene to index our e-mails, currently 500,000 documents.
> When searching the body with a WildcardQuery, problems arise.
>
> I did some profiling with JProfiler. I see that the more BooleanClause
> instances are used, the more memory is required during search.
> Most memory is used by instances of the classes SegmentTermDocs and
> CompoundFileReader.CSIndexInput:
> nearly 30 MB per 1000 BooleanClause instances.
>
> So if I limit the BooleanClauses with
> BooleanQuery.setMaxClauseCount(5000)
> and more and more e-mails get indexed, will this lead to an
> OutOfMemoryError,
> or will the 30 MB/1000 clauses ratio still hold?
>
> What should I do to prevent OOM without reducing
> BooleanQuery.setMaxClauseCount() too much?
>
