lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Erick Erickson" <erickerick...@gmail.com>
Subject Re: 7GB index taking forever to return hits
Date Tue, 15 Aug 2006 01:20:29 GMT
I actually suspect that your process isn't hung, it's just taking forever
because it's swapping a lot. Like a really, really, really lot. Like more
than you ever want to deal with <G>.

I think you're pretty much forced, as Lin said, to use a filter. I was
pleasantly surprised at how quickly filters can be built, and if you have a
small set of wildcards you really expect, you can use a CachingWrapperFilter
to keep them around. NOTE: we're only recommending the filters for the
wildcard portions of the query.

For a generous explanation of wildcards in general from "the guys", search
the archive for a thread "I just don't get wildcards at all". There's quite
an explanation of wildcards, as well as some alternative indexing schemes to
make wildcard queries without wildcards. Which is faster but bloats the
index.

If you're unfamiliar with filters, you really need to become familiar with
them. Basically, they're a bitset with the bit corresponding to each
document turned on for terms that could match.

You'll probably want to use a WildcardTermEnum, or perhaps a RegexTermEnum,
and for each term, enumerate all the docs containing that term (see
TermDocs) and set your bit in the filter. When the filter is created, use it
to construct a ConstantScoreQuery that you add to your BooleanQuery. It
sounds more complicated than it is <G>.

Just to clarify Lin's comment. The filter won't be slower than what you are
doing now. But it will be slower than a straight-up query. Each filter will
be under a megabyte (one bit for each of 5.5M docs), so you could consider
caching a bunch of them if you expect a small number of terms to be
repeated. Or let CachingWrapperFilter do it for you (I think).

Note that ConstantScoreQuery loses relevance scoring (but you'll still get
relevance for the non-wildcard terms in your BooleanQuery). I don't consider
this a flaw since it's tricky defining how much *use* relevance really is
for wildcards.

Finally, your test queries (and I commend you for making limit-testing tests
before putting it in production) are about as bad as they get. From the
javadoc for WildcardQuery.

" In order to prevent extremely slow WildcardQueries, a Wildcard term should
not start with one of the wildcards * or ?"

Best
Erick



On 8/14/06, yueyu lin <popeyelin@gmail.com> wrote:
>
> To avoid "TooManyClauses", you can try Filter instead of Query. But that
> will be slower.
> Form what I see is that there are so many keys that match your query, it
> will be tough for Lucene.
>
> On 8/14/06, Van Nguyen <vnguyen@ur.com> wrote:
> >
> > It was how I was implementing the search.
> >
> > I am using a boolean query.  Prior to the 7GB index, I was searching
> > over a 150MB index that consist of a very small part of the bigger
> > index.  I was able to set my BooleanQuery to
> > BooleanQuery.setMaxClauseCount(Integer.MAX_VALUE) and that worked fine.
> > But I think that's the cause of my problem with this bigger index.
> > Commenting that out, I get an TooManyClause Exception.  A typical query
> > would look something like this:
> >
> > +CONTENTS:*white* +CONTENTS:*hard* +CONTENTS:*hat* +COMPANY_CODE:u1
> > +LANGUAGE:enu -SKU_DESC_ID:0 +IS_DC:d +LOCATION:b72
> >
> > BooleanQuery q = new BooleanQuery();
> >
> > WildcardQuery wc1 = new WildcardQuery("CONTENTS", "*white*");
> > WildcardQuery wc2 = new WildcardQuery("CONTENTS", "*hard*");
> > WildcardQuery wc3 = new WildcardQuery("CONTENTS", "*hat*");
> > q.add(wc1, BooleanClause.Occur.MUST);
> > q.add(wc2, BooleanClause.Occur.MUST);
> > q.add(wc3, BooleanClause.Occur.MUST);
> >
> > TermQuery t1 = new TermQuery("COMPANY_CODE", "u1");
> > q.add(t1, BooleanClause.Occur.MUST);
> >
> > TermQuery t2 = new TermQuery("LANGUAGE", "enu");
> > q.add(t2, BooleanClause.Occur.MUST);
> > .
> > .
> > .
> >
> > I take it this is not the most optimal way about this.
> >
> > So that leads me to my next question... What is the most optimal way
> > about this?
> >
> > Van
> >
> > -----Original Message-----
> > From: yueyu lin [mailto:popeyelin@gmail.com]
> > Sent: Monday, August 14, 2006 11:30 AM
> > To: java-user@lucene.apache.org
> > Subject: Re: 7GB index taking forever to return hits
> >
> > 2GB limitation only exists when you want to put them to memory in 32bits
> > box.
> > Our index size is larger than 13 giga bytes, and it works fine.
> > I think it must be something error in your design. You can use Luke to
> > see what happened in your index.
> >
> > On 8/14/06, Van Nguyen <vnguyen@ur.com> wrote:
> > >
> > >  Hi,
> > >
> > >
> > >
> > > I have a 7GB index (about 45 fields per document X roughly 5.5 million
> > > docs) running on a Windows 2003 32bit machine (dual proc, 2GB memory).
> >
> > > The index is optimized.  Performing a search on this index will just
> > > "hang" when performing the search (wild card query with a sort).  At
> > > first the CPU usage is 100%, then drops down to 50% after a minute or
> > > so, and then no CPU utilization... but the thread is still trying to
> > > perform the search.  I've tried this in my J2EE app and in a main
> > > program.  Is this due to the 2GB limitation of the 32bit OS (I didn't
> > > realize the index would be this big... just let it run over the
> > weekend).
> > >
> > >
> > >
> > > If this is due to the 2GB limitation of the 32bit OS and since I have
> > > this 7GB index built already (and optimized), is there a way to split
> > > this into 2GB indices w/o having to re-index?  Or is this due to
> > another factor?
> > >
> > >
> > >
> > > Van
> > >
> > > United Rentals
> > > Consider it done.(tm)
> > > 800-UR-RENTS
> > > unitedrentals.com
> > >
> > >
> > >
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > For additional commands, e-mail: java-user-help@lucene.apache.org
> > >
> > >
> >
> >
> > --
> > --
> > Yueyu Lin
> >
> > United Rentals
> > Consider it done.™
> > 800-UR-RENTS
> > unitedrentals.com
> >
> >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: java-user-help@lucene.apache.org
> >
> >
>
>
> --
> --
> Yueyu Lin
>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message