lucene-java-user mailing list archives

From Otis Gospodnetic <>
Subject Re: Analyzing performance and memory consumption for boolean queries
Date Wed, 24 Jun 2009 03:16:16 GMT


Based on the description, I'd suspect an unnecessarily(?) large JVM heap and insufficient RAM
left over for the OS to cache the actual index.  Run vmstat while querying the index and watch
these columns: bi, bo, si, so, wa, and id. :)  If what I said above is correct, then you should
see more data loaded from disk (bi) during those slow queries, and probably a jump in the wa
column if you are running multiple concurrent queries.
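
If it helps, here is a rough Python sketch of how captured vmstat output could be checked for the pattern described above. It assumes the default Linux vmstat layout (a banner line starting with "procs", then a column-name line starting with "r"). The function names and the wa threshold of 20 are made up for illustration and are not a recommended cutoff.

```python
def parse_vmstat(text):
    """Parse captured `vmstat <interval>` output into a list of {column: int} dicts."""
    columns = None
    samples = []
    for line in text.splitlines():
        fields = line.split()
        if not fields or fields[0].startswith("procs"):
            continue  # blank line or the group banner line
        if fields[0] == "r":
            columns = fields  # column-name header (vmstat repeats it periodically)
            continue
        if columns and len(fields) == len(columns):
            samples.append(dict(zip(columns, map(int, fields))))
    return samples


def summarize(samples, wa_threshold=20):
    """Summarize the columns of interest; wa_threshold is an arbitrary assumption."""
    return {
        # Any swap-in/swap-out activity suggests the box is paging.
        "swapping": any(s["si"] > 0 or s["so"] > 0 for s in samples),
        # Peak blocks read from disk in one interval.
        "max_bi": max((s["bi"] for s in samples), default=0),
        # Number of intervals where the CPU was mostly waiting on I/O.
        "io_wait_samples": sum(1 for s in samples if s["wa"] >= wa_threshold),
    }
```

To use it, capture something like `vmstat 1` to a file while a slow query runs, then pass the file contents to parse_vmstat() and the result to summarize(). A high max_bi together with several io_wait_samples during slow queries would be consistent with the index not fitting in the OS cache.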

Sematext -- Lucene - Solr - Nutch

----- Original Message ----
> From: Nigel <>
> To:
> Sent: Tuesday, June 23, 2009 4:53:09 PM
> Subject: Analyzing performance and memory consumption for boolean queries
> Our query performance is surprisingly inconsistent, and I'm trying to figure
> out why.  I've realized that I need to better understand what's going on
> internally in Lucene when we're searching.  I'd be grateful for any answers
> (including pointers to existing docs, if any).
> Our situation is this: We have roughly 250 million docs spread across four
> indexes.  Each doc has about a dozen fields, all stored and most indexed.
> (They're the usual document things like author, date, title, contents,
> etc.)  Queries differ in complexity but always have at least a few terms in
> boolean combination, up to some larger queries with dozens or even hundreds
> of terms combined with ands, ors, nots, and parens.  There's no sorting,
> even by relevance: we just want to know what matches.  Query performance is
> often sub-second, but not infrequently it can take over 20 seconds (we use the
> time-limited hit collector, so anything over 20 seconds is stopped).
> Obviously the more complex queries are slower on average, but a given query
> can sometimes be much slower or much faster.
> My assumption is that we're having memory problems or disk utilization
> problems or both.  Our app has a 5gb JVM heap on an 8gb server with no other
> user processes running, so we shouldn't be paging and should have some room
> for Linux disk cache.  The server is lightly loaded and concurrent queries
> are the exception rather than the norm.  Two of the four indexes are updated
> a few times a day via rsync and subsequently closed and re-opened, but poor
> query performance doesn't seem to be correlated with these times.
> So, getting to some specific questions:
> 1) How is the inverted index for a given field structured in terms of what's
> in memory and what's on disk?  Is it dynamic, based on available memory, or
> tuneable, or fixed?  Is there a rule of thumb that could be used to estimate
> how much memory is required per indexed field, based on the number of terms
> and documents?  Likewise, is there a rule of thumb to estimate how many disk
> accesses are required to retrieve the hits for that field?  (I'm thinking,
> by perhaps false analogy, of how a database maintains a b-tree structure
> that may reside partially in RAM cache and partially in disk pages.)
> 2) When boolean queries are searched, is it as simple as iterating the hits
> for each ANDed or ORed term and applying the appropriate logical operators
> to the results?  For example, is searching for "foo AND bar" pretty much the
> same resource-wise as doing two separate searches, and therefore should the
> query performance be a linear function of the number of search
> terms?  Or is there some other caching and/or decision logic (perhaps kind
> of like a database's query optimizer) at work here that makes the I/O and
> RAM requirements more difficult to model from the query?  (Remember that
> we're not doing any sorting.)
> I'm hoping that with some of this knowledge, I'll be able to better model
> the RAM and I/O usage of the indexes and queries, and thus eventually
> understand why things are slow or fast.
> Thanks,
> Chris
