jackrabbit-dev mailing list archives

From: Christoph Kiehl <ki...@subshell.com>
Subject: Query performance for large query results
Date: Mon, 27 Nov 2006 10:09:29 GMT
Hi,

I noticed that the time a query takes scales with the total size of the query
result. Digging into the code, I found the following lines in
org.apache.jackrabbit.core.query.lucene.QueryImpl:

result = index.executeQuery(this, query, orderProperties, ascSpecs);
ids = new ArrayList(result.length());
scores = new ArrayList(result.length());

for (int i = 0; i < result.length(); i++) {
    NodeId id = NodeId.valueOf(result.doc(i).get(FieldNames.UUID));
    // check access
    if (accessMgr.isGranted(id, AccessManager.READ)) {
        ids.add(id);
        scores.add(new Float(result.score(i)));
    }
}


The first line, where the Lucene query is executed, is not the problem. The
problem apparently starts in the loop, where the UUID of _every_ matching
document is fetched from the index. If you get a search result with 10000+
documents, which we do, and you only need the first 20, this becomes a
bottleneck.
Two possible solutions come to mind:

1. Use a lazy QueryResultImpl that keeps a reference to the result and only
fetches the UUIDs of the nodes that are actually requested (first sketch
below). This implies that the access check is done in the QueryResultImpl,
and that the size returned by size() may shrink if you don't have access to
some nodes (which it already does if a node in the result gets deleted). The
real problem is how to trigger result.close(), which closes the index. I'm
not even sure whether it causes problems if indexes are not closed as soon
as possible.

2. Use a reverse DocNumberCache (second sketch below). To be really
effective this cache has to hold all docNum-to-UUID mappings, because even
if just 500 out of 10000 entries are uncached, that already causes a
performance hit. So a cache with a fixed size wouldn't be sufficient.
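
Here's roughly what I have in mind for the lazy result, just as a sketch:
LazyQueryResult and its method names are made up, and I'm assuming the hits
object returned by executeQuery() is the QueryHits from the snippet above
(import paths assumed from the current 1.x layout).

package org.apache.jackrabbit.core.query.lucene; // next to QueryImpl, so
                                                 // QueryHits and FieldNames resolve

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import javax.jcr.RepositoryException;

import org.apache.jackrabbit.core.NodeId;
import org.apache.jackrabbit.core.security.AccessManager;

public class LazyQueryResult {

    private final QueryHits result;       // open Lucene hits; must be closed eventually
    private final AccessManager accessMgr;
    private final List ids = new ArrayList();    // NodeIds fetched so far
    private final List scores = new ArrayList(); // matching Floats
    private int nextDoc = 0;              // next Lucene doc number to fetch

    public LazyQueryResult(QueryHits result, AccessManager accessMgr) {
        this.result = result;
        this.accessMgr = accessMgr;
    }

    /**
     * Fetches UUIDs and checks access until <code>count</code> accessible
     * rows are available or the hits are exhausted. Only this method touches
     * the index, so reading the first 20 rows of a 10000 hit result costs
     * roughly 20 document reads instead of 10000.
     */
    private void fetchTo(int count) throws IOException, RepositoryException {
        while (ids.size() < count && nextDoc < result.length()) {
            NodeId id = NodeId.valueOf(result.doc(nextDoc).get(FieldNames.UUID));
            // access check moved from query execution time to fetch time,
            // which is why size() can only be an upper bound up front
            if (accessMgr.isGranted(id, AccessManager.READ)) {
                ids.add(id);
                scores.add(new Float(result.score(nextDoc)));
            }
            nextDoc++;
        }
    }

    public NodeId getId(int i) throws IOException, RepositoryException {
        fetchTo(i + 1);
        return (NodeId) ids.get(i);
    }

    /** The open question: who calls this, and when? */
    public void close() throws IOException {
        result.close();
    }
}

getId() would still need a bounds check for indexes past the last accessible
row, but the ordering stays exactly as Lucene returned it.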
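
And for the reverse cache, again only a sketch: the class doesn't exist, and
in reality it would have to be kept per index reader and invalidated on
every index change, because Lucene doc numbers shift when segments are
merged.

package org.apache.jackrabbit.core.query.lucene; // so QueryHits and FieldNames resolve

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

public class ReverseDocNumberCache {

    // docNum (Integer) -> UUID (String); deliberately unbounded, per the
    // argument above that a fixed-size cache wouldn't be sufficient
    private final Map uuids = new HashMap();

    public String getUUID(QueryHits result, int docNum) throws IOException {
        Integer key = new Integer(docNum);
        String uuid = (String) uuids.get(key);
        if (uuid == null) {
            // cache miss: one stored-field read against the index
            uuid = result.doc(docNum).get(FieldNames.UUID);
            uuids.put(key, uuid);
        }
        return uuid;
    }

    /** Must be called whenever the index changes; doc numbers are not stable. */
    public void clear() {
        uuids.clear();
    }
}

The loop in QueryImpl would still visit every hit, but once the map is warm
each hit costs a HashMap lookup instead of a stored-field read from the
index.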

Any thoughts?

Cheers,
Christoph

