Mailing-List: contact solr-user-help@lucene.apache.org; run by ezmlm
Precedence: bulk
Reply-To: solr-user@lucene.apache.org
Received-SPF: pass (athena.apache.org: local policy)
Subject: Re: strange performance issue with many shards on one server
From: Toke Eskildsen <te@statsbiblioteket.dk>
Reply-To: te@statsbiblioteket.dk
To: "solr-user@lucene.apache.org" <solr-user@lucene.apache.org>
In-Reply-To: <A25F89C4C3214E2FAF078B66C2C330AA@gmail.com>
References: <A25F89C4C3214E2FAF078B66C2C330AA@gmail.com>
Content-Type: text/plain; charset="UTF-8"
Organization: State and University Library, Denmark
Date: Wed, 28 Sep 2011 16:40:25 +0200
Message-ID: <1317220825.3165.72.camel@te-prime>
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit

On Wed, 2011-09-28 at 12:58 +0200, Frederik Kraus wrote:
> - 10 shards per server (needed for response times) running in a single tomcat instance

Have you tested that sharding actually decreases response times in your
case? I see the idea in decreasing response times with sharding at the
cost of decreasing throughput, but the added overhead of merging is
non-trivial.

> - each query queries all 20 shards (distributed search)
> 
> - each shard holds about 1.5 mio documents (small shards are needed due to rather complex queries)
> - all caches are warmed / high cache hit rates (99%) etc.

> Now for some reason we cannot seem to fully utilize all CPU power (no disk IO), ie. increasing concurrent users doesn't increase CPU-Load at a point, decreases throughput and increases the response times of the individual queries.

It sounds as if there's a hard limit on the number of concurrent users
somewhere. I am no expert in httpclient, but the blocked threads in your
thread dump seems to indicate that they wait for connections to be
established rather than for results to be produced.

I seem to remember that tomcat has a default limit on 200 concurrent
connections and with 10 shards/search, that is just 200 / (10
shard_connections + 1 incoming_connection) = 18 concurrent searches.

> Also 1-2% of the queries take significantly longer: avg somewhere at 100ms while 1-2% take 1.5s or longer. 

Could be garbage collection, especially since it shows under high load
which might result in more old objects and thereby trigger full gc.