lucene-solr-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Harald Kirsch <Harald.Kir...@raytion.com>
Subject Re: Solr irregularly having QTime > 50000ms, stracing solr cures the problem
Date Mon, 14 Jul 2014 11:45:25 GMT
This problem seems to completely disappear under load. I started making 
load tests despite fearing them to be useless. It turns out that there 
are no more 50000 ms delays under load.

Harald.

On 09.07.2014 09:50, Harald Kirsch wrote:
> Good point. I will see if I can get the necessary access rights on this
> machine to run tcpdump.
>
> Thanks for the suggestion,
> Harald.
>
> On 09.07.2014 00:32, Steve McKay wrote:
>> Sure sounds like a socket bug, doesn't it? I turn to tcpdump when Solr
>> starts behaving strangely in a socket-related way. Knowing exactly
>> what's happening at the transport level is worth a month of guessing
>> and poking.
>>
>> On Jul 8, 2014, at 3:53 AM, Harald Kirsch <Harald.Kirsch@raytion.com>
>> wrote:
>>
>>> Hi all,
>>>
>>> This is what happens when I run a regular wget query to log the
>>> current number of documents indexed:
>>>
>>> 2014-07-08:07:23:28 QTime=20 numFound="5720168"
>>> 2014-07-08:07:24:28 QTime=12 numFound="5721126"
>>> 2014-07-08:07:25:28 QTime=19 numFound="5721126"
>>> 2014-07-08:07:27:18 QTime=50071 numFound="5721126"
>>> 2014-07-08:07:29:08 QTime=50058 numFound="5724494"
>>> 2014-07-08:07:30:58 QTime=50033 numFound="5730710"
>>> 2014-07-08:07:31:58 QTime=13 numFound="5730710"
>>> 2014-07-08:07:33:48 QTime=50065 numFound="5734069"
>>> 2014-07-08:07:34:48 QTime=16 numFound="5737742"
>>> 2014-07-08:07:36:38 QTime=50037 numFound="5737742"
>>> 2014-07-08:07:37:38 QTime=12 numFound="5738190"
>>> 2014-07-08:07:38:38 QTime=23 numFound="5741208"
>>> 2014-07-08:07:40:29 QTime=50034 numFound="5742067"
>>> 2014-07-08:07:41:29 QTime=12 numFound="5742067"
>>> 2014-07-08:07:42:29 QTime=17 numFound="5742067"
>>> 2014-07-08:07:43:29 QTime=20 numFound="5745497"
>>> 2014-07-08:07:44:29 QTime=13 numFound="5745981"
>>> 2014-07-08:07:45:29 QTime=23 numFound="5746420"
>>>
>>> As you can see, the QTime is just over 50 seconds at irregular
>>> intervals.
>>>
>>> This happens independent of whether I am indexing documents with
>>> around 20 dps or not. First I thought about a dependence on the
>>> auto-commit of 5 minutes, but the the 50 seconds hits are too irregular.
>>>
>>> Furthermore, and this is *really strange*: when hooking strace on the
>>> solr process, the 50 seconds QTimes disappear completely and
>>> consistently --- a real Heisenbug.
>>>
>>> Nevertheless, strace shows that there is a socket timeout of 50
>>> seconds defined in calls like this:
>>>
>>> [pid  1253] 09:09:37.857413 poll([{fd=96, events=POLLIN|POLLERR}], 1,
>>> 50000) = 1 ([{fd=96, revents=POLLIN}]) <0.000040>
>>>
>>> where the fd=96 is the result of
>>>
>>> [pid 25446] 09:09:37.855235 accept(122, {sa_family=AF_INET,
>>> sin_port=htons(57236), sin_addr=inet_addr("ip address of local
>>> host")}, [16]) = 96 <0.000054>
>>>
>>> where again fd=122 is the TCP port on which solr was started.
>>>
>>> My hunch is that this is communication between the cores of solr.
>>>
>>> I tried to search the internet for such a strange connection between
>>> socket timeouts and strace, but could not find anything (the
>>> stackoverflow entry from yesterday is my own :-(
>>>
>>>
>>> This smells a bit like a race condition/deadlock kind of thing which
>>> is broken up by timing differences introduced by stracing the process.
>>>
>>> Any hints appreciated.
>>>
>>> For completeness, here is my setup:
>>> - solr-4.8.1,
>>> - cloud version running
>>> - 10 shards on 10 cores in one instance
>>> - hosted on SUSE Linux Enterprise Server 11 (x86_64), VERSION 11,
>>> PATCHLEVEL 2
>>> - hosted on a vmware, 4 CPU cores, 16 GB RAM
>>> - single digit million docs indexed, exact number does not matter
>>> - zero query load
>>>
>>>
>>> Harald.
>>
>>
>


Mime
View raw message