lucene-solr-user mailing list archives

From Bill Oconnor <bocon...@plos.org>
Subject Re: Replicates not recovering after rolling restart
Date Fri, 22 Sep 2017 21:57:19 GMT

Thanks everyone for the responses.


I believe I have found the problem.


The type of _version_ is incorrect in our schema. This is a required field that is used
primarily by Solr itself.

Our schema typed it as type=int instead of type=long.
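
For reference, a minimal sketch of checking and then correcting the field over the Schema API,
assuming a managed schema on Solr 6.x and the collection name used later in this thread (with a
hand-edited schema.xml you would change the field definition and reload instead). The stock Solr
schema defines _version_ as a long; either way the documents must be re-indexed afterwards:

$ curl "http://solr-220:8983/solr/journals_stage/schema/fields/_version_"
$ curl -X POST -H 'Content-type:application/json' \
    "http://solr-220:8983/solr/journals_stage/schema" \
    -d '{"replace-field": {"name": "_version_", "type": "long", "indexed": true, "stored": true}}'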


I believe this number is used by the replication process to figure out what needs to be synced
on an individual replica. In our case Solr fills in the value during indexing, and it has chosen
a number that cannot be represented as an "int". As the replicas query the leader to determine
whether a sync is necessary, the leader throws an error while trying to format the response with
the large _version_ value. This process continues until the replicas give up.


I finally verified this with a simple query, _version_:*, which throws the same error but gives
more helpful information: "re-index your documents".
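
(For anyone reproducing this, a hedged sketch using the host and core names quoted later in the
thread; querying the mistyped field from any replica triggers the same NumberFormatException:)

$ curl "http://solr-220:8983/solr/journals_stage_shard1_replica1/select?q=_version_:*&rows=0"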


Thanks.





________________________________
From: Rick Leir <rleir@leirtech.com>
Sent: Friday, September 22, 2017 12:34:57 AM
To: solr-user@lucene.apache.org
Subject: Re: Replicates not recovering after rolling restart

Wunder, Erick

$ dc
16o
1578578283947098112p
15E83C95E8D00000
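
(For readers unfamiliar with dc: 16o sets the output radix to 16 and p prints the top of the
stack. A rough printf equivalent, assuming a 64-bit bash:)

$ printf '%x\n' 1578578283947098112
15e83c95e8d00000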

That is an interesting number. Is it, as a guess, machine instructions
or an address pointer? It does not look like UTF-8 or ASCII. Machine
code looks promising:


Disassembly:

0:  15 e8 3c 95 e8          adc    eax,0xe8953ce8
5:  d0 00                   rol    BYTE PTR [rax],1
....

ADC dest,src - modifies flags AF CF OF SF PF ZF; sums two binary operands, placing the
result in the destination.

ROL - Rotate Left

Registers: the 64-bit extension of eax is called rax.

Is that code possibly in the JVM executable? Or a random memory page.

cheers -- Rick

On 2017-09-20 07:21 PM, Walter Underwood wrote:
> 1578578283947098112 needs 61 bits. Is it being parsed into a 32 bit target?
>
> That doesn’t explain where it came from, of course.
>
> wunder
> Walter Underwood
> wunder@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
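
(A quick shell check of that arithmetic: Integer.MAX_VALUE is 2^31 - 1, while the offending
value lies between 2^60 and 2^61, so any 32-bit parse of it must fail:)

$ echo $(( 2**31 - 1 ))
2147483647
$ echo $(( 2**60 ))
1152921504606846976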
>
>
>> On Sep 20, 2017, at 3:35 PM, Erick Erickson <erickerickson@gmail.com> wrote:
>>
>> The numberformatexception is...odd. Clearly that's too big a number
>> for an integer, did anything in the underlying schema change?
>>
>> Best,
>> Erick
>>
>> On Wed, Sep 20, 2017 at 3:00 PM, Walter Underwood <wunder@wunderwood.org> wrote:
>>> Rolling restarts work fine for us. I often include installing new configs with that.
>>> Here is our script. Pass it any hostname in the cluster. I use the load balancer name.
>>> You’ll need to change the domain and the install directory of course.
>>>
>>> #!/bin/bash
>>>
>>> cluster=$1
>>>
>>> hosts=`curl -s "http://${cluster}:8983/solr/admin/collections?action=CLUSTERSTATUS&wt=json" | jq -r '.cluster.live_nodes[]' | sort`
>>>
>>> for host in $hosts
>>> do
>>>     host="${host}.cloud.cheggnet.com"
>>>     echo restarting Solr on $host
>>>     ssh $host 'cd /apps/solr6 ; sudo -u bin bin/solr stop; sudo -u bin bin/solr start -cloud -h `hostname`'
>>> done
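
(A hypothetical invocation, assuming the script above is saved as rolling-restart.sh and that
jq plus passwordless ssh/sudo are available; the load-balancer hostname is illustrative:)

$ ./rolling-restart.sh solr-lb.example.com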
>>>
>>>
>>> Walter Underwood
>>> wunder@wunderwood.org
>>> http://observer.wunderwood.org/  (my blog)
>>>
>>>
>>>> On Sep 20, 2017, at 1:42 PM, Bill Oconnor <boconnor@plos.org> wrote:
>>>>
>>>> Hello,
>>>>
>>>>
>>>> Background:
>>>>
>>>>
>>>> We have been successfully using Solr for over 5 years, and we recently made the decision
>>>> to move to SolrCloud. For the most part that has been easy, but we have repeated problems
>>>> with our rolling restarts, where servers remain functional but stay in Recovery until they
>>>> stop trying. We restarted because we increased the JVM memory from 12GB to 16GB.
>>>>
>>>>
>>>> Does anyone have any insight as to what is going on here?
>>>>
>>>> Is there a special procedure I should use for starting and stopping a host?
>>>>
>>>> Is it OK to do a rolling restart on all the nodes in a shard?
>>>>
>>>>
>>>> Any insight would be appreciated.
>>>>
>>>>
>>>> Configuration:
>>>>
>>>>
>>>> We have a group of servers with multiple collections. Each collection consists of one
>>>> shard and multiple replicas. We are running the latest stable version of SolrCloud, 6.6,
>>>> on Ubuntu LTS with Oracle Corporation Java HotSpot(TM) 64-Bit Server VM 1.8.0_66 (25.66-b17).
>>>>
>>>>
>>>> (collection)           (shard)       (replicas)
>>>>
>>>> journals_stage  ->  shard1  ->  solr-220 (leader), solr-223, solr-221, solr-222 (replicas)
>>>>
>>>>
>>>> Problem:
>>>>
>>>>
>>>> Restarting the system puts the replicas in a recovery state they never exit from. They
>>>> eventually give up after 500 tries. If I go to an individual replica and execute a query,
>>>> the data is still available.
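
(One way to watch the replica states during a restart, sketched with the same CLUSTERSTATUS API
and jq filtering that Walter's script earlier in this thread uses; collection and host names are
taken from this message:)

$ curl -s "http://solr-220:8983/solr/admin/collections?action=CLUSTERSTATUS&collection=journals_stage&wt=json" \
    | jq '.cluster.collections.journals_stage.shards.shard1.replicas[] | {core, state}'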
>>>>
>>>>
>>>> Using tcpdump, I find the replicas sending this request to the leader (the leader appears
>>>> to be active).
>>>>
>>>>
>>>> The exchange goes like this:
>>>>
>>>>
>>>> solr-220 is the leader.
>>>>
>>>> Solr-221 to Solr-220
>>>>
>>>>
>>>> 10:18:42.426823 IP solr-221:54341 > solr-220:8983:
>>>>
>>>>
>>>> POST /solr/journals_stage_shard1_replica1/update HTTP/1.1
>>>> Content-Type: application/x-www-form-urlencoded; charset=UTF-8
>>>> User-Agent: Solr[org.apache.solr.client.solrj.impl.HttpSolrClient] 1.0
>>>> Content-Length: 108
>>>> Host: solr-220:8983
>>>> Connection: Keep-Alive
>>>>
>>>>
>>>> commit_end_point=true&openSearcher=false&commit=true&softCommit=false&waitSearcher=true&wt=javabin&version=2
>>>>
>>>>
>>>> Solr-220 back to Solr-221
>>>>
>>>>
>>>> IP solr-220:8983 > solr-221:54341: Flags [P.], seq 1:5152, ack 385, win 235, options [nop,nop,
>>>> TS val 858155553 ecr 858107069], length 5151
>>>> ..HTTP/1.1 500 Server Error
>>>> Content-Type: application/octet-stream
>>>> Content-Length: 5060
>>>>
>>>>
>>>> .responseHeader..&statusT..%QTimeC.%error..#msg?.For input string: "1578578283947098112".%trace?.&java.lang.NumberFormatException: For
>>>> input string: "1578578283947098112"
>>>> at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
>>>> at java.lang.Integer.parseInt(Integer.java:583)
>>>> at java.lang.Integer.parseInt(Integer.java:615)
>>>> at org.apache.lucene.queries.function.docvalues.IntDocValues.getRangeScorer(IntDocValues.java:89)
>>>> at org.apache.solr.search.function.ValueSourceRangeFilter$1.iterator(ValueSourceRangeFilter.java:83)
>>>> at org.apache.solr.search.SolrConstantScoreQuery$ConstantWeight.scorer(SolrConstantScoreQuery.java:100)
>>>> at org.apache.lucene.search.Weight.scorerSupplier(Weight.java:126)
>>>> at org.apache.lucene.search.BooleanWeight.scorerSupplier(BooleanWeight.java:400)
>>>> at org.apache.lucene.search.BooleanWeight.scorer(BooleanWeight.java:381)
>>>> at org.apache.solr.update.DeleteByQueryWrapper$1.scorer(DeleteByQueryWrapper.java:90)
>>>> at org.apache.lucene.index.BufferedUpdatesStream.applyQueryDeletes(BufferedUpdatesStream.java:709)
>>>>
>>>> at org.apache.lucene.index.BufferedUpdatesStream.applyDeletesAndUpdates(BufferedUpdatesStream.java:267)
>>>>
>>>>
>

