lucene-solr-dev mailing list archives

From "Artem Russakovskii (JIRA)" <j...@apache.org>
Subject [jira] Issue Comment Edited: (SOLR-1458) Java Replication error: NullPointerException SEVERE: SnapPull failed on 2009-09-22 nightly
Date Fri, 25 Sep 2009 22:34:16 GMT

    [ https://issues.apache.org/jira/browse/SOLR-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12759784#action_12759784 ]

Artem Russakovskii edited comment on SOLR-1458 at 9/25/09 3:32 PM:
-------------------------------------------------------------------

I haven't changed any configs yet, and this probably doesn't come as a shock to you guys,
but the master just ran out of space. Upon inspection, I found 30+ snapshot dirs sitting around
in /data.

Paul, adding your deletionPolicy fix didn't delete the files, even after an optimize. Is that
expected?

{code}
drwxrwxr-x  2 bla bla  4096 Sep 23 18:42 snapshot.20090923064214
drwxrwxr-x  2 bla bla  4096 Sep 23 19:15 snapshot.20090923071530
drwxrwxr-x  2 bla bla  4096 Sep 23 19:45 snapshot.20090923074535
drwxrwxr-x  2 bla bla  4096 Sep 23 20:15 snapshot.20090923081531
drwxrwxr-x  2 bla bla  4096 Sep 23 21:15 snapshot.20090923091531
drwxrwxr-x  2 bla bla  4096 Sep 23 22:15 snapshot.20090923101532
drwxrwxr-x  2 bla bla  4096 Sep 23 23:15 snapshot.20090923111533
drwxrwxr-x  2 bla bla  4096 Sep 24 01:15 snapshot.20090924011501
drwxrwxr-x  2 bla bla  4096 Sep 24 13:15 snapshot.20090924011535
drwxrwxr-x  2 bla bla  4096 Sep 24 02:15 snapshot.20090924021501
drwxrwxr-x  2 bla bla  4096 Sep 24 14:15 snapshot.20090924021534
drwxrwxr-x  2 bla bla  4096 Sep 24 15:15 snapshot.20090924031501
drwxrwxr-x  2 bla bla  4096 Sep 24 03:15 snapshot.20090924031502
drwxrwxr-x  2 bla bla  4096 Sep 24 04:15 snapshot.20090924041501
drwxrwxr-x  2 bla bla  4096 Sep 24 16:15 snapshot.20090924041536
drwxrwxr-x  2 bla bla  4096 Sep 24 05:15 snapshot.20090924051501
drwxrwxr-x  2 bla bla  4096 Sep 24 17:15 snapshot.20090924051537
drwxrwxr-x  2 bla bla  4096 Sep 24 06:15 snapshot.20090924061501
drwxrwxr-x  2 bla bla  4096 Sep 24 18:15 snapshot.20090924061534
drwxrwxr-x  2 bla bla  4096 Sep 24 07:15 snapshot.20090924071501
drwxrwxr-x  2 bla bla  4096 Sep 24 19:15 snapshot.20090924071533
drwxrwxr-x  2 bla bla  4096 Sep 24 08:15 snapshot.20090924081534
drwxrwxr-x  2 bla bla  4096 Sep 24 20:15 snapshot.20090924081535
drwxrwxr-x  2 bla bla  4096 Sep 24 09:15 snapshot.20090924091501
drwxrwxr-x  2 bla bla  4096 Sep 24 21:15 snapshot.20090924091532
drwxrwxr-x  2 bla bla  4096 Sep 24 10:15 snapshot.20090924101501
drwxrwxr-x  2 bla bla  4096 Sep 24 22:15 snapshot.20090924101533
drwxrwxr-x  2 bla bla  4096 Sep 24 11:15 snapshot.20090924111501
drwxrwxr-x  2 bla bla  4096 Sep 24 23:15 snapshot.20090924111532
drwxrwxr-x  2 bla bla  4096 Sep 24 12:15 snapshot.20090924121532
drwxrwxr-x  2 bla bla  4096 Sep 24 00:15 snapshot.20090924121533
drwxrwxr-x  2 bla bla  4096 Sep 25 01:15 snapshot.20090925011533
drwxrwxr-x  2 bla bla  4096 Sep 25 13:15 snapshot.20090925011540
drwxrwxr-x  2 bla bla  4096 Sep 25 02:15 snapshot.20090925021534
drwxrwxr-x  2 bla bla  4096 Sep 25 14:15 snapshot.20090925021540
drwxrwxr-x  2 bla bla  4096 Sep 25 03:15 snapshot.20090925031535
drwxrwxr-x  2 bla bla  4096 Sep 25 15:15 snapshot.20090925031540
drwxrwxr-x  2 bla bla  4096 Sep 25 15:29 snapshot.20090925032931
drwxrwxr-x  2 bla bla  4096 Sep 25 04:15 snapshot.20090925041535
drwxrwxr-x  2 bla bla  4096 Sep 25 05:15 snapshot.20090925051539
drwxrwxr-x  2 bla bla  4096 Sep 25 06:15 snapshot.20090925061538
drwxrwxr-x  2 bla bla  4096 Sep 25 07:15 snapshot.20090925071539
drwxrwxr-x  2 bla bla  4096 Sep 25 08:15 snapshot.20090925081539
drwxrwxr-x  2 bla bla  4096 Sep 25 09:15 snapshot.20090925091538
drwxrwxr-x  2 bla bla  4096 Sep 25 09:52 snapshot.20090925095213
drwxrwxr-x  2 bla bla  4096 Sep 25 10:15 snapshot.20090925101540
drwxrwxr-x  2 bla bla  4096 Sep 25 11:15 snapshot.20090925111538
drwxrwxr-x  2 bla bla  4096 Sep 25 00:15 snapshot.20090925121534
drwxrwxr-x  2 bla bla  4096 Sep 25 12:15 snapshot.20090925121538
{code}
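
As a stopgap while the deletion policy question is open, the directories listed above could be pruned by hand. Here is a minimal sketch in Python, not anything Solr ships: `prune_snapshots`, `keep`, and `dry_run` are hypothetical names made up for illustration. It orders by modification time rather than by name, because the names in the listing seem to use a 12-hour clock (snapshot.20090924011535 has an mtime of 13:15), so sorting by name would not be chronological.

```python
import shutil
from pathlib import Path


def snapshots_oldest_first(data_dir):
    """Return the snapshot.* directories under data_dir, oldest mtime first."""
    dirs = [p for p in Path(data_dir).glob("snapshot.*") if p.is_dir()]
    return sorted(dirs, key=lambda p: p.stat().st_mtime)


def prune_snapshots(data_dir, keep=3, dry_run=True):
    """Select all but the `keep` most recently modified snapshot dirs.

    Returns the directories selected for deletion. They are only removed
    from disk when dry_run is False.
    """
    if keep < 0:
        raise ValueError("keep must be >= 0")
    ordered = snapshots_oldest_first(data_dir)
    doomed = ordered[: max(0, len(ordered) - keep)]
    for path in doomed:
        if not dry_run:
            shutil.rmtree(path)
    return doomed
```

Calling `prune_snapshots("/data", keep=3)` only reports what would go; passing `dry_run=False` actually removes the directories. Anything that is not a `snapshot.*` directory is left alone.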

> Java Replication error: NullPointerException SEVERE: SnapPull failed on 2009-09-22 nightly
> ------------------------------------------------------------------------------------------
>
>                 Key: SOLR-1458
>                 URL: https://issues.apache.org/jira/browse/SOLR-1458
>             Project: Solr
>          Issue Type: Bug
>          Components: replication (java)
>    Affects Versions: 1.4
>         Environment: CentOS x64
> 8GB RAM
> Tomcat, running with 7G max memory; memory usage is <2GB, so it's not the problem
> Host a: master
> Host b: slave
> Multiple single core Solr instances, using JNDI.
> Java replication
>            Reporter: Artem Russakovskii
>            Assignee: Noble Paul
>             Fix For: 1.4
>
>         Attachments: SOLR-1458.patch, SOLR-1458.patch, SOLR-1458.patch, SOLR-1458.patch, SolrDeletionPolicy.patch, SolrDeletionPolicy.patch
>
>
> After finally figuring out the new Java-based replication, we started both the slave
> and the master and issued an optimize to all master Solr instances. This triggered some
> replication that went through just fine, but it looks like some of it is failing.
> Here's what I'm getting in the slave logs, repeatedly for each shard:
> {code} 
> SEVERE: SnapPull failed 
> java.lang.NullPointerException
>         at org.apache.solr.handler.SnapPuller.fetchLatestIndex(SnapPuller.java:271)
>         at org.apache.solr.handler.ReplicationHandler.doFetch(ReplicationHandler.java:258)
>         at org.apache.solr.handler.SnapPuller$1.run(SnapPuller.java:159)
>         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
>         at java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:317)
>         at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:150)
>         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$101(ScheduledThreadPoolExecutor.java:98)
>         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.runPeriodic(ScheduledThreadPoolExecutor.java:181)
>         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:205)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>         at java.lang.Thread.run(Thread.java:619)
> {code} 
> If I issue an optimize again on the master for one of the shards, it triggers a replication
> that completes OK. I have a feeling that these SnapPull failures appear later on, but right
> now I don't have enough data to see a pattern.
> Here's replication.properties on one of the failed slave instances.
> {code}
> cat data/replication.properties 
> #Replication details
> #Wed Sep 23 19:35:30 PDT 2009
> replicationFailedAtList=1253759730020,1253759700018,1253759670019,1253759640018,1253759610018,1253759580022,1253759550019,1253759520016,1253759490026,1253759460016
> previousCycleTimeInSeconds=0
> timesFailed=113
> indexReplicatedAtList=1253759730020,1253759700018,1253759670019,1253759640018,1253759610018,1253759580022,1253759550019,1253759520016,1253759490026,1253759460016
> indexReplicatedAt=1253759730020
> replicationFailedAt=1253759730020
> lastCycleBytesDownloaded=0
> timesIndexReplicated=113
> {code}
> and another
> {code}
> cat data/replication.properties 
> #Replication details
> #Wed Sep 23 18:42:01 PDT 2009
> replicationFailedAtList=1253756490034,1253756460169
> previousCycleTimeInSeconds=1
> timesFailed=2
> indexReplicatedAtList=1253756521284,1253756490034,1253756460169
> indexReplicatedAt=1253756521284
> replicationFailedAt=1253756490034
> lastCycleBytesDownloaded=22932293
> timesIndexReplicated=3
> {code}
> Some relevant configs:
> In solrconfig.xml:
> {code}
> <!-- For docs see http://wiki.apache.org/solr/SolrReplication -->
>   <requestHandler name="/replication" class="solr.ReplicationHandler" >
>     <lst name="master">
>         <str name="enable">${enable.master:false}</str>
>         <str name="replicateAfter">optimize</str>
>         <str name="backupAfter">optimize</str>
>         <str name="commitReserveDuration">00:00:20</str>
>     </lst>
>     <lst name="slave">
>         <str name="enable">${enable.slave:false}</str>
>         <!-- url of master, from properties file -->
>         <str name="masterUrl">${master.url}</str>
>         <!-- how often to check master -->
>         <str name="pollInterval">00:00:30</str>
>     </lst>
>   </requestHandler>
> {code}
> The slave then has this in solrcore.properties:
> {code}
> enable.slave=true
> master.url=URLOFMASTER/replication
> {code}
> and the master has
> {code}
> enable.master=true
> {code}
> I'd be glad to provide more details, but I'm not sure what else I can do. SOLR-926 may
> be relevant.
> Thanks.
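
The timestamps in the quoted replication.properties files look like epoch milliseconds, newest first. A rough Python sketch for reading them is below. The function names are hypothetical, and treating entries that appear in indexReplicatedAtList but not in replicationFailedAtList as successful pulls is an assumption drawn only from the two files above, not from the Solr source.

```python
from datetime import datetime, timezone


def parse_properties(text):
    """Parse simple key=value lines, skipping blank lines and '#' comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def millis_list(props, key):
    """Return a comma-separated epoch-millisecond property as a list of ints."""
    return [int(v) for v in props.get(key, "").split(",") if v]


def to_utc(ms):
    """Convert epoch milliseconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(ms / 1000, tz=timezone.utc)


def summarize(text):
    """Summarize failures and (assumed) successes from replication.properties."""
    props = parse_properties(text)
    failed = millis_list(props, "replicationFailedAtList")
    replicated = millis_list(props, "indexReplicatedAtList")
    # Seconds between consecutive failures; the lists are newest first.
    gaps = [round((a - b) / 1000) for a, b in zip(failed, failed[1:])]
    return {
        "last_failure_utc": to_utc(failed[0]) if failed else None,
        "failure_gaps_s": gaps,
        # Assumption: an attempt that is not also listed as failed succeeded.
        "successes": sorted(set(replicated) - set(failed), reverse=True),
    }
```

Run against the first file, the failure gaps all round to 30 seconds, which matches the 00:00:30 pollInterval, and the two lists are identical, so under the assumption above none of the recorded attempts succeeded. The second file has one entry, 1253756521284, that appears only in indexReplicatedAtList.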

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

