accumulo-notifications mailing list archives

From "Josh Elser (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (ACCUMULO-3182) WAL repeatedly failed recovery due to NPE in IT
Date Mon, 29 Sep 2014 15:50:33 GMT

    [ https://issues.apache.org/jira/browse/ACCUMULO-3182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14151824#comment-14151824 ]

Josh Elser commented on ACCUMULO-3182:
--------------------------------------

bq. If we're 100% certain about it registering in the metadata table after it's opened

This I am not 100% sure about -- would need to go digging in code or get some verification
from others.

bq. then we just need to put in a sync as part of it opening.

As I said earlier, I believe that's insufficient. The tserver could still die mid-write
or mid-sync of the header, leaving an incomplete header that would require the
same downstream handling.
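
To make the downstream handling concrete, here is a minimal sketch, not the actual DfsLogger or LogSorter code. The class name, the header bytes, and the HeaderStatus enum are all hypothetical. It treats an EOFException while reading the header as "incomplete header" instead of a fatal error, and it guards cleanup against a resource that was never opened, which is one plausible way to end up with an NPE during cleanup.

```java
import java.io.ByteArrayInputStream;
import java.io.Closeable;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class WalHeaderCheck {
    // Hypothetical header bytes; the real WAL header format is not shown in this thread.
    static final byte[] MAGIC = "HYPOTHETICAL-WAL-HEADER".getBytes(StandardCharsets.UTF_8);

    enum HeaderStatus { COMPLETE, INCOMPLETE, CORRUPT }

    // Reads exactly MAGIC.length bytes and classifies the result.
    // A short file (writer died mid-write or mid-sync) is reported as INCOMPLETE
    // rather than propagating the EOFException to the caller.
    static HeaderStatus readHeader(InputStream raw) throws IOException {
        DataInputStream in = new DataInputStream(raw);
        byte[] buf = new byte[MAGIC.length];
        try {
            in.readFully(buf);
        } catch (EOFException e) {
            return HeaderStatus.INCOMPLETE;
        }
        return Arrays.equals(buf, MAGIC) ? HeaderStatus.COMPLETE : HeaderStatus.CORRUPT;
    }

    // Cleanup that tolerates a resource which was never opened (null).
    static void closeIfOpened(Closeable c) throws IOException {
        if (c != null) {
            c.close();
        }
    }

    public static void main(String[] args) throws IOException {
        // Simulate a header that was cut off halfway through the write.
        byte[] truncated = Arrays.copyOf(MAGIC, MAGIC.length / 2);
        InputStream in = new ByteArrayInputStream(truncated);
        Closeable output = null; // never opened, because the header read did not succeed
        try {
            System.out.println(readHeader(in));
        } finally {
            closeIfOpened(output);
            closeIfOpened(in);
        }
    }
}
```

With something along these lines, a WAL whose header was only partially written could be treated as containing no data, regardless of whether a sync is added on open.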

> WAL repeatedly failed recovery due to NPE in IT
> -----------------------------------------------
>
>                 Key: ACCUMULO-3182
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-3182
>             Project: Accumulo
>          Issue Type: Bug
>          Components: tserver
>            Reporter: Josh Elser
>            Assignee: Josh Elser
>             Fix For: 1.7.0
>
>
> Haven't ever seen this one before. A replication IT failed -- looking into it, it was
> because the tserver that came up (after killing the original) failed to complete recovery.
> The below happened a few times before the test ultimately timed out.
> {noformat}
> 2014-09-29 04:46:10,259 [zookeeper.DistributedWorkQueue] DEBUG: Looking for work in /accumulo/f98e79c4-9dcd-4fb0-8ec9-5804f0818839/recovery
> 2014-09-29 04:46:10,340 [zookeeper.DistributedWorkQueue] DEBUG: got lock for af53bf1e-c293-463b-b4de-5efdb8b34962
> 2014-09-29 04:46:10,341 [log.LogSorter] DEBUG: Sorting file:/.../test/target/mini-tests/org.apache.accumulo.test.replication.UnorderedWorkAssignerReplicationIT_dataReplicatedToCorrectTableWithoutDrain/accumulo/wal/juno+49195/af53bf1e-c293-463b-b4de-5efdb8b34962 to file:/.../test/target/mini-tests/org.apache.accumulo.test.replication.UnorderedWorkAssignerReplicationIT_dataReplicatedToCorrectTableWithoutDrain/accumulo/recovery/af53bf1e-c293-463b-b4de-5efdb8b34962 using sortId af53bf1e-c293-463b-b4de-5efdb8b34962
> 2014-09-29 04:46:10,341 [log.LogSorter] INFO : Copying file:/var/lib/jenkins/home/jobs/Accumulo-Master-Integration-Tests/workspace/test/target/mini-tests/org.apache.accumulo.test.replication.UnorderedWorkAssignerReplicationIT_dataReplicatedToCorrectTableWithoutDrain/accumulo/wal/juno+49195/af53bf1e-c293-463b-b4de-5efdb8b34962 to file:/.../test/target/mini-tests/org.apache.accumulo.test.replication.UnorderedWorkAssignerReplicationIT_dataReplicatedToCorrectTableWithoutDrain/accumulo/recovery/af53bf1e-c293-463b-b4de-5efdb8b34962
> 2014-09-29 04:46:10,345 [log.LogSorter] ERROR: java.io.EOFException
> java.io.EOFException
> 	at java.io.DataInputStream.readFully(DataInputStream.java:197)
> 	at java.io.DataInputStream.readFully(DataInputStream.java:169)
> 	at org.apache.accumulo.tserver.log.DfsLogger.readHeaderAndReturnStream(DfsLogger.java:282)
> 	at org.apache.accumulo.tserver.log.LogSorter$LogProcessor.sort(LogSorter.java:113)
> 	at org.apache.accumulo.tserver.log.LogSorter$LogProcessor.process(LogSorter.java:93)
> 	at org.apache.accumulo.server.zookeeper.DistributedWorkQueue$1.run(DistributedWorkQueue.java:105)
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> 	at org.apache.accumulo.trace.instrument.TraceRunnable.run(TraceRunnable.java:47)
> 	at org.apache.accumulo.core.util.LoggingRunnable.run(LoggingRunnable.java:34)
> 	at java.lang.Thread.run(Thread.java:745)
> 2014-09-29 04:46:10,346 [log.LogSorter] ERROR: Error during cleanup sort/copy af53bf1e-c293-463b-b4de-5efdb8b34962
> java.lang.NullPointerException
> 	at org.apache.accumulo.tserver.log.LogSorter$LogProcessor.close(LogSorter.java:183)
> 	at org.apache.accumulo.tserver.log.LogSorter$LogProcessor.sort(LogSorter.java:151)
> 	at org.apache.accumulo.tserver.log.LogSorter$LogProcessor.process(LogSorter.java:93)
> 	at org.apache.accumulo.server.zookeeper.DistributedWorkQueue$1.run(DistributedWorkQueue.java:105)
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> 	at org.apache.accumulo.trace.instrument.TraceRunnable.run(TraceRunnable.java:47)
> 	at org.apache.accumulo.core.util.LoggingRunnable.run(LoggingRunnable.java:34)
> 	at java.lang.Thread.run(Thread.java:745)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
