accumulo-notifications mailing list archives

From "Josh Elser (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (ACCUMULO-3182) Empty or partial WAL header blocks successful recovery
Date Thu, 02 Oct 2014 22:56:34 GMT

    [ https://issues.apache.org/jira/browse/ACCUMULO-3182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14157353#comment-14157353 ]

Josh Elser commented on ACCUMULO-3182:
--------------------------------------

It appears that I incorrectly assumed that 1.5 would also have this issue in the first place.
As best I can tell, it does not. It seems like this was introduced with some of the crypto-related
changes.

In 1.5, if we can't read the header or there is no header (was this previously the case for
WALs?), we return an InputStream seeked to position 0. The LogSorter then attempts to read
pairs of LogFileKeys and LogFileValues, and that read handles an EOFException. The catch block
also creates the empty MapFile, so recovery completes gracefully.
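
To make the 1.5 behavior described above concrete, here is a rough, self-contained sketch. It is not the actual DfsLogger or LogSorter code: the class name, the HEADER bytes, and the use of int pairs in place of LogFileKey/LogFileValue are all assumptions made for illustration, and the returned list stands in for the MapFile written during the sort.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class WalRecoverySketch {
    // Hypothetical header bytes, NOT the real Accumulo WAL header.
    static final byte[] HEADER = "--- Sketch Header ---".getBytes(StandardCharsets.UTF_8);

    // Returns a stream positioned after the header when the header is present;
    // otherwise (missing, partial, or unrecognized header) a stream at offset 0.
    static DataInputStream readHeaderOrRewind(byte[] file) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(file));
        byte[] buf = new byte[HEADER.length];
        try {
            in.readFully(buf);
            if (Arrays.equals(buf, HEADER)) {
                return in;
            }
        } catch (EOFException e) {
            // Empty or partial header: fall through and start over at offset 0.
        }
        return new DataInputStream(new ByteArrayInputStream(file));
    }

    // Reads (key, value) pairs until EOF. An EOFException ends the loop instead
    // of failing the sort, so an empty log produces an empty (but valid) result.
    static List<int[]> sort(byte[] file) throws IOException {
        List<int[]> pairs = new ArrayList<>();
        try (DataInputStream in = readHeaderOrRewind(file)) {
            while (true) {
                int key = in.readInt();   // stand-in for a LogFileKey
                int value = in.readInt(); // stand-in for a LogFileValue
                pairs.add(new int[] {key, value});
            }
        } catch (EOFException e) {
            // End of log, possibly before the first pair; keep what was read.
        }
        return pairs;
    }
}
```

In this sketch, an empty file or a file holding only a few header bytes yields an empty result rather than an error, which is the graceful outcome described for 1.5. A partial header of eight bytes or more would be misread as a pair here; whether the real code has an equivalent corner case is not something this sketch establishes.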

> Empty or partial WAL header blocks successful recovery
> ------------------------------------------------------
>
>                 Key: ACCUMULO-3182
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-3182
>             Project: Accumulo
>          Issue Type: Bug
>          Components: tserver
>    Affects Versions: 1.6.1
>            Reporter: Josh Elser
>            Assignee: Josh Elser
>             Fix For: 1.6.2, 1.7.0
>
>         Attachments: 0001-ACCUMULO-3182-Gracefully-handles-incomplete-missing-.patch
>
>
> Haven't ever seen this one before. A replication IT failed -- looking into it, it was
> because the tserver that came up (after killing the original) failed to complete recovery.
> The below happened a few times before the test ultimately timed out.
> {noformat}
> 2014-09-29 04:46:10,259 [zookeeper.DistributedWorkQueue] DEBUG: Looking for work in /accumulo/f98e79c4-9dcd-4fb0-8ec9-5804f0818839/recovery
> 2014-09-29 04:46:10,340 [zookeeper.DistributedWorkQueue] DEBUG: got lock for af53bf1e-c293-463b-b4de-5efdb8b34962
> 2014-09-29 04:46:10,341 [log.LogSorter] DEBUG: Sorting file:/.../test/target/mini-tests/org.apache.accumulo.test.replication.UnorderedWorkAssignerReplicationIT_dataReplicatedToCorrectTableWithoutDrain/accumulo/wal/juno+49195/af53bf1e-c293-463b-b4de-5efdb8b34962 to file:/.../test/target/mini-tests/org.apache.accumulo.test.replication.UnorderedWorkAssignerReplicationIT_dataReplicatedToCorrectTableWithoutDrain/accumulo/recovery/af53bf1e-c293-463b-b4de-5efdb8b34962 using sortId af53bf1e-c293-463b-b4de-5efdb8b34962
> 2014-09-29 04:46:10,341 [log.LogSorter] INFO : Copying file:/var/lib/jenkins/home/jobs/Accumulo-Master-Integration-Tests/workspace/test/target/mini-tests/org.apache.accumulo.test.replication.UnorderedWorkAssignerReplicationIT_dataReplicatedToCorrectTableWithoutDrain/accumulo/wal/juno+49195/af53bf1e-c293-463b-b4de-5efdb8b34962 to file:/.../test/target/mini-tests/org.apache.accumulo.test.replication.UnorderedWorkAssignerReplicationIT_dataReplicatedToCorrectTableWithoutDrain/accumulo/recovery/af53bf1e-c293-463b-b4de-5efdb8b34962
> 2014-09-29 04:46:10,345 [log.LogSorter] ERROR: java.io.EOFException
> java.io.EOFException
> 	at java.io.DataInputStream.readFully(DataInputStream.java:197)
> 	at java.io.DataInputStream.readFully(DataInputStream.java:169)
> 	at org.apache.accumulo.tserver.log.DfsLogger.readHeaderAndReturnStream(DfsLogger.java:282)
> 	at org.apache.accumulo.tserver.log.LogSorter$LogProcessor.sort(LogSorter.java:113)
> 	at org.apache.accumulo.tserver.log.LogSorter$LogProcessor.process(LogSorter.java:93)
> 	at org.apache.accumulo.server.zookeeper.DistributedWorkQueue$1.run(DistributedWorkQueue.java:105)
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> 	at org.apache.accumulo.trace.instrument.TraceRunnable.run(TraceRunnable.java:47)
> 	at org.apache.accumulo.core.util.LoggingRunnable.run(LoggingRunnable.java:34)
> 	at java.lang.Thread.run(Thread.java:745)
> 2014-09-29 04:46:10,346 [log.LogSorter] ERROR: Error during cleanup sort/copy af53bf1e-c293-463b-b4de-5efdb8b34962
> java.lang.NullPointerException
> 	at org.apache.accumulo.tserver.log.LogSorter$LogProcessor.close(LogSorter.java:183)
> 	at org.apache.accumulo.tserver.log.LogSorter$LogProcessor.sort(LogSorter.java:151)
> 	at org.apache.accumulo.tserver.log.LogSorter$LogProcessor.process(LogSorter.java:93)
> 	at org.apache.accumulo.server.zookeeper.DistributedWorkQueue$1.run(DistributedWorkQueue.java:105)
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> 	at org.apache.accumulo.trace.instrument.TraceRunnable.run(TraceRunnable.java:47)
> 	at org.apache.accumulo.core.util.LoggingRunnable.run(LoggingRunnable.java:34)
> 	at java.lang.Thread.run(Thread.java:745)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
