accumulo-notifications mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Sean Busbey (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (ACCUMULO-3182) Empty or partial WAL header blocks successful recovery
Date Thu, 23 Feb 2017 13:19:44 GMT

    [ https://issues.apache.org/jira/browse/ACCUMULO-3182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15880405#comment-15880405
] 

Sean Busbey commented on ACCUMULO-3182:
---------------------------------------

Work around for folks hitting this on earlier releases: replace the empty WAL or malformed
WAL files with a "complete" empty WAL file.

On a cluster edge node with appropriate client configs and a system user that owns the HDFS
files for the accumulo processes:

First generate the file locally.

{code}
[accumulo@gateway-1 ~]$ echo -n -e '--- Log File Header (v2) ---\x00\x00\x00\x00' > empty.wal
{code}

You should examine the file with a hex editor to ensure it is exactly the text "--- Log File
Header (v2) ---" followed by four (4) 0x00 bytes.

Now we copy it into hdfs and then use a hdfs dfs -mv to atomically put it in place.

{code}
[accumulo@gateway-1 ~]$ hdfs dfs -moveFromLocal empty.wal /user/accumulo/empty.wal
[accumulo@gateway-1 ~]$ hdfs dfs -mv /user/accumulo/empty.wal /accumulo/wal/tserver-4.example.com+10011/26abec5b-63e7-40dd-9fa1-b8ad2436606e
{code}

Presuming this has been failing for a while, the master will have backed off retrying (i.e.
the "in : 300s" part of the recovery message). To avoid waiting for the next retry, you can
restart the master process. If there is more than one WAL, you'll want to make a copy of the
empty WAL file in HDFS for each one, then use the move command on the copy to put it in place
for where Accumulo is looking to do recovery.


> Empty or partial WAL header blocks successful recovery
> ------------------------------------------------------
>
>                 Key: ACCUMULO-3182
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-3182
>             Project: Accumulo
>          Issue Type: Bug
>          Components: tserver
>    Affects Versions: 1.6.1
>            Reporter: Josh Elser
>            Assignee: Josh Elser
>             Fix For: 1.6.2, 1.7.0
>
>         Attachments: 0001-ACCUMULO-3182-Gracefully-handles-incomplete-missing-.patch
>
>          Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> Haven't ever seen this one before. A replication IT failed -- looking into it, it was
because the tserver that came up (after killing the original) failed to complete recovery.
The below happened a few times before the test ultimately timed out.
> {noformat}
> 2014-09-29 04:46:10,259 [zookeeper.DistributedWorkQueue] DEBUG: Looking for work in /accumulo/f98e79c4-9dcd-4fb0-8ec9-5804f0818839/recovery
> 2014-09-29 04:46:10,340 [zookeeper.DistributedWorkQueue] DEBUG: got lock for af53bf1e-c293-463b-b4de-5efdb8b34962
> 2014-09-29 04:46:10,341 [log.LogSorter] DEBUG: Sorting file:/.../test/target/mini-tests/org.apache.accumulo.test.replication.UnorderedWorkAssignerReplicationIT_dataReplicatedToCorrectTableWithoutDrain/accumulo/wal/juno+49195/af53bf1e-c293-463b-b4de-5efdb8b34962
to file:/.../test/target/mini-tests/org.apache.accumulo.test.replication.UnorderedWorkAssignerReplicationIT_dataReplicatedToCorrectTableWithoutDrain/accumulo/recovery/af53bf1e-c293-463b-b4de-5efdb8b34962
using sortId af53bf1e-c293-463b-b4de-5efdb8b34962
> 2014-09-29 04:46:10,341 [log.LogSorter] INFO : Copying file:/var/lib/jenkins/home/jobs/Accumulo-Master-Integration-Tests/workspace/test/target/mini-tests/org.apache.accumulo.test.replication.UnorderedWorkAssignerReplicationIT_dataReplicatedToCorrectTableWithoutDrain/accumulo/wal/juno+49195/af53bf1e-c293-463b-b4de-5efdb8b34962
to file:/.../test/target/mini-tests/org.apache.accumulo.test.replication.UnorderedWorkAssignerReplicationIT_dataReplicatedToCorrectTableWithoutDrain/accumulo/recovery/af53bf1e-c293-463b-b4de-5efdb8b34962
> 2014-09-29 04:46:10,345 [log.LogSorter] ERROR: java.io.EOFException
> java.io.EOFException
> 	at java.io.DataInputStream.readFully(DataInputStream.java:197)
> 	at java.io.DataInputStream.readFully(DataInputStream.java:169)
> 	at org.apache.accumulo.tserver.log.DfsLogger.readHeaderAndReturnStream(DfsLogger.java:282)
> 	at org.apache.accumulo.tserver.log.LogSorter$LogProcessor.sort(LogSorter.java:113)
> 	at org.apache.accumulo.tserver.log.LogSorter$LogProcessor.process(LogSorter.java:93)
> 	at org.apache.accumulo.server.zookeeper.DistributedWorkQueue$1.run(DistributedWorkQueue.java:105)
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> 	at org.apache.accumulo.trace.instrument.TraceRunnable.run(TraceRunnable.java:47)
> 	at org.apache.accumulo.core.util.LoggingRunnable.run(LoggingRunnable.java:34)
> 	at java.lang.Thread.run(Thread.java:745)
> 2014-09-29 04:46:10,346 [log.LogSorter] ERROR: Error during cleanup sort/copy af53bf1e-c293-463b-b4de-5efdb8b34962
> java.lang.NullPointerException
> 	at org.apache.accumulo.tserver.log.LogSorter$LogProcessor.close(LogSorter.java:183)
> 	at org.apache.accumulo.tserver.log.LogSorter$LogProcessor.sort(LogSorter.java:151)
> 	at org.apache.accumulo.tserver.log.LogSorter$LogProcessor.process(LogSorter.java:93)
> 	at org.apache.accumulo.server.zookeeper.DistributedWorkQueue$1.run(DistributedWorkQueue.java:105)
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> 	at org.apache.accumulo.trace.instrument.TraceRunnable.run(TraceRunnable.java:47)
> 	at org.apache.accumulo.core.util.LoggingRunnable.run(LoggingRunnable.java:34)
> 	at java.lang.Thread.run(Thread.java:745)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

Mime
View raw message