Date: Thu, 23 Feb 2017 13:19:44 +0000 (UTC)
From: "Sean Busbey (JIRA)"
To: notifications@accumulo.apache.org
Reply-To: jira@apache.org
Subject: [jira] [Commented] (ACCUMULO-3182) Empty or partial WAL header blocks successful recovery


    [ https://issues.apache.org/jira/browse/ACCUMULO-3182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15880405#comment-15880405 ]

Sean Busbey commented on ACCUMULO-3182:
---------------------------------------

Workaround for folks hitting this on earlier releases: replace the empty or malformed WAL files with a "complete" empty WAL file.

On a cluster edge node with appropriate client configs, as a system user that owns the HDFS files for the Accumulo processes:

First, generate the file locally.

{code}
[accumulo@gateway-1 ~]$ echo -n -e '--- Log File Header (v2) ---\x00\x00\x00\x00' > empty.wal
{code}

Examine the file with a hex editor to ensure it is exactly the text "--- Log File Header (v2) ---" followed by four (4) 0x00 bytes.

Next, copy it into HDFS and then use hdfs dfs -mv to atomically put it in place.

{code}
[accumulo@gateway-1 ~]$ hdfs dfs -moveFromLocal empty.wal /user/accumulo/empty.wal
[accumulo@gateway-1 ~]$ hdfs dfs -mv /user/accumulo/empty.wal /accumulo/wal/tserver-4.example.com+10011/26abec5b-63e7-40dd-9fa1-b8ad2436606e
{code}

Presuming this has been failing for a while, the master will have backed off retrying (i.e. the "in : 300s" part of the recovery message). To avoid waiting for the next retry, you can restart the master process.

If there is more than one WAL, make a copy of the empty WAL file in HDFS for each one, then use the move command on each copy to put it in place where Accumulo is looking to do recovery.

> Empty or partial WAL header blocks successful recovery
> ------------------------------------------------------
>
>                 Key: ACCUMULO-3182
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-3182
>             Project: Accumulo
>          Issue Type: Bug
>          Components: tserver
>    Affects Versions: 1.6.1
>            Reporter: Josh Elser
>            Assignee: Josh Elser
>             Fix For: 1.6.2, 1.7.0
>
>         Attachments: 0001-ACCUMULO-3182-Gracefully-handles-incomplete-missing-.patch
>
>          Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> Haven't ever seen this one before. A replication IT failed -- looking into it, it was because the tserver that came up (after killing the original) failed to complete recovery. The below happened a few times before the test ultimately timed out.
> {noformat}
> 2014-09-29 04:46:10,259 [zookeeper.DistributedWorkQueue] DEBUG: Looking for work in /accumulo/f98e79c4-9dcd-4fb0-8ec9-5804f0818839/recovery
> 2014-09-29 04:46:10,340 [zookeeper.DistributedWorkQueue] DEBUG: got lock for af53bf1e-c293-463b-b4de-5efdb8b34962
> 2014-09-29 04:46:10,341 [log.LogSorter] DEBUG: Sorting file:/.../test/target/mini-tests/org.apache.accumulo.test.replication.UnorderedWorkAssignerReplicationIT_dataReplicatedToCorrectTableWithoutDrain/accumulo/wal/juno+49195/af53bf1e-c293-463b-b4de-5efdb8b34962 to file:/.../test/target/mini-tests/org.apache.accumulo.test.replication.UnorderedWorkAssignerReplicationIT_dataReplicatedToCorrectTableWithoutDrain/accumulo/recovery/af53bf1e-c293-463b-b4de-5efdb8b34962 using sortId af53bf1e-c293-463b-b4de-5efdb8b34962
> 2014-09-29 04:46:10,341 [log.LogSorter] INFO : Copying file:/var/lib/jenkins/home/jobs/Accumulo-Master-Integration-Tests/workspace/test/target/mini-tests/org.apache.accumulo.test.replication.UnorderedWorkAssignerReplicationIT_dataReplicatedToCorrectTableWithoutDrain/accumulo/wal/juno+49195/af53bf1e-c293-463b-b4de-5efdb8b34962 to file:/.../test/target/mini-tests/org.apache.accumulo.test.replication.UnorderedWorkAssignerReplicationIT_dataReplicatedToCorrectTableWithoutDrain/accumulo/recovery/af53bf1e-c293-463b-b4de-5efdb8b34962
> 2014-09-29 04:46:10,345 [log.LogSorter] ERROR: java.io.EOFException
> java.io.EOFException
>     at java.io.DataInputStream.readFully(DataInputStream.java:197)
>     at java.io.DataInputStream.readFully(DataInputStream.java:169)
>     at org.apache.accumulo.tserver.log.DfsLogger.readHeaderAndReturnStream(DfsLogger.java:282)
>     at org.apache.accumulo.tserver.log.LogSorter$LogProcessor.sort(LogSorter.java:113)
>     at org.apache.accumulo.tserver.log.LogSorter$LogProcessor.process(LogSorter.java:93)
>     at org.apache.accumulo.server.zookeeper.DistributedWorkQueue$1.run(DistributedWorkQueue.java:105)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>     at org.apache.accumulo.trace.instrument.TraceRunnable.run(TraceRunnable.java:47)
>     at org.apache.accumulo.core.util.LoggingRunnable.run(LoggingRunnable.java:34)
>     at java.lang.Thread.run(Thread.java:745)
> 2014-09-29 04:46:10,346 [log.LogSorter] ERROR: Error during cleanup sort/copy af53bf1e-c293-463b-b4de-5efdb8b34962
> java.lang.NullPointerException
>     at org.apache.accumulo.tserver.log.LogSorter$LogProcessor.close(LogSorter.java:183)
>     at org.apache.accumulo.tserver.log.LogSorter$LogProcessor.sort(LogSorter.java:151)
>     at org.apache.accumulo.tserver.log.LogSorter$LogProcessor.process(LogSorter.java:93)
>     at org.apache.accumulo.server.zookeeper.DistributedWorkQueue$1.run(DistributedWorkQueue.java:105)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>     at org.apache.accumulo.trace.instrument.TraceRunnable.run(TraceRunnable.java:47)
>     at org.apache.accumulo.core.util.LoggingRunnable.run(LoggingRunnable.java:34)
>     at java.lang.Thread.run(Thread.java:745)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
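
The hex-editor check in the workaround can also be scripted before anything is moved into HDFS. The following is a minimal sketch, assuming bash and the standard printf, head, tail, od, and wc utilities; the function names make_empty_wal and check_empty_wal are hypothetical helpers and not part of Accumulo. It writes the file with printf rather than echo -e, since echo's handling of escape sequences varies between shells.

```shell
#!/usr/bin/env bash
# Sketch only: build the placeholder WAL locally and check its bytes.
set -euo pipefail

# Header text quoted in the workaround comment.
header='--- Log File Header (v2) ---'

# make_empty_wal (hypothetical name): write the header followed by four NUL bytes.
make_empty_wal() {
  # '\0' in a printf format string emits a single 0x00 byte.
  printf '%s\0\0\0\0' "$header" > "$1"
}

# check_empty_wal (hypothetical name): succeed only if the file is exactly
# the header text followed by four 0x00 bytes.
check_empty_wal() {
  local file="$1"
  local expected_size=$(( ${#header} + 4 ))
  # Total size must be header length plus four.
  [ "$(wc -c < "$file" | tr -d ' ')" -eq "$expected_size" ] || return 1
  # Leading bytes must match the header text.
  [ "$(head -c "${#header}" "$file")" = "$header" ] || return 1
  # Trailing four bytes must all be zero (od prints them as hex pairs).
  [ "$(tail -c 4 "$file" | od -An -tx1 | tr -d ' \n')" = "00000000" ] || return 1
}

make_empty_wal empty.wal
check_empty_wal empty.wal && echo "empty.wal looks well-formed"
```

If the check fails, the file should be regenerated rather than copied into HDFS, since a malformed header is the condition this issue describes.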