From: "Josh Elser (JIRA)"
To: dev@accumulo.apache.org
Reply-To: dev@accumulo.apache.org
Date: Sun, 5 Aug 2012 22:00:02 +0000 (UTC)
Subject: [jira] [Commented] (ACCUMULO-716) Corrupt WAL file

    [ https://issues.apache.org/jira/browse/ACCUMULO-716?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13428916#comment-13428916 ]

Josh Elser commented on ACCUMULO-716:
-------------------------------------

Looking at my tserver's debug log, I can see that processing one of the WAL entries failed with the exception below:

{noformat}
05 17:09:05,604 [tabletserver.TabletServer] DEBUG: MultiScanSess null 0 entries in 0.00 secs (lookup_time:0.00 secs tablets:1 ranges:1)
05 17:09:05,706 [tabletserver.TabletServer] DEBUG: MultiScanSess null 0 entries in 0.00 secs (lookup_time:0.00 secs tablets:1 ranges:1)
05 17:09:05,707 [tabletserver.TabletServer] DEBUG: MultiScanSess null 0 entries in 0.00 secs (lookup_time:0.00 secs tablets:1 ranges:1)
05 17:09:05,728 [fs.FSInputChecker] INFO : Found checksum error: b[0, 512]=0000012c99d6b958000000000e080010011a08313936363039323400000001000000000764657461636f6c0000003f000000045445585400000009313900656e77696b690000000a616c6c7c656e77696b69010000012c99d6b958000000000e080010011a083139363630393234000000010000000006676e69766f6d0000003f000000045445585400000009313900656e77696b690000000a616c6c7c656e77696b69010000012c99d6b958000000000e080010011a083139363630393234000000010000000004657375660000003f000000045445585400000009313900656e77696b690000000a616c6c7c656e77696b69010000012c99d6b958000000000e080010011a08313936363039323400000001000000000565676e61720000003f000000045445585400000009313900656e77696b690000000a616c6c7c656e77696b69010000012c99d6b958000000000e080010011a08313936363039323400000001000000000665726f6665620000003f000000045445585400000009313900656e77696b690000000a616c6c7c656e77696b69010000012c99d6b958000000000e080010011a083139363630393234000000010000000004656e6f740000003f000000045445585400000009313900656e77696b690000000a616c6c7c656e77696b69010000012c99d6b958000000000e080010011a083139363630he.hadoop.fs.FSInputChecker.read(FSInputChecker.java:158)
        at org.apache.hadoop.hdfs.DFSClient$RemoteBlockReader.read(DFSClient.java:1385)
        at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.readBuffer(DFSClient.java:2121)
        at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:2173)
        at java.io.DataInputStream.readFully(DataInputStream.java:178)
        at java.io.DataInputStream.readFully(DataInputStream.java:152)
        at org.apache.accumulo.core.data.Mutation.readFields(Mutation.java:443)
        at org.apache.accumulo.server.logger.LogFileValue.readFields(LogFileValue.java:42)
        at org.apache.accumulo.server.tabletserver.log.LogSorter$LogProcessor.sort(LogSorter.java:122)
        at org.apache.accumulo.server.tabletserver.log.LogSorter$LogProcessor.process(LogSorter.java:87)
        at org.apache.accumulo.server.zookeeper.DistributedWorkQueue$1.run(DistributedWorkQueue.java:101)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at org.apache.accumulo.cloudtrace.instrument.TraceRunnable.run(TraceRunnable.java:47)
        at org.apache.accumulo.core.util.LoggingRunnable.run(LoggingRunnable.java:34)
        at java.lang.Thread.run(Thread.java:662)
05 17:09:05,729 [hdfs.DFSClient] WARN : Found Checksum error for blk_-7588599586768272616_79756 from 127.0.0.1:50010 at 1170472960
05 17:09:05,731 [hdfs.DFSClient] INFO : Could not obtain block blk_-7588599586768272616_79756 from any node: java.io.IOException: No live nodes contain current block. Will get new block locations from namenode and retry...] DEBUG: MultiScanSess null 0 entries in 0.00 secs (lookup_time:0.00 secs tablets:1 ranges:1)
05 17:09:05,810 [tabletserver.TabletServer] DEBUG: MultiScanSess null 0 entries in 0.00 secs (lookup_time:0.00 secs tablets:1 ranges:1)
05 17:09:05,913 [tabletserver.TabletServer] DEBUG: MultiScanSess null 0 entries in 0.00 secs (lookup_time:0.00 secs tablets:1 ranges:1)
...
...
05 17:09:08,805 [log.LogSorter] ERROR: org.apache.hadoop.fs.ChecksumException: Checksum error: /blk_-7588599586768272616:of:/accumulo-1.5/wal/127.0.0.1+9997/4e9df088-8c15-4499-bd9c-4a664cfe1a66 at 1170472960umException: Checksum error: /blk_-7588599586768272616:of:/accumulo-1.5/wal/127.0.0.1+9997/4e9df088-8c15-4499-bd9c-4a664cfe1a66 at 1170472960
        at org.apache.hadoop.fs.FSInputChecker.verifySum(FSInputChecker.java:277)
        at org.apache.hadoop.fs.FSInputChecker.readChecksumChunk(FSInputChecker.java:241)
        at org.apache.hadoop.fs.FSInputChecker.fill(FSInputChecker.java:176)
        at org.apache.hadoop.fs.FSInputChecker.read1(FSInputChecker.java:193)
        at org.apache.hadoop.fs.FSInputChecker.read(FSInputChecker.java:158)
        at org.apache.hadoop.hdfs.DFSClient$RemoteBlockReader.read(DFSClient.java:1385)
        at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.readBuffer(DFSClient.java:2121)
        at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:2173)
        at java.io.DataInputStream.readFully(DataInputStream.java:178)
        at java.io.DataInputStream.readFully(DataInputStream.java:152)
        at org.apache.accumulo.core.data.Mutation.readFields(Mutation.java:443)
        at org.apache.accumulo.server.logger.LogFileValue.readFields(LogFileValue.java:42)
        at org.apache.accumulo.server.tabletserver.log.LogSorter$LogProcessor.sort(LogSorter.java:122)
        at org.apache.accumulo.server.tabletserver.log.LogSorter$LogProcessor.process(LogSorter.java:87)
        at org.apache.accumulo.server.zookeeper.DistributedWorkQueue$1.run(DistributedWorkQueue.java:101)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at org.apache.accumulo.cloudtrace.instrument.TraceRunnable.run(TraceRunnable.java:47)
        at org.apache.accumulo.core.util.LoggingRunnable.run(LoggingRunnable.java:34)
        at java.lang.Thread.run(Thread.java:662)
{noformat}

The offending entry in HDFS is:

{noformat}
$ hadoop fs -ls /accumulo-1.5/wal/127.0.0.1+9997                (svn)-[trunk:1369523]
12/08/05 17:42:28 INFO security.UserGroupInformation: JAAS Configuration already set up for Hadoop, not re-installing.
Found 6 items
...
-rw-r--r--   2 elserj supergroup          0 2012-08-05 16:42 /accumulo-1.5/wal/127.0.0.1+9997/4e9df088-8c15-4499-bd9c-4a664cfe1a66
...
{noformat}

The fun part is that if I `hadoop fs -text` that file, there's definitely content in it. Making this slightly more confusing, I had moved some files from one dfs.data.dir to another dfs.data.dir to reclaim some space. In the spirit of full disclosure, I did something along the lines of the following (which, to my knowledge, shouldn't have messed anything up):

{noformat}
$ mv /dfs1/blocks/subdir44/* /dfs2/blocks/subdir44/
{noformat}

I also see that the block whose checksum failed in the stack trace above is still present on the drive whose partition filled up:

{noformat}
$ find /dfs* -name "blk_-7588599586768272616*"
/dfs1/blocksBeingWritten/blk_-7588599586768272616_79756.meta
/dfs1/blocksBeingWritten/blk_-7588599586768272616
{noformat}

I'm not sure if this is something inside Accumulo itself or an HDFS bug. Given past experience with ensuring that files in HDFS are fully written (closing the file, re-opening it, and seeking to the end), I thought there may be something Accumulo could do to prevent this from happening.

> Corrupt WAL file
> ----------------
>
>                 Key: ACCUMULO-716
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-716
>             Project: Accumulo
>          Issue Type: Bug
>          Components: tserver
>    Affects Versions: 1.5.0
>        Environment: java version "1.6.0_33", hadoop-0.20.2-cdh3u3
>            Reporter: Josh Elser
>            Assignee: Eric Newton
>
> Ran wikisearch-ingest. Ended up filling up a drive used by HDFS and things failed not-so-gracefully. Upon restart, log recovery started, appeared to finish (failed HDFS checksum on one WAL entry), and left Accumulo in a state where no tablets were assigned.
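[Editor's note] The "close the file, re-open it, seek to the end" check alluded to in the comment can be sketched roughly as follows. This is a hypothetical illustration, not Accumulo or Hadoop code: `WalLengthCheck` and `fullyReadable` are invented names, and it uses a local `java.io.RandomAccessFile` so the sketch stays self-contained. Against HDFS, the same pattern would presumably go through Hadoop's `FileSystem.open()` and the returned stream's `seek()`.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.RandomAccessFile;

// Hypothetical sketch (invented names, local-filesystem analogy):
// re-open a file after it was supposedly written and confirm its
// reported length is actually backed by readable data.
public class WalLengthCheck {

    // Returns true only if the file is non-empty and its last byte
    // can actually be read after re-opening.
    public static boolean fullyReadable(File f) {
        try (RandomAccessFile raf = new RandomAccessFile(f, "r")) {
            long len = raf.length();
            if (len == 0) {
                return false; // nothing visibly flushed, fail fast
            }
            raf.seek(len - 1);   // seek to the end...
            return raf.read() >= 0; // ...and read the final byte
        } catch (Exception e) {
            return false; // any I/O failure counts as "not fully written"
        }
    }

    public static void main(String[] args) throws Exception {
        File f = File.createTempFile("wal-sketch", ".tmp");
        f.deleteOnExit();
        try (FileOutputStream out = new FileOutputStream(f)) {
            out.write(new byte[] {1, 2, 3});
        }
        System.out.println(fullyReadable(f)); // prints true
    }
}
```

Note that the 0-length WAL shown by `hadoop fs -ls` above would fail such a check even though `hadoop fs -text` can still stream content, which is consistent with HDFS of that era only publishing a file's length on block boundaries or close.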
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira