From: "Josh Elser (JIRA)"
To: dev@accumulo.apache.org
Reply-To: dev@accumulo.apache.org
Date: Sun, 5 Aug 2012 22:00:02 +0000 (UTC)
Subject: [jira] [Commented] (ACCUMULO-716) Corrupt WAL file

    [ https://issues.apache.org/jira/browse/ACCUMULO-716?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13428916#comment-13428916 ]

Josh Elser commented on ACCUMULO-716:
-------------------------------------

Looking at my tserver's debug log, I can see that processing one of the WAL entries failed with the exception below:

{noformat}
05 17:09:05,604 [tabletserver.TabletServer] DEBUG: MultiScanSess null 0 entries in 0.00 secs (lookup_time:0.00 secs tablets:1 ranges:1)
05 17:09:05,706 [tabletserver.TabletServer] DEBUG: MultiScanSess null 0 entries in 0.00 secs (lookup_time:0.00 secs tablets:1 ranges:1)
05 17:09:05,707 [tabletserver.TabletServer] DEBUG: MultiScanSess null 0 entries in 0.00 secs (lookup_time:0.00 secs tablets:1 ranges:1)
05 17:09:05,728 [fs.FSInputChecker] INFO : Found checksum error: b[0, 512]=0000012c99d6b958000000000e080010011a08313936363039323400000001000000000764657461636f6c0000003f000000045445585400000009313900656e77696b690000000a616c6c7c656e77696b69010000012c99d6b958000000000e080010011a083139363630393234000000010000000006676e69766f6d0000003f000000045445585400000009313900656e77696b690000000a616c6c7c656e77696b69010000012c99d6b958000000000e080010011a083139363630393234000000010000000004657375660000003f000000045445585400000009313900656e77696b690000000a616c6c7c656e77696b69010000012c99d6b958000000000e080010011a08313936363039323400000001000000000565676e61720000003f000000045445585400000009313900656e77696b690000000a616c6c7c656e77696b69010000012c99d6b958000000000e080010011a08313936363039323400000001000000000665726f6665620000003f000000045445585400000009313900656e77696b690000000a616c6c7c656e77696b69010000012c99d6b958000000000e080010011a083139363630393234000000010000000004656e6f740000003f000000045445585400000009313900656e77696b690000000a616c6c7c656e77696b69010000012c99d6b958000000000e080010011a083139363630he.hadoop.fs.FSInputChecker.read(FSInputChecker.java:158)
        at org.apache.hadoop.hdfs.DFSClient$RemoteBlockReader.read(DFSClient.java:1385)
        at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.readBuffer(DFSClient.java:2121)
        at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:2173)
        at java.io.DataInputStream.readFully(DataInputStream.java:178)
        at java.io.DataInputStream.readFully(DataInputStream.java:152)
        at org.apache.accumulo.core.data.Mutation.readFields(Mutation.java:443)
        at org.apache.accumulo.server.logger.LogFileValue.readFields(LogFileValue.java:42)
        at org.apache.accumulo.server.tabletserver.log.LogSorter$LogProcessor.sort(LogSorter.java:122)
        at org.apache.accumulo.server.tabletserver.log.LogSorter$LogProcessor.process(LogSorter.java:87)
        at org.apache.accumulo.server.zookeeper.DistributedWorkQueue$1.run(DistributedWorkQueue.java:101)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at org.apache.accumulo.cloudtrace.instrument.TraceRunnable.run(TraceRunnable.java:47)
        at org.apache.accumulo.core.util.LoggingRunnable.run(LoggingRunnable.java:34)
        at java.lang.Thread.run(Thread.java:662)
05 17:09:05,729 [hdfs.DFSClient] WARN : Found Checksum error for blk_-7588599586768272616_79756 from 127.0.0.1:50010 at 1170472960
05 17:09:05,731 [hdfs.DFSClient] INFO : Could not obtain block blk_-7588599586768272616_79756 from any node: java.io.IOException: No live nodes contain current block. Will get new block locations from namenode and retry...] DEBUG: MultiScanSess null 0 entries in 0.00 secs (lookup_time:0.00 secs tablets:1 ranges:1)
05 17:09:05,810 [tabletserver.TabletServer] DEBUG: MultiScanSess null 0 entries in 0.00 secs (lookup_time:0.00 secs tablets:1 ranges:1)
05 17:09:05,913 [tabletserver.TabletServer] DEBUG: MultiScanSess null 0 entries in 0.00 secs (lookup_time:0.00 secs tablets:1 ranges:1)
...
...
05 17:09:08,805 [log.LogSorter] ERROR: org.apache.hadoop.fs.ChecksumException: Checksum error: /blk_-7588599586768272616:of:/accumulo-1.5/wal/127.0.0.1+9997/4e9df088-8c15-4499-bd9c-4a664cfe1a66 at 1170472960umException: Checksum error: /blk_-7588599586768272616:of:/accumulo-1.5/wal/127.0.0.1+9997/4e9df088-8c15-4499-bd9c-4a664cfe1a66 at 1170472960
        at org.apache.hadoop.fs.FSInputChecker.verifySum(FSInputChecker.java:277)
        at org.apache.hadoop.fs.FSInputChecker.readChecksumChunk(FSInputChecker.java:241)
        at org.apache.hadoop.fs.FSInputChecker.fill(FSInputChecker.java:176)
        at org.apache.hadoop.fs.FSInputChecker.read1(FSInputChecker.java:193)
        at org.apache.hadoop.fs.FSInputChecker.read(FSInputChecker.java:158)
        at org.apache.hadoop.hdfs.DFSClient$RemoteBlockReader.read(DFSClient.java:1385)
        at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.readBuffer(DFSClient.java:2121)
        at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:2173)
        at java.io.DataInputStream.readFully(DataInputStream.java:178)
        at java.io.DataInputStream.readFully(DataInputStream.java:152)
        at org.apache.accumulo.core.data.Mutation.readFields(Mutation.java:443)
        at org.apache.accumulo.server.logger.LogFileValue.readFields(LogFileValue.java:42)
        at org.apache.accumulo.server.tabletserver.log.LogSorter$LogProcessor.sort(LogSorter.java:122)
        at org.apache.accumulo.server.tabletserver.log.LogSorter$LogProcessor.process(LogSorter.java:87)
        at org.apache.accumulo.server.zookeeper.DistributedWorkQueue$1.run(DistributedWorkQueue.java:101)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at org.apache.accumulo.cloudtrace.instrument.TraceRunnable.run(TraceRunnable.java:47)
        at org.apache.accumulo.core.util.LoggingRunnable.run(LoggingRunnable.java:34)
        at java.lang.Thread.run(Thread.java:662)
{noformat}

The offending entry in HDFS is:

{noformat}
$ hadoop fs -ls /accumulo-1.5/wal/127.0.0.1+9997                (svn)-[trunk:1369523]
12/08/05 17:42:28 INFO security.UserGroupInformation: JAAS Configuration already set up for Hadoop, not re-installing.
Found 6 items
...
-rw-r--r--   2 elserj supergroup          0 2012-08-05 16:42 /accumulo-1.5/wal/127.0.0.1+9997/4e9df088-8c15-4499-bd9c-4a664cfe1a66
...
{noformat}

The fun part is that if I `hadoop fs -text` that file, there's definitely content in it. Making this slightly more confusing, I had moved some files from one dfs.data.dir to another dfs.data.dir to reclaim some space. In the spirit of full disclosure, I did something along the lines of the following (which, to my knowledge, shouldn't have messed anything up):

{noformat}
$ mv /dfs1/blocks/subdir44/* /dfs2/blocks/subdir44/
{noformat}

I also see that the block whose checksum failed in the stack trace above is still present on the drive whose partition filled up:

{noformat}
$ find /dfs* -name "blk_-7588599586768272616*"
/dfs1/blocksBeingWritten/blk_-7588599586768272616_79756.meta
/dfs1/blocksBeingWritten/blk_-7588599586768272616
{noformat}

I'm not sure if this is something inside Accumulo itself or an HDFS bug. Given past experience with ensuring that files in HDFS are fully written (closing the file, re-opening it, and seeking to the end), I thought there may be something Accumulo could do to prevent this from happening.

> Corrupt WAL file
> ----------------
>
>                 Key: ACCUMULO-716
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-716
>             Project: Accumulo
>          Issue Type: Bug
>          Components: tserver
>    Affects Versions: 1.5.0
>        Environment: java version "1.6.0_33", hadoop-0.20.2-cdh3u3
>            Reporter: Josh Elser
>            Assignee: Eric Newton
>
> Ran wikisearch-ingest. Ended up filling up a drive used by HDFS and things failed not-so-gracefully. Upon restart, log recovery started, appeared to finish (failed HDFS checksum on one WAL entry), and left Accumulo in a state where no tablets were assigned.
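[Editor's note] The "close the file, re-open it, seek to the end" check alluded to in the comment can be sketched roughly as follows. This is a hypothetical illustration, not Accumulo or Hadoop code: `WalLengthCheck` and `fullyReadable` are invented names, and it uses a local `java.io.RandomAccessFile` so the sketch stays self-contained. Against HDFS, the same pattern would presumably go through Hadoop's `FileSystem.open()` and the returned stream's `seek()`.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.RandomAccessFile;

// Hypothetical sketch (invented names, local-filesystem analogy):
// re-open a file after it was supposedly written and confirm its
// reported length is actually backed by readable data.
public class WalLengthCheck {

    // Returns true only if the file is non-empty and its last byte
    // can actually be read after re-opening.
    public static boolean fullyReadable(File f) {
        try (RandomAccessFile raf = new RandomAccessFile(f, "r")) {
            long len = raf.length();
            if (len == 0) {
                return false; // nothing visibly flushed, fail fast
            }
            raf.seek(len - 1);   // seek to the end...
            return raf.read() >= 0; // ...and read the final byte
        } catch (Exception e) {
            return false; // any I/O failure counts as "not fully written"
        }
    }

    public static void main(String[] args) throws Exception {
        File f = File.createTempFile("wal-sketch", ".tmp");
        f.deleteOnExit();
        try (FileOutputStream out = new FileOutputStream(f)) {
            out.write(new byte[] {1, 2, 3});
        }
        System.out.println(fullyReadable(f)); // prints true
    }
}
```

Note that the 0-length WAL shown by `hadoop fs -ls` above would fail such a check even though `hadoop fs -text` can still stream content, which is consistent with HDFS of that era only publishing a file's length on block boundaries or close.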
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira