accumulo-notifications mailing list archives

From "Michael Allen (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (ACCUMULO-1998) Exception while attempting to bring up 1.6.0 TabletServer after moderate amount of work
Date Wed, 11 Dec 2013 15:57:11 GMT

    [ https://issues.apache.org/jira/browse/ACCUMULO-1998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13845498#comment-13845498
] 

Michael Allen commented on ACCUMULO-1998:
-----------------------------------------

Here's a more precise breakdown.  The exception now looks like this:

{noformat}
2013-12-11 15:46:30,871 [thrift.ProcessFunction] ERROR: Internal error processing startMultiScan
java.lang.RuntimeException: java.util.concurrent.ExecutionException: java.lang.RuntimeException:
java.io.IOException: Failed to open hdfs://10.10.1.155:9000/accumulo/tables/!0/table_info/A00002gy.rf
	at org.apache.accumulo.tserver.TabletServer$ThriftClientHandler.continueMultiScan(TabletServer.java:1429)
	at org.apache.accumulo.tserver.TabletServer$ThriftClientHandler.startMultiScan(TabletServer.java:1381)
	at sun.reflect.GeneratedMethodAccessor7.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
	at java.lang.reflect.Method.invoke(Method.java:597)
	at org.apache.accumulo.trace.instrument.thrift.TraceWrap$1.invoke(TraceWrap.java:63)
	at com.sun.proxy.$Proxy10.startMultiScan(Unknown Source)
	at org.apache.accumulo.core.tabletserver.thrift.TabletClientService$Processor$startMultiScan.getResult(TabletClientService.java:2252)
	at org.apache.accumulo.core.tabletserver.thrift.TabletClientService$Processor$startMultiScan.getResult(TabletClientService.java:2236)
	at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)
	at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
	at org.apache.accumulo.server.util.TServerUtils$TimedProcessor.process(TServerUtils.java:171)
	at org.apache.thrift.server.AbstractNonblockingServer$FrameBuffer.invoke(AbstractNonblockingServer.java:478)
	at org.apache.accumulo.server.util.TServerUtils$THsHaServer$Invocation.run(TServerUtils.java:231)
	at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
	at org.apache.accumulo.trace.instrument.TraceRunnable.run(TraceRunnable.java:47)
	at org.apache.accumulo.core.util.LoggingRunnable.run(LoggingRunnable.java:34)
	at java.lang.Thread.run(Thread.java:662)
Caused by: java.util.concurrent.ExecutionException: java.lang.RuntimeException: java.io.IOException:
Failed to open hdfs://10.10.1.155:9000/accumulo/tables/!0/table_info/A00002gy.rf
	at org.apache.accumulo.tserver.TabletServer$ScanTask.get(TabletServer.java:754)
	at org.apache.accumulo.tserver.TabletServer$ThriftClientHandler.continueMultiScan(TabletServer.java:1416)
	... 18 more
Caused by: java.lang.RuntimeException: java.io.IOException: Failed to open hdfs://10.10.1.155:9000/accumulo/tables/!0/table_info/A00002gy.rf
	at org.apache.accumulo.tserver.TabletServer$ThriftClientHandler$LookupTask.run(TabletServer.java:1125)
	at org.apache.accumulo.trace.instrument.TraceRunnable.run(TraceRunnable.java:47)
	... 5 more
Caused by: java.io.IOException: Failed to open hdfs://10.10.1.155:9000/accumulo/tables/!0/table_info/A00002gy.rf
	at org.apache.accumulo.tserver.FileManager.reserveReaders(FileManager.java:333)
	at org.apache.accumulo.tserver.FileManager.access$500(FileManager.java:58)
	at org.apache.accumulo.tserver.FileManager$ScanFileManager.openFiles(FileManager.java:478)
	at org.apache.accumulo.tserver.FileManager$ScanFileManager.openFileRefs(FileManager.java:466)
	at org.apache.accumulo.tserver.FileManager$ScanFileManager.openFiles(FileManager.java:486)
	at org.apache.accumulo.tserver.Tablet$ScanDataSource.createIterator(Tablet.java:2027)
	at org.apache.accumulo.tserver.Tablet$ScanDataSource.iterator(Tablet.java:1989)
	at org.apache.accumulo.core.iterators.system.SourceSwitchingIterator.seek(SourceSwitchingIterator.java:163)
	at org.apache.accumulo.tserver.Tablet.lookup(Tablet.java:1565)
	at org.apache.accumulo.tserver.Tablet.lookup(Tablet.java:1672)
	at org.apache.accumulo.tserver.TabletServer$ThriftClientHandler$LookupTask.run(TabletServer.java:1114)
	... 6 more
Caused by: java.io.FileNotFoundException: File does not exist: /accumulo/tables/!0/table_info/A00002gy.rf
	at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.fetchLocatedBlocks(DFSClient.java:2006)
	at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.openInfo(DFSClient.java:1975)
	at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.<init>(DFSClient.java:1967)
	at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:735)
	at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:165)
	at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:436)
	at org.apache.accumulo.core.file.blockfile.impl.CachableBlockFile$Reader.getBCFile(CachableBlockFile.java:256)
	at org.apache.accumulo.core.file.blockfile.impl.CachableBlockFile$Reader.access$000(CachableBlockFile.java:143)
	at org.apache.accumulo.core.file.blockfile.impl.CachableBlockFile$Reader$MetaBlockLoader.get(CachableBlockFile.java:212)
	at org.apache.accumulo.core.file.blockfile.impl.CachableBlockFile$Reader.getBlock(CachableBlockFile.java:313)
	at org.apache.accumulo.core.file.blockfile.impl.CachableBlockFile$Reader.getMetaBlock(CachableBlockFile.java:367)
	at org.apache.accumulo.core.file.blockfile.impl.CachableBlockFile$Reader.getMetaBlock(CachableBlockFile.java:143)
	at org.apache.accumulo.core.file.rfile.RFile$Reader.<init>(RFile.java:825)
	at org.apache.accumulo.core.file.rfile.RFileOperations.openReader(RFileOperations.java:79)
	at org.apache.accumulo.core.file.DispatchingFileFactory.openReader(FileOperations.java:119)
	at org.apache.accumulo.tserver.FileManager.reserveReaders(FileManager.java:314)
	... 16 more
{noformat}

Scanning the root tablet shows the following entries:

{noformat}
root@accumulo accumulo.root> scan
!0;~ file:hdfs://10.10.1.155:9000/accumulo/tables/!0/table_info/A00002gy.rf []    8652,1763
!0;~ file:hdfs://10.10.1.155:9000/accumulo/tables/!0/table_info/F000063o.rf []    66848,35747
!0;~ last:142e00c604d0053 []    10.10.1.155:9997
!0;~ loc:142e00c604d0063 []    10.10.1.155:9997
!0;~ srv:compact []    7
!0;~ srv:dir []    hdfs://10.10.1.155:9000/accumulo/tables/!0/table_info
!0;~ srv:flush []    122
!0;~ srv:lock []    tservers/10.10.1.155:9997/zlock-0000000000$142e00c604d0063
!0;~ srv:time []    L39138
!0;~ ~tab:~pr []    \x00
!0< file:hdfs://10.10.1.155:9000/accumulo/tables/!0/default_tablet/A00002gx.rf []    1518,3
!0< last:142e00c604d000d []    10.10.1.155:9997
!0< loc:142e00c604d0063 []    10.10.1.155:9997
!0< srv:compact []    6
!0< srv:dir []    hdfs://10.10.1.155:9000/accumulo/tables/!0/default_tablet
!0< srv:flush []    122
!0< srv:lock []    tservers/10.10.1.155:9997/zlock-0000000000$142e00c604d0063
!0< srv:time []    L12
!0< ~tab:~pr []    \x01~
root@accumulo accumulo.root> 
{noformat}


So we looked for where the missing file (A00002gy.rf) was created.  Further back in the tserver
logs is this:

{noformat}
2013-12-11 05:31:55,701 [tserver.Tablet] DEBUG: Starting MajC !0;~< (NORMAL) [hdfs://10.10.1.155:9000/accumulo/tables/!0/table_info/F00002gv.rf,
hdfs://10.10.1.155:9000/accumulo/tables/!0/table_info/A00000fi.rf] --> hdfs://10.10.1.155:9000/accumulo/tables/!0/table_info/A00002gy.rf_tmp
 []
2013-12-11 05:31:55,937 [tserver.TabletServer] DEBUG: ScanSess tid 10.10.1.155:58190 !0 0
entries in 0.10 secs, nbTimes = [98 98 98.00 1]
2013-12-11 05:31:56,208 [util.TServerUtils] INFO : Decreasing server thread pool size on TabletServer
to 20
2013-12-11 05:31:56,246 [tserver.TabletServer] DEBUG: MultiScanSess 10.10.1.155:60638 2 entries
in 0.01 secs (lookup_time:0.00 secs tablets:1 ranges:1)
2013-12-11 05:31:56,333 [tserver.Compactor] DEBUG: Compaction !0;~< 20,974 read | 1,763
written | 42,979 entries/sec |  0.488 secs
2013-12-11 05:31:56,345 [tserver.Tablet] DEBUG: MajC finish lock 0.00 secs
2013-12-11 05:31:56,345 [tserver.Tablet] TABLET_HIST: !0;~< MajC [hdfs://10.10.1.155:9000/accumulo/tables/!0/table_info/F00002gv.rf,
hdfs://10.10.1.155:9000/accumulo/tables/!0/table_info/A00000fi.rf] --> hdfs://10.10.1.155:9000/accumulo/tables/!0/table_info/A00002gy.rf
{noformat}

Then the file was major compacted away here:

{noformat}
2013-12-11 05:36:56,907 [tserver.Tablet] DEBUG: Major compaction plan: [hdfs://10.10.1.155:9000/accumulo/tables/!0/table_info/A00002gy.rf,
hdfs://10.10.1.155:9000/accumulo/tables/!0/table_info/F00005yj.rf] propogate deletes : false
2013-12-11 05:36:56,907 [tserver.Tablet] DEBUG: MajC initiate lock 0.00 secs, wait 0.00 secs
2013-12-11 05:36:56,908 [tserver.Tablet] DEBUG: Starting MajC !0;~< (NORMAL) [hdfs://10.10.1.155:9000/accumulo/tables/!0/table_info/A00002gy.rf,
hdfs://10.10.1.155:9000/accumulo/tables/!0/table_info/F00005yj.rf] --> hdfs://10.10.1.155:9000/accumulo/tables/!0/table_info/A00005ym.rf_tmp
 []
2013-12-11 05:36:57,080 [tserver.Compactor] DEBUG: Compaction !0;~< 37,692 read | 0 written
| 269,228 entries/sec |  0.140 secs
2013-12-11 05:36:57,085 [tserver.Tablet] DEBUG: MajC finish lock 0.00 secs
2013-12-11 05:36:57,085 [tserver.Tablet] TABLET_HIST: !0;~< MajC [hdfs://10.10.1.155:9000/accumulo/tables/!0/table_info/A00002gy.rf,
hdfs://10.10.1.155:9000/accumulo/tables/!0/table_info/F00005yj.rf] --> hdfs://10.10.1.155:9000/accumulo/tables/!0/table_info/A00005ym.rf
{noformat}

The garbage collector decided to collect it:

{noformat}
2013-12-11 05:41:56,870 [gc.SimpleGarbageCollector] DEBUG: Deleting hdfs://10.10.1.155:9000/accumulo/tables/!0/default_tablet/A00002gx.rf
2013-12-11 05:41:56,870 [gc.SimpleGarbageCollector] DEBUG: Deleting hdfs://10.10.1.155:9000/accumulo/tables/!0/default_tablet/F00005yk.rf
2013-12-11 05:41:56,874 [gc.SimpleGarbageCollector] DEBUG: Deleting hdfs://10.10.1.155:9000/accumulo/tables/!0/table_info/F00005yj.rf
2013-12-11 05:41:56,883 [gc.SimpleGarbageCollector] DEBUG: Deleting hdfs://10.10.1.155:9000/accumulo/tables/!0/table_info/A00002gy.rf
2013-12-11 05:41:56,956 [gc.SimpleGarbageCollector] INFO : Number of data file candidates
for deletion: 4
2013-12-11 05:41:56,956 [gc.SimpleGarbageCollector] INFO : Number of data file candidates
still in use: 0
2013-12-11 05:41:56,956 [gc.SimpleGarbageCollector] INFO : Number of successfully deleted
data files: 4
2013-12-11 05:41:56,956 [gc.SimpleGarbageCollector] INFO : Number of data files delete failures:
0
2013-12-11 05:41:56,957 [gc.SimpleGarbageCollector] INFO : Collect cycle took 0.29 seconds
{noformat}
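Note the last "Deleting" line: the GC removed A00002gy.rf even though the root tablet scan above still references it.  As a sanity check, the two lists can be intersected; any path in both is in exactly this referenced-but-deleted state.  A minimal sketch (class and method names are ours, not Accumulo's):

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical cross-check: any path the GC logged as "Deleting" must not
// still appear as a file: entry in the root tablet scan. A non-empty
// intersection flags the referenced-but-deleted state seen with A00002gy.rf.
class GcCrossCheck {
    static Set<String> referencedButDeleted(List<String> gcDeleted, List<String> metaFiles) {
        Set<String> bad = new HashSet<>(gcDeleted);
        bad.retainAll(metaFiles); // keep only paths present in both lists
        return bad;
    }
}
```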

So what happened with {{A00005ym.rf}}, which supposedly replaced {{A00002gy.rf}}?

The namenode seems to think it was created and closed.  But no other Hadoop log mentions it:

{noformat}
[root@ip-10-10-1-155 hadoop]# pwd
/data0/logs/hadoop

[root@ip-10-10-1-155 hadoop]# grep A00005ym.rf *
hadoop-hadoop-namenode-ip-10-10-1-155.log:2013-12-11 05:36:57,061 INFO org.apache.hadoop.hdfs.StateChange:
BLOCK* allocateBlock: /accumulo/tables/!0/table_info/A00005ym.rf_tmp. blk_-5468581853400342461_1318
hadoop-hadoop-namenode-ip-10-10-1-155.log:2013-12-11 05:36:57,068 INFO org.apache.hadoop.hdfs.StateChange:
Removing lease on  /accumulo/tables/!0/table_info/A00005ym.rf_tmp from client DFSClient_NONMAPREDUCE_-1463880682_12
hadoop-hadoop-namenode-ip-10-10-1-155.log:2013-12-11 05:36:57,068 INFO org.apache.hadoop.hdfs.StateChange:
DIR* completeFile: /accumulo/tables/!0/table_info/A00005ym.rf_tmp is closed by DFSClient_NONMAPREDUCE_-1463880682_12

[root@ip-10-10-1-155 hadoop]#
{noformat}

So somewhere between the major compaction finishing and the root tablet's information being
updated, something went wrong.  There are no other obvious exceptions or errors in the logs.
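To frame the failure: as we understand it, the invariant is that the compaction output (A00005ym.rf) must be recorded in the tablet metadata before the inputs (A00002gy.rf, F00005yj.rf) become GC candidates.  The sketch below is our own toy model of that ordering, not Accumulo code; here the root tablet apparently never picked up the swap, leaving it pointing at a file the GC was free to delete.

```java
import java.util.HashSet;
import java.util.Set;

// Toy model of the compaction bookkeeping invariant (names are ours):
// the output file is referenced first, then the inputs are dereferenced,
// and only then do the inputs become eligible for garbage collection.
class TabletMetadataModel {
    private final Set<String> files = new HashSet<>();
    private final Set<String> gcCandidates = new HashSet<>();

    void addFile(String f) { files.add(f); }

    // Swap the compacted inputs for the compaction output.
    void recordCompaction(Set<String> inputs, String output) {
        files.add(output);           // new file is referenced first...
        files.removeAll(inputs);     // ...then old files are dereferenced...
        gcCandidates.addAll(inputs); // ...and only then queued for GC
    }

    boolean referenced(String f) { return files.contains(f); }

    // A file is safe to collect only if it is a candidate AND unreferenced.
    boolean collectible(String f) { return gcCandidates.contains(f) && !files.contains(f); }
}
```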


> Exception while attempting to bring up 1.6.0 TabletServer after moderate amount of work
> ---------------------------------------------------------------------------------------
>
>                 Key: ACCUMULO-1998
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-1998
>             Project: Accumulo
>          Issue Type: Bug
>            Reporter: Michael Allen
>            Priority: Critical
>             Fix For: 1.6.0
>
>
> The reproduction steps around this are a little bit fuzzy, but basically we ran a moderate
workload against a 1.6.0 server.  Encryption happened to be turned on, but that doesn't seem
to be germane to the problem.  After doing a moderate amount of work, Accumulo is refusing
to start up, spewing this error over and over to the log:
> {noformat}
> 2013-12-10 10:23:02,529 [tserver.TabletServer] WARN : exception while doing multi-scan

> java.lang.RuntimeException: java.io.IOException: Failed to open hdfs://10.10.1.115:9000/accumulo/tables/!0/table_info/A000042x.rf
> 	at org.apache.accumulo.tserver.TabletServer$ThriftClientHandler$LookupTask.run(TabletServer.java:1125)
> 	at org.apache.accumulo.trace.instrument.TraceRunnable.run(TraceRunnable.java:47)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
> 	at org.apache.accumulo.trace.instrument.TraceRunnable.run(TraceRunnable.java:47)
> 	at org.apache.accumulo.core.util.LoggingRunnable.run(LoggingRunnable.java:34)
> 	at java.lang.Thread.run(Thread.java:662)
> Caused by: java.io.IOException: Failed to open hdfs://10.10.1.115:9000/accumulo/tables/!0/table_info/A000042x.rf
> 	at org.apache.accumulo.tserver.FileManager.reserveReaders(FileManager.java:333)
> 	at org.apache.accumulo.tserver.FileManager.access$500(FileManager.java:58)
> 	at org.apache.accumulo.tserver.FileManager$ScanFileManager.openFiles(FileManager.java:478)
> 	at org.apache.accumulo.tserver.FileManager$ScanFileManager.openFileRefs(FileManager.java:466)
> 	at org.apache.accumulo.tserver.FileManager$ScanFileManager.openFiles(FileManager.java:486)
> 	at org.apache.accumulo.tserver.Tablet$ScanDataSource.createIterator(Tablet.java:2027)
> 	at org.apache.accumulo.tserver.Tablet$ScanDataSource.iterator(Tablet.java:1989)
> 	at org.apache.accumulo.core.iterators.system.SourceSwitchingIterator.seek(SourceSwitchingIterator.java:163)
> 	at org.apache.accumulo.tserver.Tablet.lookup(Tablet.java:1565)
> 	at org.apache.accumulo.tserver.Tablet.lookup(Tablet.java:1672)
> 	at org.apache.accumulo.tserver.TabletServer$ThriftClientHandler$LookupTask.run(TabletServer.java:1114)
> 	... 6 more
> Caused by: java.io.FileNotFoundException: File does not exist: /accumulo/tables/!0/table_info/A000042x.rf
> 	at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.fetchLocatedBlocks(DFSClient.java:2006)
> 	at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.openInfo(DFSClient.java:1975)
> 	at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.<init>(DFSClient.java:1967)
> 	at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:735)
> 	at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:165)
> 	at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:436)
> 	at org.apache.accumulo.core.file.blockfile.impl.CachableBlockFile$Reader.getBCFile(CachableBlockFile.java:256)
> 	at org.apache.accumulo.core.file.blockfile.impl.CachableBlockFile$Reader.access$000(CachableBlockFile.java:143)
> 	at org.apache.accumulo.core.file.blockfile.impl.CachableBlockFile$Reader$MetaBlockLoader.get(CachableBlockFile.java:212)
> 	at org.apache.accumulo.core.file.blockfile.impl.CachableBlockFile$Reader.getBlock(CachableBlockFile.java:313)
> 	at org.apache.accumulo.core.file.blockfile.impl.CachableBlockFile$Reader.getMetaBlock(CachableBlockFile.java:367)
> 	at org.apache.accumulo.core.file.blockfile.impl.CachableBlockFile$Reader.getMetaBlock(CachableBlockFile.java:143)
> 	at org.apache.accumulo.core.file.rfile.RFile$Reader.<init>(RFile.java:825)
> 	at org.apache.accumulo.core.file.rfile.RFileOperations.openReader(RFileOperations.java:79)
> 	at org.apache.accumulo.core.file.DispatchingFileFactory.openReader(FileOperations.java:119)
> 	at org.apache.accumulo.tserver.FileManager.reserveReaders(FileManager.java:314)
> 	... 16 more
> {noformat}
> Here are some other pieces of context:
> HDFS contents:
> {noformat}
> ubuntu@ip-10-10-1-115:/data0/logs/accumulo$ hadoop fs -lsr /accumulo/tables/
> drwxr-xr-x   - accumulo hadoop          0 2013-12-10 00:32 /accumulo/tables/!0
> drwxr-xr-x   - accumulo hadoop          0 2013-12-10 01:06 /accumulo/tables/!0/default_tablet
> drwxr-xr-x   - accumulo hadoop          0 2013-12-10 10:49 /accumulo/tables/!0/table_info
> -rw-r--r--   5 accumulo hadoop       1698 2013-12-10 00:34 /accumulo/tables/!0/table_info/F0000000.rf
> -rw-r--r--   5 accumulo hadoop      43524 2013-12-10 01:53 /accumulo/tables/!0/table_info/F000062q.rf
> drwxr-xr-x   - accumulo hadoop          0 2013-12-10 00:32 /accumulo/tables/+r
> drwxr-xr-x   - accumulo hadoop          0 2013-12-10 10:45 /accumulo/tables/+r/root_tablet
> -rw-r--r--   5 accumulo hadoop       2070 2013-12-10 10:45 /accumulo/tables/+r/root_tablet/A0000738.rf
> drwxr-xr-x   - accumulo hadoop          0 2013-12-10 00:33 /accumulo/tables/1
> drwxr-xr-x   - accumulo hadoop          0 2013-12-10 00:33 /accumulo/tables/1/default_tablet
> {noformat}
> ZooKeeper entries
> {noformat}
> [zk: localhost:2181(CONNECTED) 6] get /accumulo/371cfa3e-fe96-4a50-92e9-da7572589ffa/root_tablet/dir

> hdfs://10.10.1.115:9000/accumulo/tables/+r/root_tablet
> cZxid = 0x1b
> ctime = Tue Dec 10 00:32:56 EST 2013
> mZxid = 0x1b
> mtime = Tue Dec 10 00:32:56 EST 2013
> pZxid = 0x1b
> cversion = 0
> dataVersion = 0
> aclVersion = 0
> ephemeralOwner = 0x0
> dataLength = 54
> numChildren = 0
> {noformat}
> I'm going to preserve the state of this machine in HDFS for a while but not forever,
so if there are other pieces of context people need, let me know.



--
This message was sent by Atlassian JIRA
(v6.1.4#6159)
