accumulo-dev mailing list archives

From Michael Wall <mjw...@gmail.com>
Subject Re: recovering Accumulo instance from missing root WALs (deleted by gc)
Date Fri, 21 Apr 2017 12:24:20 GMT
Jonathan,

Sorry you are having problems here.  What version are you using?  Do you
have HDFS trash turned on?  If so, look for those WALs in the trash.  If
you find them, simply move them back.
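
For example, something along these lines (paths are illustrative; the trash
for the user running the gc usually lives under /user/<user>/.Trash, and a
trashed file keeps its original path under Current):

    # see whether one of the deleted WALs is still sitting in the trash
    hdfs dfs -ls -R /user/accumulo/.Trash | grep 0d28801e-322e-44e6-97e3-a34a14b4bd1a

    # if it is, move it back to where it was
    hdfs dfs -mv /user/accumulo/.Trash/Current/accumulo/wal/stti-data-103.bbn.com+10011/0d28801e-322e-44e6-97e3-a34a14b4bd1a \
        /accumulo/wal/stti-data-103.bbn.com+10011/0d28801e-322e-44e6-97e3-a34a14b4bd1a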

If you can't find them, then the data in those WALs is gone.  You can try
an "hdfs dfs -touchz <path>" on those locations and things should recover.
But again, there will be data loss on the root, which will cascade to the
metadata table and so on.  Typically that means tablets come back with an
older copy of their data, so if you can determine when the problem started
and replay all the ingest since that time, you can recover.
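
For example, using one of the WAL paths from your root table entries below
(this only creates empty placeholder files so recovery can proceed; the
mutations that were in those WALs are still lost):

    hdfs dfs -touchz /accumulo/wal/stti-data-103.bbn.com+10011/0d28801e-322e-44e6-97e3-a34a14b4bd1a

Repeat for each of the five missing WALs.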

Take a look at https://issues.apache.org/jira/browse/ACCUMULO-4157 and see
if this seems like what happened.

Mike


On Fri, Apr 21, 2017 at 7:43 AM Jonathan LASKO <jonathan.lasko@raytheon.com>
wrote:

> Hello Accumulo wizards,
>
> I have a large schema of test data in an Accumulo instance that is
> currently inaccessible, and I would like to recover it if possible. I'll
> explain the problem in hopes that some folks who know the intricacies of
> the Accumulo root table, WALs, and recovery processes can tell me whether
> there are any additional actions I can take or whether I should treat this
> schema as hosed.
>
> The problem is similar to what was reported here (
> https://community.hortonworks.com/questions/52718/failed-to-locate-tablet-for-table-0-row-err.html),
> i.e., no tablets are loaded except one from accumulo.root, and the logs are
> repeating these messages rapidly:
>
> ==> monitor_stti-master.bbn.com.debug.log <==
> 2017-04-21 07:10:55,047 [impl.ThriftScanner] DEBUG:  Failed to locate
> tablet for table : !0 row : ~err_
>
> ==> master_stti-master.bbn.com.debug.log <==
> 2017-04-21 07:10:55,430 [master.Master] DEBUG: Finished gathering
> information from 13 servers in 0.03 seconds
> 2017-04-21 07:10:55,430 [master.Master] DEBUG: not balancing because there
> are unhosted tablets: 2
>
> The RecoveryManager insists that it is trying to recover five WALs:
>
> 2017-04-21 07:28:48,349 [recovery.RecoveryManager] DEBUG: Recovering hdfs://stti-nn-01.bbn.com:8020/accumulo/wal/stti-data-103.bbn.com+10011/0d28801e-322e-44e6-97e3-a34a14b4bd1a to hdfs://stti-nn-01.bbn.com:8020/accumulo/recovery/0d28801e-322e-44e6-97e3-a34a14b4bd1a
> 2017-04-21 07:28:48,358 [recovery.RecoveryManager] DEBUG: Recovering hdfs://stti-nn-01.bbn.com:8020/accumulo/wal/stti-data-103.bbn.com+10011/696d4353-0041-4397-a1f5-b8600b5cb2e9 to hdfs://stti-nn-01.bbn.com:8020/accumulo/recovery/696d4353-0041-4397-a1f5-b8600b5cb2e9
> 2017-04-21 07:28:48,362 [recovery.RecoveryManager] DEBUG: Recovering hdfs://stti-nn-01.bbn.com:8020/accumulo/wal/stti-data-103.bbn.com+10011/e62f4195-c7d6-419a-a696-ff89b10cecc3 to hdfs://stti-nn-01.bbn.com:8020/accumulo/recovery/e62f4195-c7d6-419a-a696-ff89b10cecc3
> 2017-04-21 07:28:48,366 [recovery.RecoveryManager] DEBUG: Recovering hdfs://stti-nn-01.bbn.com:8020/accumulo/wal/stti-data-102.bbn.com+10011/01a0887e-4ac8-4772-8f5f-b99371e1df0a to hdfs://stti-nn-01.bbn.com:8020/accumulo/recovery/01a0887e-4ac8-4772-8f5f-b99371e1df0a
> 2017-04-21 07:28:48,369 [recovery.RecoveryManager] DEBUG: Recovering hdfs://stti-nn-01.bbn.com:8020/accumulo/wal/stti-data-102.bbn.com+10011/6f392ec5-821b-4fd5-83e4-baf1f47d8105 to hdfs://stti-nn-01.bbn.com:8020/accumulo/recovery/6f392ec5-821b-4fd5-83e4-baf1f47d8105
>
> Based on the advice from the post linked above, I grepped the logs and was
> able to confirm that all five of those WALs were actually deleted (here's
> the output from my grep; note the earlier timestamps):
>
> gc_stti-master.bbn.com.debug.log.2:2017-04-12 14:49:36,275 [gc.GarbageCollectWriteAheadLogs] DEBUG: deleted [hdfs://stti-nn-01.bbn.com:8020/accumulo/wal/stti-data-102.bbn.com+10011/6f392ec5-821b-4fd5-83e4-baf1f47d8105] from stti-data-102.bbn.com+10011
> gc_stti-master.bbn.com.debug.log.2:2017-04-12 14:49:36,280 [gc.GarbageCollectWriteAheadLogs] DEBUG: deleted [hdfs://stti-nn-01.bbn.com:8020/accumulo/wal/stti-data-103.bbn.com+10011/e62f4195-c7d6-419a-a696-ff89b10cecc3] from stti-data-103.bbn.com+10011
> gc_stti-master.bbn.com.debug.log.3:2017-04-03 20:25:26,699 [gc.GarbageCollectWriteAheadLogs] DEBUG: deleted [hdfs://stti-nn-01.bbn.com:8020/accumulo/wal/stti-data-103.bbn.com+10011/0d28801e-322e-44e6-97e3-a34a14b4bd1a] from stti-data-103.bbn.com+10011
> gc_stti-master.bbn.com.debug.log.3:2017-04-08 16:32:11,106 [gc.GarbageCollectWriteAheadLogs] DEBUG: deleted [hdfs://stti-nn-01.bbn.com:8020/accumulo/wal/stti-data-102.bbn.com+10011/01a0887e-4ac8-4772-8f5f-b99371e1df0a] from stti-data-102.bbn.com+10011
> gc_stti-master.bbn.com.debug.log.3:2017-04-08 16:37:14,875 [gc.GarbageCollectWriteAheadLogs] DEBUG: deleted [hdfs://stti-nn-01.bbn.com:8020/accumulo/wal/stti-data-103.bbn.com+10011/696d4353-0041-4397-a1f5-b8600b5cb2e9] from stti-data-103.bbn.com+10011
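>
> For reference, the grep was roughly the following (reconstructed, so the
> exact invocation may have differed), filtered down to the five WAL UUIDs
> above:
>
>     grep "GarbageCollectWriteAheadLogs" gc_stti-master.bbn.com.debug.log* | grep deleted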
>
> All five WALs appear as log references in the accumulo.root table:
>
> !0;~ log:stti-data-103.bbn.com:10011/hdfs://stti-nn-01.bbn.com:8020/accumulo/wal/stti-data-103.bbn.com+10011/0d28801e-322e-44e6-97e3-a34a14b4bd1a []    hdfs://stti-nn-01.bbn.com:8020/accumulo/wal/stti-data-103.bbn.com+10011/0d28801e-322e-44e6-97e3-a34a14b4bd1a|1
> !0;~ log:stti-data-103.bbn.com:10011/hdfs://stti-nn-01.bbn.com:8020/accumulo/wal/stti-data-103.bbn.com+10011/696d4353-0041-4397-a1f5-b8600b5cb2e9 []    hdfs://stti-nn-01.bbn.com:8020/accumulo/wal/stti-data-103.bbn.com+10011/696d4353-0041-4397-a1f5-b8600b5cb2e9|1
> !0;~ log:stti-data-103.bbn.com:10011/hdfs://stti-nn-01.bbn.com:8020/accumulo/wal/stti-data-103.bbn.com+10011/e62f4195-c7d6-419a-a696-ff89b10cecc3 []    hdfs://stti-nn-01.bbn.com:8020/accumulo/wal/stti-data-103.bbn.com+10011/e62f4195-c7d6-419a-a696-ff89b10cecc3|1
> ...
> !0< log:stti-data-102.bbn.com:10011/hdfs://stti-nn-01.bbn.com:8020/accumulo/wal/stti-data-102.bbn.com+10011/01a0887e-4ac8-4772-8f5f-b99371e1df0a []    hdfs://stti-nn-01.bbn.com:8020/accumulo/wal/stti-data-102.bbn.com+10011/01a0887e-4ac8-4772-8f5f-b99371e1df0a|1
> !0< log:stti-data-102.bbn.com:10011/hdfs://stti-nn-01.bbn.com:8020/accumulo/wal/stti-data-102.bbn.com+10011/6f392ec5-821b-4fd5-83e4-baf1f47d8105 []    hdfs://stti-nn-01.bbn.com:8020/accumulo/wal/stti-data-102.bbn.com+10011/6f392ec5-821b-4fd5-83e4-baf1f47d8105|1
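>
> (These entries came from scanning the root table in the Accumulo shell,
> roughly "scan -t accumulo.root -c log"; the "..." marks rows I trimmed.)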
>
> I also observe three outstanding fate transactions (at least two of
> which appear to be related to the accumulo.root table):
>
> root@bbn-beta> fate print
> txid: 6b33fa130909f05d  status: IN_PROGRESS  op: CompactRange  locked: [R:+accumulo, R:!0]  locking: []  top: CompactionDriver
> txid: 564d758d584af61e  status: IN_PROGRESS  op: CompactRange  locked: [R:+accumulo, R:!0]  locking: []  top: CompactionDriver
> txid: 4a620317a53a4a93  status: IN_PROGRESS  op: CreateTable   locked: [W:5e, R:+default]   locking: []  top: PopulateMetadata
>
> I checked in ZooKeeper and the /accumulo/$INSTANCE/root_tablet/walogs and
> /accumulo/$INSTANCE/recovery/[locks] directories are all empty.
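>
> (Checked with zkCli.sh, roughly:
>
>     ls /accumulo/$INSTANCE/root_tablet/walogs
>     ls /accumulo/$INSTANCE/recovery
>
> where $INSTANCE is the instance id; both listings came back empty.)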
>
> I don't know exactly what to do at this point. I could:
>
> a) Try deleting the fate operations and see if that releases the Accumulo
> instance (a rough sketch follows this list).
> b) Try deleting the accumulo.root table entries pointing to the
> already-deleted WALs (also sketched below).
> c) Call it quits on this instance, blow it away, and start re-generating
> my test data over the weekend.
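>
> For (a), my understanding is that the shell's fate command can do this,
> something like the following (syntax from the docs, so please correct me
> if I have it wrong):
>
>     fate fail 6b33fa130909f05d
>     fate delete 6b33fa130909f05d
>
> and likewise for the other two transaction ids. For (b), presumably a
> shell delete against accumulo.root for each stale log entry, e.g.:
>
>     table accumulo.root
>     delete !0;~ log stti-data-103.bbn.com:10011/hdfs://stti-nn-01.bbn.com:8020/accumulo/wal/stti-data-103.bbn.com+10011/0d28801e-322e-44e6-97e3-a34a14b4bd1a
>
> I'd double-check both against the documentation before running them on a
> live instance.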
>
> Since option (c) remains available as a fallback, I would most likely try
> options (a) and (b) first (and probably in that order). But I would love to
> get some insight from the Accumulo experts before doing anything.
>
> Thanks in advance,
>
> Jonathan
>
