accumulo-notifications mailing list archives

From GitBox <...@apache.org>
Subject [GitHub] andrewglowacki opened a new issue #949: Network issue during WAL creation results in a 'missing' WAL during recovery
Date Sat, 09 Feb 2019 04:59:37 GMT
andrewglowacki opened a new issue #949: Network issue during WAL creation results in a 'missing'
WAL during recovery
URL: https://github.com/apache/accumulo/issues/949
 
 
   As far as I can tell this is what's happening...
   
   When a new WAL is being created, after the header and OPEN mark are written, the new WAL
marker is written to ZooKeeper for the master. If a network interruption occurs so that the
marker is written but the tserver never learns that the write succeeded, the tserver deletes
the WAL from HDFS, leaving an orphaned entry in the metadata table. Accumulo then cannot
proceed with ingest for the associated tablets without manual intervention, because it
believes a WAL is missing.
   
   This was observed twice in the last three weeks on a moderately sized cluster. Why does
the tserver delete the WAL? Shouldn't the GC do this? Maybe the tserver should delete the WAL
only when the failure occurs somewhere other than the ZooKeeper step?
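
   To make the suspected sequence concrete, here is a minimal in-memory sketch. It is not Accumulo code: the class, the enum, the method names, and the two sets standing in for HDFS and the marker store are all hypothetical. It assumes the marker write can end in an "unknown" state (the write landed but the acknowledgement was lost) and compares deleting the file on any non-confirmed result with one possible reading of the suggestion above, which is to leave the file in place when the outcome is unknown.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical model of the race described above; none of these names exist in Accumulo.
public class WalCreationSketch {
    /** Outcome of the marker write as seen by the tserver (assumed states). */
    enum MarkerResult { CONFIRMED, REJECTED, UNKNOWN }

    /** In-memory stand-ins for the WAL files in HDFS and the recorded WAL markers. */
    static final Set<String> hdfsFiles = new HashSet<>();
    static final Set<String> markers = new HashSet<>();

    /** Simulates the marker write. With ackLost, the marker is stored but the caller cannot tell. */
    public static MarkerResult writeMarker(String wal, boolean ackLost) {
        markers.add(wal);
        return ackLost ? MarkerResult.UNKNOWN : MarkerResult.CONFIRMED;
    }

    /** Behavior suspected in the report: any non-confirmed result deletes the file. */
    public static void createWalDeleteOnAnyFailure(String wal, boolean ackLost) {
        hdfsFiles.add(wal);
        if (writeMarker(wal, ackLost) != MarkerResult.CONFIRMED) {
            hdfsFiles.remove(wal);
        }
    }

    /** Alternative: delete only on a definite rejection; leave unknown cases for later cleanup. */
    public static void createWalKeepOnUnknown(String wal, boolean ackLost) {
        hdfsFiles.add(wal);
        if (writeMarker(wal, ackLost) == MarkerResult.REJECTED) {
            hdfsFiles.remove(wal);
        }
    }

    /** A marker is orphaned when it is recorded but the file it names is gone. */
    public static boolean isOrphanedMarker(String wal) {
        return markers.contains(wal) && !hdfsFiles.contains(wal);
    }

    public static void main(String[] args) {
        createWalDeleteOnAnyFailure("wal-a", true);
        createWalKeepOnUnknown("wal-b", true);
        System.out.println("wal-a orphaned: " + isOrphanedMarker("wal-a"));
        System.out.println("wal-b orphaned: " + isOrphanedMarker("wal-b"));
    }
}
```

   In this model the first variant leaves a marker that points at a deleted file, while the second leaves an unreferenced-or-referenced file that a separate cleanup pass (the GC, in the question above) could decide about later. Whether that trade-off fits the real tserver code path is an open question.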
   
   Note: this only seems to happen in rare circumstances when the cluster is under heavy load.
   
   Version 1.9.2
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services
