accumulo-user mailing list archives

From Josh Elser <josh.el...@gmail.com>
Subject Re: walog consumes all the disk space on power failure
Date Thu, 02 Jun 2016 05:08:19 GMT
Oh. Why do you only have 16GB of space...

You might be able to tweak some of the configuration properties so that 
Accumulo is more aggressive about removing files, but I think you'd just 
be kicking the can down the road for another ~30 minutes.
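For example, the knobs that would matter here are the WAL roll size and the
garbage collector's cycle timing. The property names below are from the
Accumulo 1.7 configuration reference; the values are purely illustrative,
not recommendations, and on a 16GB disk this only buys time. In
accumulo-site.xml:

```
  <!-- Roll WALs at a smaller size so unreferenced logs become
       collectible sooner (default is 1G). -->
  <property>
    <name>tserver.walog.max.size</name>
    <value>256M</value>
  </property>
  <!-- Shorten the delay between Accumulo GC cycles (default 5m)
       so dead WALs are reclaimed more aggressively. -->
  <property>
    <name>gc.cycle.delay</name>
    <value>1m</value>
  </property>
  <!-- Time to wait before the first GC cycle after startup. -->
  <property>
    <name>gc.cycle.start</name>
    <value>30s</value>
  </property>
```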

Jayesh Patel wrote:
> All 3 nodes have 16GB of disk space, which was 98% consumed when we looked at
> them a few hours after the power failed and was restored.  Normally usage is
> only 33%, or about 5GB.
> Once it got into this state, ZooKeeper couldn't even start because it
> couldn't create the log files it needs.  So the disk space usage was real, in
> case that's what you were asking.  We ended up wiping the HDFS data folder
> and reformatting it to reclaim the space.
>
> We definitely didn't see complaints about writing to WALs.  The only
> exception is the following, which showed up because the namenode wasn't in
> the right state due to constrained resources:
>
> 2016-05-23 07:06:17,599 [recovery.HadoopLogCloser] WARN : Error recovering lease on hdfs://instance-accumulo:8020/accumulo/wal/instance-accumulo-3+9997/530f663b-2d6b-42a5-92d6-e8fbb9b55c2e
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.SafeModeException): Cannot recover the lease of /accumulo/wal/instance-accumulo-3+9997/530f663b-2d6b-42a5-92d6-e8fbb9b55c2e. Name node is in safe mode.
> Resources are low on NN. Please add or free up more resources then turn off safe mode manually. NOTE:  If you turn off safe mode before adding resources, the NN will immediately return to safe mode. Use "hdfs dfsadmin -safemode leave" to turn safe mode off.
>          at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkNameNodeSafeMode(FSNamesystem.java:1327)
>          at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.recoverLease(FSNamesystem.java:2828)
>          at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.recoverLease(NameNodeRpcServer.java:667)
>          at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.recoverLease(ClientNamenodeProtocolServerSideTranslatorPB.java:663)
>          at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>          at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
>          at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
>          at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
>          at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045)
>          at java.security.AccessController.doPrivileged(Native Method)
>          at javax.security.auth.Subject.doAs(Unknown Source)
>          at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
>          at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2043)
>
>          at org.apache.hadoop.ipc.Client.call(Client.java:1476)
>          at org.apache.hadoop.ipc.Client.call(Client.java:1407)
>          at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
>          at com.sun.proxy.$Proxy15.recoverLease(Unknown Source)
>          at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.recoverLease(ClientNamenode
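For reference, the recovery sequence this log message suggests would look
roughly like the following, using standard `hdfs dfsadmin` subcommands (the
namenode drops straight back into safe mode unless space is actually freed
first, as the log itself warns):

```
# Check whether the namenode is still in safe mode
hdfs dfsadmin -safemode get

# Confirm how much space is actually available on the datanodes
hdfs dfsadmin -report

# Only after freeing disk space, turn safe mode off manually
hdfs dfsadmin -safemode leave
```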
>
> -----Original Message-----
> From: Josh Elser [mailto:josh.elser@gmail.com]
> Sent: Tuesday, May 31, 2016 6:54 PM
> To: user@accumulo.apache.org
> Subject: Re: walog consumes all the disk space on power failure
>
> Hi Jayesh,
>
> Can you quantify some rough size numbers for us? Are you seeing exceptions
> in the Accumulo tserver/master logs?
>
> One thought is that when Accumulo creates new WAL files, it sets the
> blocksize to be 1G (as a trick to force HDFS into making some "non-standard"
> guarantees for us). As a result, it will appear that there are a number of
> very large WAL files (but they're essentially empty).
>
> If your instance is in a state where Accumulo repeatedly fails to write to a
> WAL, it might decide the WAL is bad, abandon it, and create a new one. If
> this is happening over and over, I could see it explaining the situation you
> described. However, you should see the TabletServers complaining loudly that
> they cannot write to the WALs.
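One way to tell apparent size from real consumption is to compare the
per-file WAL sizes with the datanode-reported usage. These are stock Hadoop
2.x client commands; the path matches the one in the logs above:

```
# Per-file sizes of the WALs as HDFS reports them
hdfs dfs -du -h /accumulo/wal

# Actual DFS capacity used across the datanodes
hdfs dfsadmin -report
```

If the `-du` totals are far larger than the "DFS Used" figure in the report,
the space is only apparent (the 1G blocksize trick); if they agree, the WALs
really are eating the disk.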
>
> Jayesh Patel wrote:
>> We have a 3 node Accumulo 1.7 cluster running as VMware VMs with a
>> minute amount of data by Accumulo standards.
>>
>> We have run into a situation multiple times now where all the nodes
>> lose power, and while they are trying to recover from it
>> simultaneously, the walog grows exponentially and fills up all the
>> available disk space. We have confirmed that the walog folder under
>> /accumulo in HDFS is consuming 99% of the disk space.
>>
>> We have tried freeing enough space to run the Accumulo processes in
>> the hope that they would burn through the walog, without success; the
>> walog just grew to take up the freed space.
>>
>> Given that we need to better manage the power situation, we're trying
>> to understand what could be causing this and if there's anything we
>> can do to avoid this situation.
>>
>> In case you're wondering: we have some heartbeat data being written to
>> a table at a very small, constant rate, which is not sufficient to
>> cause such a large write-ahead log even if HDFS was pulled out from
>> under Accumulo's feet, so to speak, during the power failure.
>>
>> Thank you,
>>
>> Jayesh
>>
