accumulo-user mailing list archives

From Josh Elser <josh.el...@gmail.com>
Subject Re: walog consumes all the disk space on power failure
Date Fri, 03 Jun 2016 20:55:27 GMT
It depends on how much data you're writing. I can't answer that for ya.

Generally for Hadoop, you want to avoid that 80-90% utilization range (HDFS 
will limit you to 90 or 95% capacity usage by default, IIRC).
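
(If you want to see exactly how full HDFS thinks it is, something along 
these lines should show it; the reserved-space property mentioned in the 
comment is the usual knob, but double-check it against your Hadoop 
version's docs:)

    # Per-datanode capacity, DFS-used, and remaining space as HDFS sees it
    hdfs dfsadmin -report

    # To keep headroom for the OS and other non-HDFS files, the usual setting
    # is dfs.datanode.du.reserved (bytes reserved per volume) in hdfs-site.xml.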

If you're running things like MapReduce, you'll need more headroom to 
account for temporary output, jars being copied, etc. Accumulo also has some 
lag in freeing disk space (e.g. during a compaction you'll temporarily have 
double the space usage for the files being re-written), as does HDFS in 
actually deleting the blocks for files that have been removed.
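
(A quick way to watch where that lag shows up is the per-directory usage 
in HDFS; /accumulo is the parent directory from your logs, and wal/ under 
it is the one to keep an eye on here:)

    # Human-readable size of each entry under /accumulo
    hdfs dfs -du -h /accumulo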

Jayesh Patel wrote:
> So what would you consider a safe minimum amount of disk space in this case?
>
> Thank you,
> Jayesh
>
> -----Original Message-----
> From: Josh Elser [mailto:josh.elser@gmail.com]
> Sent: Thursday, June 02, 2016 1:08 AM
> To: user@accumulo.apache.org
> Subject: Re: walog consumes all the disk space on power failure
>
> Oh. Why do you only have 16GB of space...
>
> You might be able to tweak some of the configuration properties so that
> Accumulo is more aggressive in removing files, but I think you'd just kick
> the can down the road for another ~30 minutes.
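> (For example, gc.cycle.delay or tserver.walog.max.size can be changed from
> the Accumulo shell; the values below are only illustrative, so double-check
> the names and defaults against the 1.7 property documentation:
>
>    # run the Accumulo garbage collector more often than the default 5m cycle
>    config -s gc.cycle.delay=1m
>    # roll write-ahead logs at a smaller size than the default 1G
>    config -s tserver.walog.max.size=512M
>
> but again, that mostly just delays the problem.)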
>
> Jayesh Patel wrote:
>> All 3 nodes have 16GB of disk space, which was 98% consumed when we looked
>> at them a few hours after the power failed and was restored.
>> Normally it's only 33%, or about 5GB.
>> Once it got into this state, ZooKeeper couldn't even start because it
>> couldn't create some log files that it needs.  So the disk
>> space usage was real (not sure if that's what you meant).  We ended up
>> wiping the HDFS data folder and reformatting it to reclaim the space.
>>
>> We definitely didn't see complaints about writing to WALs.  The only
>> exception is the following, which showed up because the namenode wasn't in
>> the right state due to constrained resources:
>>
>> 2016-05-23 07:06:17,599 [recovery.HadoopLogCloser] WARN : Error recovering lease on hdfs://instance-accumulo:8020/accumulo/wal/instance-accumulo-3+9997/530f663b-2d6b-42a5-92d6-e8fbb9b55c2e
>> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.SafeModeException): Cannot recover the lease of /accumulo/wal/instance-accumulo-3+9997/530f663b-2d6b-42a5-92d6-e8fbb9b55c2e. Name node is in safe mode.
>> Resources are low on NN. Please add or free up more resources then turn off safe mode manually. NOTE: If you turn off safe mode before adding resources, the NN will immediately return to safe mode. Use "hdfs dfsadmin -safemode leave" to turn safe mode off.
>>           at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkNameNodeSafeMode(FSNamesystem.java:1327)
>>           at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.recoverLease(FSNamesystem.java:2828)
>>           at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.recoverLease(NameNodeRpcServer.java:667)
>>           at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.recoverLease(ClientNamenodeProtocolServerSideTranslatorPB.java:663)
>>           at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>>           at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
>>           at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
>>           at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
>>           at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045)
>>           at java.security.AccessController.doPrivileged(Native Method)
>>           at javax.security.auth.Subject.doAs(Unknown Source)
>>           at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
>>           at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2043)
>>
>>           at org.apache.hadoop.ipc.Client.call(Client.java:1476)
>>           at org.apache.hadoop.ipc.Client.call(Client.java:1407)
>>           at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
>>           at com.sun.proxy.$Proxy15.recoverLease(Unknown Source)
>>           at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.recoverLease(ClientNamenode
>>
>> -----Original Message-----
>> From: Josh Elser [mailto:josh.elser@gmail.com]
>> Sent: Tuesday, May 31, 2016 6:54 PM
>> To: user@accumulo.apache.org
>> Subject: Re: walog consumes all the disk space on power failure
>>
>> Hi Jayesh,
>>
>> Can you quantify some rough size numbers for us? Are you seeing
>> exceptions in the Accumulo tserver/master logs?
>>
>> One thought is that when Accumulo creates new WAL files, it sets the
>> blocksize to be 1G (as a trick to force HDFS into making some "non-standard"
>> guarantees for us). As a result, it will appear that there are a
>> number of very large WAL files (but they're essentially empty).
>>
>> If your instance is in some situation where Accumulo is repeatedly
>> failing to write to a WAL, it might think the WAL is bad, abandon it,
>> and try to create a new one. If this is happening each time, I could
>> see that explaining the situation you described. However, you should see
>> the TabletServers complaining loudly that they cannot write to the WALs.
>>
>> Jayesh Patel wrote:
>>> We have a 3-node Accumulo 1.7 cluster running as VMware VMs with a
>>> minute amount of data by Accumulo standards.
>>>
>>> We have run into a situation multiple times now where all the nodes
>>> have a power failure, and while they are trying to recover from it
>>> simultaneously, the walog grows rapidly and fills up all the
>>> available disk space. We have confirmed that the walog folder under
>>> /accumulo in HDFS is consuming 99% of the disk space.
>>>
>>> We have tried freeing enough space to be able to run the Accumulo
>>> processes, in the hope that they would burn through the walog, but without
>>> success: the walog just grew to take up the freed space.
>>>
>>> Given that we need to better manage the power situation, we're trying
>>> to understand what could be causing this and if there's anything we
>>> can do to avoid this situation.
>>>
>>> In case you're wondering: we have some heartbeat data being written to a
>>> table at a very small, constant rate, which is not sufficient to cause
>>> such a large write-ahead log even if HDFS was pulled out from under
>>> Accumulo's feet, so to speak, during the power failure.
>>>
>>> Thank you,
>>>
>>> Jayesh
>>>
