airavata-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Christie, Marcus Aaron" <machr...@iu.edu>
Subject Re: Zookeeper log filling up disk
Date Tue, 30 May 2017 19:17:21 GMT
It occurs to me that this may be related to the large number of TIME_WAIT connections we’ve
seen before [1].  Since this error was creating and closing several connections per second
it ended up generating lots of connections in TIME_WAIT.

I just thought I would mention this in case it comes up again. I don’t think we looked at
Kafka/Logstash last time the TIME_WAIT problem occurred.

[1] https://issues.apache.org/jira/browse/AIRAVATA-2321

On May 26, 2017, at 4:47 PM, Christie, Marcus Aaron <machrist@iu.edu<mailto:machrist@iu.edu>>
wrote:

Dev,

This message is just to document what happened today in the SGRC dev environment.  But if
you have any insight on what caused this to happen, please share.

TLDR: Zookeeper log fills disk, Logstash is spamming ZK with requests, not sure what caused
it but have reconfigured ZK logging to rotate files and not fill up disk.

So today in our dev environment the Zookeeper server’s log file filled up the disk. The
size was about 190GB.  It wasn’t being rotated so it has possibly been growing for a while.
On the other hand I saw in the log file that there were several messages a second, that looked
like this:

2017-05-26 11:42:35,070 [myid:] - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1008]
- Closed socket connection for client /127.0.0.1:46462 (no session established for client)
2017-05-26 11:42:35,070 [myid:] - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@192]
- Accepted socket connection from /127.0.0.1:46464
2017-05-26 11:42:35,071 [myid:] - WARN  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@362]
- Exception causing close of session 0x0 due to java.io.IOException: Unreasonable length =
1684371039
2017-05-26 11:42:35,071 [myid:] - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1008]
- Closed socket connection for client /127.0.0.1:46464 (no session established for client)
2017-05-26 11:42:35,071 [myid:] - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@192]
- Accepted socket connection from /127.0.0.1:46466
2017-05-26 11:42:35,071 [myid:] - WARN  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@362]
- Exception causing close of session 0x0 due to java.io.IOException: Unreasonable length =
1684371039
2017-05-26 11:42:35,071 [myid:] - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1008]
- Closed socket connection for client /127.0.0.1:46466 (no session established for client)
2017-05-26 11:42:35,072 [myid:] - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@192]
- Accepted socket connection from /127.0.0.1:46468
2017-05-26 11:42:35,072 [myid:] - WARN  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@362]
- Exception causing close of session 0x0 due to java.io.IOException: Unreasonable length =
1684371039
2017-05-26 11:42:35,072 [myid:] - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1008]
- Closed socket connection for client /127.0.0.1:46468 (no session established for client)
…


I shut down api-orch and gfac and Kafka (which is just pushing log messages to Logstash).
 I then deleted the ./data directory in the Zookeeper installation and restarted it. Still
getting those error messages above.

Eventually I found that Logstash was apparently trying to make Zookeeper connections. So I
shut it down as well.  Once I shut down Logstash the error messages stop in the Zookeeper
log.

It’s hard to say whether the problem was
1. Logstash inappropriately sending too large messages to Zookeeper
2. Or, Zookeeper’s log file fills up disk space causing Zookeeper’s database to become
corrupted. Once the disk fills up, all sorts of weird behavior can start to manifest.

I’ve reconfigured the Zookeeper logging to log to a rotated file, that rotates at 10MB and
keeps a max of 10 rotated log files.  That should prevent running out of disk space. Used
this [1] as a resource.  I created issue AIRAVATA-2411 [2] to incorporate this into our Ansible
scripts.


Thanks,

Marcus


[1] https://community.hortonworks.com/content/supportkb/49091/zookeeperout-file-keeps-growing-until-restarted.html
[2] https://issues.apache.org/jira/browse/AIRAVATA-2411


Mime
View raw message