ambari-dev mailing list archives

From "Dmitro Lisnichenko" <dlysniche...@hortonworks.com>
Subject Re: Review Request 33435: Ambari restart/stop operation loses control of Flume agents
Date Wed, 22 Apr 2015 14:23:35 GMT

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/33435/#review81156
-----------------------------------------------------------

Ship it!

- Dmitro Lisnichenko


On April 22, 2015, 2:07 p.m., Andrew Onischuk wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/33435/
> -----------------------------------------------------------
> 
> (Updated April 22, 2015, 2:07 p.m.)
> 
> 
> Review request for Ambari and Dmitro Lisnichenko.
> 
> 
> Bugs: AMBARI-10657
>     https://issues.apache.org/jira/browse/AMBARI-10657
> 
> 
> Repository: ambari
> 
> 
> Description
> -------
> 
> PROBLEM: Ambari seems to lose control of Flume agents - reporting them as
> stopped even though the processes are still running.  
> Trying to start the agents again results in:
> 
>     
>     
>     Please shutdown the agent or disable this component, or the agent will be in an undefined state.
>     
>     Failed to bind to: /0.0.0.0:4545 Caused by: java.net.BindException: Address already in use
> 
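
The BindException above means something is still bound to port 4545, which is consistent with an old agent process that was never stopped. A minimal sketch of how that could be checked from Python before a start is attempted; `port_in_use` is a hypothetical helper for illustration, not part of Ambari or Flume:

```python
import errno
import socket


def port_in_use(port, host="0.0.0.0"):
    """Return True if binding host:port fails with EADDRINUSE.

    Hypothetical helper. It only probes the port and releases it
    again immediately; it does not identify the owning process.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind((host, port))
    except OSError as exc:
        if exc.errno == errno.EADDRINUSE:
            return True
        raise  # any other error (e.g. permissions) is not "in use"
    finally:
        sock.close()
    return False


if __name__ == "__main__":
    # 4545 is the port from the error message above.
    print(port_in_use(4545))
```

The probe does not set SO_REUSEADDR, so a port lingering in TIME_WAIT may also be reported as in use.
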
> STEPS TO REPRODUCE:  
> 1. Kill all agents using kill -9 (this step was necessary because the agents
> were still running but reported as stopped in Ambari).
> 
> 2. Start agents using Ambari.
> 
> 3. Check the content of the pid file. In this case it was 29873.
> 
> 4. Check the pid using "ps -aux | grep flume". The output in this case was:
> 
>     
>     
>     Warning: bad syntax, perhaps a bogus '-'? See /usr/share/doc/procps-3.2.8/FAQ 
>     flume 29873 0.0 0.0 106060 1308 ? Ss 13:50 0:00 bash -c export JAVA_HOME=/usr/jdk64/jdk1.7.0_45; /usr/hdp/current/flume-server/bin/flume-ng agent --name a1 --conf /etc/flume/conf/a1 --conf-file /etc/flume/conf/a1/flume.conf -Dflume.monitoring.type=ganglia -Dflume.monitoring.hosts=hdp-mn00.c.onsight.nl:8655
>     flume 29874 35.7 0.5 17222116 272028 ? Sl 13:50 0:10 /usr/jdk64/jdk1.7.0_45/bin/java -Dflume.monitoring.type=ganglia -Dflume.monitoring.hosts=10.26.118.10:8651 -Dflume.monitoring.type=ganglia -Dflume.monitoring.hosts=hdp-mn00.c.onsight.nl:8655
>     
> 
> Everything is running fine at this point.
> 
> 6. Restart agents using flume
> 
> 7. Check the content of the pid file. In this case it was still 29873.
> 
> 8. Check the pid using "ps -aux | grep flume". The output in this case was:
> 
>     
>     
>     Warning: bad syntax, perhaps a bogus '-'? See /usr/share/doc/procps-3.2.8/FAQ 
>     flume 3097 0.0 0.0 106060 1308 ? Ss 13:54 0:00 bash -c export JAVA_HOME=/usr/jdk64/jdk1.7.0_45; /usr/hdp/current/flume-server/bin/flume-ng agent --name a1 --conf /etc/flume/conf/a1 --conf-file /etc/flume/conf/a1/flume.conf -Dflume.monitoring.type=ganglia -Dflume.monitoring.hosts=hdp-mn00.c.onsight.nl:8655
>     flume 3098 7.2 0.5 17222116 271076 ? Sl 13:54 0:10 /usr/jdk64/jdk1.7.0_45/bin/java -Dflume.monitoring.type=ganglia -Dflume.monitoring.hosts=10.26.118.10:8651 -Dflume.monitoring.type=ganglia -Dflume.monitoring.hosts=hdp-mn00.c.onsight.nl:8655
>     
> 
> As you can see, the pid file was not updated, and shortly after the restart
> Ambari reports the agents as stopped.
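
The symptom described above is a pid file that still names a process that no longer exists. A minimal sketch of how such a stale pid file could be detected; the function names are hypothetical and this is not the check that Ambari's Flume scripts actually perform:

```python
import errno
import os


def read_pid(pid_file):
    """Return the positive integer pid stored in pid_file, else None."""
    try:
        with open(pid_file) as handle:
            pid = int(handle.read().strip())
    except (IOError, OSError, ValueError):
        return None  # missing, unreadable, or not a number
    return pid if pid > 0 else None


def pid_is_alive(pid):
    """Return True if a process with this pid exists.

    Signal 0 performs the existence and permission checks only;
    nothing is delivered to the target process.
    """
    try:
        os.kill(pid, 0)
    except OSError as exc:
        # EPERM: the process exists but belongs to another user.
        return exc.errno == errno.EPERM
    return True


def pid_file_is_stale(pid_file):
    """Return True if pid_file does not name a live process."""
    pid = read_pid(pid_file)
    return pid is None or not pid_is_alive(pid)


if __name__ == "__main__":
    # Path taken from the support analysis in this report.
    print(pid_file_is_stale("/var/run/flume/a1.pid"))
```

A live pid only shows that some process has that number; it does not prove the process is the Flume agent, so a real check would likely also compare the command line.
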
> 
> SUPPORT ANALYSIS:
> 
> "cat /var/run/flume/a1.pid" returns 10056 last written 16 March 2015 13:04
> 
> When I check the running processes using "ps -aux | grep flume" it shows 26288
> and 26289.
> 
>     
>     
>     flume 26288 0.0 0.0 106060 1308 ? Ss 13:04 0:00 bash -c export JAVA_HOME=/usr/jdk64/jdk1.7.0_45; /usr/hdp/current/flume-server/bin/flume-ng agent --name a1 --conf /etc/flume/conf/a1 --conf-file /etc/flume/conf/a1/flume.conf -Dflume.monitoring.type=ganglia -Dflume.monitoring.hosts=hdp-mn00.c.onsight.nl:8655
>     flume 26289 13.2 0.5 18359888 294220 ? Sl 13:04 1:15 /usr/jdk64/jdk1.7.0_45/bin/java -Dflume.monitoring.type=ganglia -Dflume.monitoring.hosts=10.26.118.10:8651 -Dflume.monitoring.type=ganglia -Dflume.monitoring.hosts=hdp-mn00.c.onsight.nl:8655
>     
> 
> The content of "/var/run/flume/ambari-state.txt" is RUNNING.
> 
> When I check the flume log file, nothing out of the ordinary is shown around
> the time the pid was updated.  
> I used: cat /var/log/flume/flume-a1.log | grep "16 Mar 2015 12:04"
> 
>     
>     
>     16 Mar 2015 12:04:13,166 INFO [Log-BackgroundWorker-c1] (org.apache.flume.channel.file.EventQueueBackingStoreFile.beginCheckpoint:214) - Start checkpoint for /home/flume/.flume/file-channel/checkpoint/checkpoint_1426501435529, elements to sync = 18272
>     16 Mar 2015 12:04:13,241 INFO [Log-BackgroundWorker-c1] (org.apache.flume.channel.file.EventQueueBackingStoreFile.checkpoint:239) - Updating checkpoint metadata: logWriteOrderID: 1426503859575, queueSize: 576, queueHead: 475305
>     16 Mar 2015 12:04:13,341 INFO [Log-BackgroundWorker-c1] (org.apache.flume.channel.file.Log.writeCheckpoint:1025) - Updated checkpoint for file: /home/flume/.flume/file-channel/data/log-6 position: 9108128 logWriteOrderID: 1426503859575
>     16 Mar 2015 12:04:13,342 INFO [Log-BackgroundWorker-c1] (org.apache.flume.channel.file.LogFile$RandomReader.close:504) - Closing RandomReader /home/flume/.flume/file-channel/data/log-4
>     16 Mar 2015 12:04:43,348 INFO [Log-BackgroundWorker-c1] (org.apache.flume.channel.file.EventQueueBackingStoreFile.beginCheckpoint:214) - Start checkpoint for /home/flume/.flume/file-channel/checkpoint/checkpoint_1426501435529, elements to sync = 20332
>     16 Mar 2015 12:04:43,519 INFO [Log-BackgroundWorker-c1] (org.apache.flume.channel.file.EventQueueBackingStoreFile.checkpoint:239) - Updating checkpoint metadata: logWriteOrderID: 1426503900154, queueSize: 0, queueHead: 495637
>     16 Mar 2015 12:04:43,628 INFO [Log-BackgroundWorker-c1] (org.apache.flume.channel.file.Log.writeCheckpoint:1025) - Updated checkpoint for file: /home/flume/.flume/file-channel/data/log-6 position: 19009888 logWriteOrderID: 1426503900154
>     16 Mar 2015 12:04:43,629 INFO [Log-BackgroundWorker-c1] (org.apache.flume.channel.file.Log.removeOldLogs:1080) - Removing old file: /home/flume/.flume/file-channel/data/log-4
>     16 Mar 2015 12:04:43,632 INFO [Log-BackgroundWorker-c1] (org.apache.flume.channel.file.Log.removeOldLogs:1080) - Removing old file: /home/flume/.flume/file-channel/data/log-4.meta
>     
> 
> Attached are flume conf, the output of the restart operation in ambari when
> the agents are reported as stopped but are still running, agent log and
> screenshot of ambari.
> 
> 
> Diffs
> -----
> 
>   ambari-common/src/main/python/resource_management/libraries/functions/flume_agent_helper.py 4070006 
>   ambari-server/src/main/resources/common-services/FLUME/1.4.0.2.0/package/scripts/flume.py ee1ed00 
>   ambari-server/src/test/python/stacks/2.0.6/FLUME/test_flume.py 77494af 
> 
> Diff: https://reviews.apache.org/r/33435/diff/
> 
> 
> Testing
> -------
> 
> mvn clean test
> 
> 
> Thanks,
> 
> Andrew Onischuk
> 
>

