cloudstack-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Saurav Lahiri <saurav.lah...@sungard.com>
Subject Re: system vm disk space issue in ACS 4.3
Date Mon, 24 Mar 2014 13:32:00 GMT
So if you look at the code for run.sh you will see that it is a while loop
that calls _run.sh. If _run.sh  gets terminated and the java process is
still running, future runs of _run.sh will attempt to start the java
process and will fail. But as it fails a whole lot of log message will get
logged to cloud.out by the /etc/init.d/cloud script. I guess the fix I am
working on should also address( by preventing scripts from logging to
cloud.out) this issue.

Thanks
Saurav


On Thu, Mar 20, 2014 at 10:29 PM, Sunil Bakhru <sbakhru@juniper.net> wrote:

> The fix for 6258 will contain the log size and prevent it from hogging the
> root partition.
>
> However, we do see the following 'ERROR' log which fills up our system VM
> log file.  Has anybody seen this?
> Any pointers on what could possibly be wrong?
>
>
> 2014-03-20 08:40:54,395 ERROR [cloud.agent.AgentShell] (main:null) Unable
> to start agent: Java process is being started twice.  If this is not true,
> remove /var/run/agent.SecStorage.pid
> 2014-03-20 08:41:05,359 INFO  [cloud.agent.AgentShell] (main:null) Agent
> started
> 2014-03-20 08:41:05,364 INFO  [cloud.agent.AgentShell] (main:null)
> Implementation Version is 4.3.0
> 2014-03-20 08:41:05,364 INFO  [cloud.agent.AgentShell] (main:null)
> agent.properties found at /usr/local/cloud/systemvm/conf/agent.properties
> 2014-03-20 08:41:05,370 DEBUG [cloud.agent.AgentShell] (main:null) Found
> property: instance
> 2014-03-20 08:41:05,370 DEBUG [cloud.agent.AgentShell] (main:null) Found
> property: NfsSecondaryStorageResource.id
> 2014-03-20 08:41:05,370 DEBUG [cloud.agent.AgentShell] (main:null) Found
> property: resource
> 2014-03-20 08:41:05,370 INFO  [cloud.agent.AgentShell] (main:null)
> Defaulting to using properties file for storage
> 2014-03-20 08:41:05,371 INFO  [cloud.agent.AgentShell] (main:null)
> Defaulting to the constant time backoff algorithm
> 2014-03-20 08:41:05,380 INFO  [cloud.utils.LogUtils] (main:null) log4j
> configuration found at /usr/local/cloud/systemvm/conf/log4j-cloud.xml
> 2014-03-20 08:41:05,393 DEBUG [cloud.agent.AgentShell] (main:null)
> Checking to see if agent.SecStorage.pid exists.
> 2014-03-20 08:41:05,396 DEBUG [cloud.utils.ProcessUtil] (main:null)
> environment.properties could not be opened
> 2014-03-20 08:41:05,403 DEBUG [cloud.utils.ProcessUtil] (main:null)
> Executing: bash -c ps -p 3469
> 2014-03-20 08:41:05,411 DEBUG [cloud.utils.ProcessUtil] (main:null)
> Execution is successful.
> 2014-03-20 08:41:05,411 DEBUG [cloud.utils.ProcessUtil] (main:null)   PID
> TTY          TIME CMD
>  3469 ?        00:00:36 java
>
> 2014-03-20 08:41:05,411 ERROR [cloud.agent.AgentShell] (main:null) Unable
> to start agent: Java process is being started twice.  If this is not true,
> remove /var/run/agent.SecStorage.pid
>
>
>
> Appreciate the help.
>
> Thanks,
> Sunil
>
>
>
>
>
> On 3/19/14 8:40 AM, "Rajesh Battala" <rajesh.battala@citrix.com> wrote:
>
> >Great. Post your patch at reviews.apache.org
> >
> >Thanks
> >Rajesh Battala
> >
> >-----Original Message-----
> >From: Saurav Lahiri [mailto:saurav.lahiri@sungard.com]
> >Sent: Wednesday, March 19, 2014 8:49 PM
> >To: dev@cloudstack.apache.org
> >Subject: Re: system vm disk space issue in ACS 4.3
> >
> >Thanks Rajesh. I have created a jira ticket for this
> >https://issues.apache.org/jira/browse/CLOUDSTACK-6258. Will send in the
> >fix for review in a couple of days.
> >
> >Thanks
> >Saurav
> >
> >
> >On Wed, Mar 19, 2014 at 8:03 PM, Rajesh Battala
> ><rajesh.battala@citrix.com>wrote:
> >
> >> Can you please file a bug and send your fix for review.
> >>
> >> Thanks
> >> Rajesh Battala
> >>
> >> -----Original Message-----
> >> From: Saurav Lahiri [mailto:saurav.lahiri@sungard.com]
> >> Sent: Wednesday, March 19, 2014 7:20 PM
> >> To: dev@cloudstack.apache.org
> >> Subject: Re: system vm disk space issue in ACS 4.3
> >>
> >> The problem appears to be the start function in the /etc/init.d/cloud
> >> service for console proxy.
> >> More specifically the following line also writes to /var/log/cloud.out
> >>
> >>
> >> ----------------------------------------------------------------------
> >> ------------------------------------------------------------
> >> (cd $CLOUD_COM_HOME/systemvm; nohup ./run.sh >
> >> /var/log/cloud/cloud.out
> >> 2>&1 & )
> >>
> >> ----------------------------------------------------------------------
> >> ------------------------------------------------------------
> >>
> >> since run.sh calls _run.sh and both has "set -x" enabled, in certain
> >> situations they can keep logging messages to cloud.out without being
> >> aware of the settings in log4j-cloud.xml
> >>
> >>
> >> One way to fix that could be that run.sh and _run.sh would log to
> >> cloud.out only if a debug flag was set to true, otherwise only the
> >> java process would write to cloud.out and log4j would respect the
> >> settings in log4j-cloud.xml
> >>
> >>
> >> Thanks
> >> Saurav
> >>
> >>
> >>
> >> On Mon, Mar 17, 2014 at 8:47 PM, Saurav Lahiri
> >> <saurav.lahiri@sungard.com
> >> >wrote:
> >>
> >> > Could it have  something to do with the RollingFileAppender that is
> >> > being used.
> >> > The following
> >> > rollingfileappender<http://apache-logging.6191.n7.nabble.com/Rolling
> >> > Fi leAppender-not-working-consistently-td8582.html> link appears to
> >> > be a
> >> bit outdated but they more or less describe a similar problem that we
> >> are seeing?
> >> >
> >> >
> >> > On our environment that is what we have seeing for sometime on
> >> > console proxy.  The root filesystem goes full with the cloud.out.*
> >> > occupying all the space. This happens pretty frequently and we have
> >> > to regularly recycle the console proxy to resolve this issue.
> >> >
> >> >
> >> > As seen below, cloud.out.2 should not have exceeded 10MB but it
> >> > stands at 217MB now.
> >> >
> >> > drwxr-xr-x 2 root root 4.0K Mar 17 14:57 .
> >> > drwxr-xr-x 8 root root 4.0K Mar 17 15:01 ..
> >> > -rw-r--r-- 1 root root    0 Mar 12 18:18 api-server.log
> >> > -rw-r--r-- 1 root root 357K Mar 17 15:06 cloud.out
> >> > -rw-r--r-- 1 root root 2.1M Mar 17 14:56 cloud.out.1
> >> > -rw-r--r-- 1 root root 217M Mar 17 15:06 cloud.out.2
> >> >
> >> > root@v-zzzz-VM:/var/log/cloud# lsof | grep cloud.out
> >> > sleep       649 root    1w      REG      202,1 226122291     181737
> >> > /var/log/cloud/cloud.out.2
> >> > sleep       649 root    2w      REG      202,1 226122291     181737
> >> > /var/log/cloud/cloud.out.2
> >> > bash       2312 root    1w      REG      202,1 226122291     181737
> >> > /var/log/cloud/cloud.out.2
> >> > bash       2312 root    2w      REG      202,1 226122291     181737
> >> > /var/log/cloud/cloud.out.2
> >> > bash       2339 root    1w      REG      202,1 226122291     181737
> >> > /var/log/cloud/cloud.out.2
> >> > bash       2339 root    2w      REG      202,1 226122291     181737
> >> > /var/log/cloud/cloud.out.2
> >> > bash       2786 root    1w      REG      202,1 226122291     181737
> >> > /var/log/cloud/cloud.out.2
> >> > bash       2786 root    2w      REG      202,1 226122291     181737
> >> > /var/log/cloud/cloud.out.2
> >> > java       2805 root    1w      REG      202,1 226122291     181737
> >> > /var/log/cloud/cloud.out.2
> >> > java       2805 root    2w      REG      202,1 226122291     181737
> >> > /var/log/cloud/cloud.out.2
> >> > java       2805 root  116w      REG      202,1    319382     181769
> >> > /var/log/cloud/cloud.out
> >> > root@v-zzzz-VM:/var/log/cloud# ls -alh
> >> >
> >> > Thanks
> >> > Saurav
> >> >
> >> >
> >> > On Tue, Mar 11, 2014 at 7:58 AM, Chiradeep Vittal <
> >> > Chiradeep.Vittal@citrix.com> wrote:
> >> >
> >> >> Yes, it was deliberate. I canĀ¹t find the discussion, but it
> >> >> revolved around a security best practice of having separate
> >> >> partitions for /, /swap, home directories
> >> >>
> >> >>
> >> >> On 3/10/14, 11:35 AM, "Marcus" <shadowsor@gmail.com> wrote:
> >> >>
> >> >> >There have been several raised, actually regarding /var/log.  As
> >> >> >for the system vm partitioning, it was explicitly changed from
> >> >> >single to multiple partitions last year. I have no idea why, but
I
> >> >> >generally don't file bugs without community discussion on things
> >> >> >that seem deliberate.
> >> >> >
> >> >> >On Sat, Mar 8, 2014 at 11:32 AM, Marcus <shadowsor@gmail.com>
> wrote:
> >> >> >> Yeah, I've just seen on busy systems where even with log
> >> >> >>rotation working  properly the little space left in var after
OS
> >> >> >>files is barely enough, for  example the conntrackd log on
a busy
> >> >> >>VPC. We actually ended up rolling our  own system vm, the
> >> >> >>existing image has plenty of space, its just locked up in 
other
> >>partitions.
> >> >> >>
> >> >> >> On Mar 8, 2014 8:58 AM, "Rajesh Battala"
> >> >> >><rajesh.battala@citrix.com>
> >> >> >>wrote:
> >> >> >>>
> >> >> >>> Yes, only 435MB is available for /var . we can increase
the
> >> >> >>> space
> >> >> also.
> >> >> >>> But we need to find out the root cause which services
are
> >> >> >>>causing the /var  to fill up.
> >> >> >>> Can you please find out and post which log files are taking
up
> >> >> >>>more space  in /var
> >> >> >>>
> >> >> >>> Thanks
> >> >> >>> Rajesh Battala
> >> >> >>>
> >> >> >>> -----Original Message-----
> >> >> >>> From: Marcus [mailto:shadowsor@gmail.com]
> >> >> >>> Sent: Saturday, March 8, 2014 8:19 PM
> >> >> >>> To: dev@cloudstack.apache.org
> >> >> >>> Subject: RE: system vm disk space issue in ACS 4.3
> >> >> >>>
> >> >> >>> Perhaps there's a new service. I know in the past we've
seen
> >> >> >>>issues with  this , specifically the conntrackd log. I
think the
> >> >> >>>cloud logs weren't  getting rolled either, but I thought
it was
> >> >> >>>all fixed.
> >> >> >>>
> >> >> >>> There's also simply not a ton of space on /var, I wish
we would
> >> >> >>>go back to  just having one partition because it orphans
lots of
> >> >> >>>free space in other  filesystems.
> >> >> >>> On Mar 8, 2014 12:37 AM, "Rajesh Battala"
> >> >> >>><rajesh.battala@citrix.com>
> >> >> >>> wrote:
> >> >> >>>
> >> >> >>> > AFAIK, log roation is enabled in the systemvm.
> >> >> >>> > Can you check whether the logs are getting zipped
.?
> >> >> >>> >
> >> >> >>> > -----Original Message-----
> >> >> >>> > From: Anirban Chakraborty [mailto:abchak@juniper.net]
> >> >> >>> > Sent: Saturday, March 8, 2014 12:46 PM
> >> >> >>> > To: dev@cloudstack.apache.org
> >> >> >>> > Subject: system vm disk space issue in ACS 4.3
> >> >> >>> >
> >> >> >>> > Hi All,
> >> >> >>> >
> >> >> >>> > I am seeing system vm disk has no space left after
running
> >> >> >>> > for few
> >> >> >>>days.
> >> >> >>> > Cloudstack UI shows the agent in v-2-VM in alert
state, while
> >> >> >>> > agent state of s-1-VM shows blank (hyphen in the
UI).
> >> >> >>> > Both the system vms are running and ssh-able from
the host.
> >> >> >>> > The log
> >> >> >>>in
> >> >> >>> > s-1-Vm shows following errors:
> >> >> >>> >
> >> >> >>> > root@s-1-VM:~# grep 'Exception' /var/log/cloud/*.*
> >> >> >>> > /var/log/cloud/cloud.out.2:java.io.IOException: No
space left
> >> >> >>> > on device
> >> >> >>> > /var/log/cloud/cloud.out.2:java.io.IOException: No
space left
> >> >> >>> > on device
> >> >> >>> >
> >> >> >>> > whereas logs in v-1-VM shows
> >> >> >>> > /var/log/cloud/cloud.out.3:java.io.IOException: No
space left
> >> >> >>> > on device
> >> >> >>> > /var/log/cloud/cloud.out.3:java.io.IOException: No
space left
> >> >> >>> > on device
> >> >> >>> > /var/log/cloud/cloud.out.3:07:18:00,547  INFO
> >> >> CSExceptionErrorCode:87
> >> >> >>> > - Could not find exception:
> >> >> >>> > com.cloud.exception.AgentControlChannelException
> >> >> >>> > in error code list for exceptions
> >> >> >>> >
> >> >> >>> >
> >> >>
> >> >> >>>/var/log/cloud/cloud.out.3:com.cloud.exception.AgentControlChann
> >> >> >>>el
> >> >> >>>Except
> >> >> >>>ion:
> >> >> >>> > Unable to post agent control request as link is not
available
> >> >> >>> >
> >> >> >>> > Looks like cloud agent is filling up the log, which
is
> >> >> >>> > leading to
> >> >> the
> >> >> >>> > disk full state.
> >> >> >>> >
> >> >> >>> > Is this a known issue? Thanks.
> >> >> >>> >
> >> >> >>> > Anirban
> >> >> >>> >
> >> >>
> >> >>
> >> >>
> >> >
> >>
> >>
> >
> >
>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message