Subject: Re: JCS remote server
From: Niall Gallagher
To: JCS Users List
Organization: Switchfire Ltd.
Date: Fri, 11 Apr 2008 18:31:51 +0100

Hi Josh,

I couldn't access your link, connection refused.

I'll be out of the office until next Wednesday so I hope you have some
success by then.

Kind regards,

Niall

On Thu, 2008-04-10 at 15:32 -0400, Joshua Szmajda wrote:
> Ok, caught one! The logs are pretty big, so I put them up here:
> http://loki.ws/~josh/restart-20080410.tar.bz2
>
> I'm really not sure what caused it, it seems to have happened a little
> more quickly than usual.
>
> Does it seem to be the GC? I can't tell, I can try adding in those GC
> tuning things but I don't want to jump the gun and change too many
> variables at once. I'll add in the tracing Al suggested though at least.
>
> Thanks!
> -Josh
>
> Joshua Szmajda wrote:
> > I'd been deleting the logs, so I don't have one right now ><. I did
> > change my scripts to save them though. As soon as it happens again
> > I'll have some data. It seems to take about a week or so of running
> > from a fresh start before I start to get problems.
> >
> > Niall: thanks for the explanation. I figured they were probably Byte
> > arrays, but then I saw the Strings and that threw me off :).
> >
> > Anyway as soon as I get some real data I'll post it to the list.
> >
> > Thanks all!
> > -Josh
> >
> > Aaron Smuts wrote:
> >> Do you have any of the cache logs when this is
> >> happening?
> >>
> >> I would turn the memory shrinker off (set the property
> >> to false), as a start. I generally don't run with the
> >> memory shrinker on. But I'm shooting in the dark.
> >>
> >> Aaron
> >>
> >>
> >> --- Joshua Szmajda wrote:
> >>
> >>> Ahh yes of course, it was the user requirement. Now
> >>> I have a nice bunch of data.
> >>> This is interesting, but I'm not sure what the [B class is:
> >>>
> >>>  num   #instances     #bytes  class name
> >>> --------------------------------------
> >>>   1:       31419  284852480  [B
> >>>   2:        2277   19760264  [I
> >>>   3:       57834    3865240  [C
> >>>   4:       29628    1896192  org.apache.jcs.engine.ElementAttributes
> >>>   5:       57838    1388112  java.lang.String
> >>> ...
> >>>
> >>> Niall Gallagher wrote:
> >>>> Hmm :D
> >>>>
> >>>> I just did a bit of digging. I've used this script on a few of our
> >>>> servers in the past (32 and 64bit server VMs), but I just found a server
> >>>> which gave me the exact same error message you got. That server it turns
> >>>> out runs Java under a different user account to the one I was logged
> >>>> into however.
> >>>>
> >>>> Try running the script from the exact same user account the JVM process
> >>>> is running from. Even running from root didn't work for me
> >>>> on that server, it had to be the exact same user account, which is
> >>>> surprising.
> >>>>
> >>>> By the way those tools are documented here:
> >>>> http://java.sun.com/j2se/1.5.0/docs/tooldocs/share/jmap.html
> >>>> and
> >>>> http://java.sun.com/j2se/1.5.0/docs/tooldocs/share/jstack.html
> >>>> -basically they're supposed to work on most platforms except Windows and
> >>>> Linux Itanium so unless you've got Itanium cpus it should work for you.
> >>>>
> >>>> On Wed, 2008-04-09 at 14:44 -0400, Joshua Szmajda wrote:
> >>>>> Hey Niall,
> >>>>>
> >>>>> Thanks for your script, but I'm getting these errors:
> >>>>>
> >>>>> ./capture-diagnostics.sh RemoteCacheServerFactory
> >>>>> Capturing diagnostics for Java process "RemoteCacheServerFactory" (pid
> >>>>> 2007)...
> >>>>> 2007: Unable to open socket file: target process not responding or
> >>>>> HotSpot VM not loaded
> >>>>> The -F option can be used when the target process is not responding
> >>>>> 2007: Unable to open socket file: target process not responding or
> >>>>> HotSpot VM not loaded
> >>>>> The -F option can be used when the target process is not responding
> >>>>> Saved diagnostics for "RemoteCacheServerFactory" to
> >>>>> "RemoteCacheServerFactory-diagnostics.txt"
> >>>>>
> >>>>> There must be something I'm missing when I'm running the cache server. I
> >>>>> noticed it uses the 'server' VM by default, maybe these debug commands
> >>>>> are only good for the client VM?
> >>>>>
> >>>>> Thanks!
> >>>>> -Josh
> >>>>>
> >>>>> Niall Gallagher wrote:
> >>>>>> Hi Josh,
> >>>>>>
> >>>>>> Can you modify your cron job to capture diagnostics before it restarts
> >>>>>> the cache server?
> >>>>>>
> >>>>>> Then you can post the diagnostics next time it happens. The script below
> >>>>>> will capture diagnostics for you. We use something like this in-house
> >>>>>> for troubleshooting (not specifically for JCS).
> >>>>>>
> >>>>>> You'll first have to run the JDK 'jps' command from either root, or the
> >>>>>> user account which runs your cache server instance. This gives you the
> >>>>>> "name" of your cache server JVM process, which you need to supply to the
> >>>>>> diagnostics script as command-line parameter. The script uses the name
> >>>>>> to attach to the relevant JVM process.
> >>>>>>
> >>>>>> I don't know what might be causing the problem for you. It could be a
> >>>>>> bug in JCS, or it could be a memory issue.
> >>>>>> The diagnostics will help identify the problem.
> >>>>>>
> >>>>>> Save this as "capture-diagnostics.sh"...
> >>>>>> -------
> >>>>>> #!/bin/sh
> >>>>>> # Saves the stack traces and class memory usage information for a
> >>>>>> # Java process running on the machine to a diagnostics file.
> >>>>>> #
> >>>>>> # This script expects the name of the relevant Java process to be
> >>>>>> # specified as a parameter. The name specified should match a Java
> >>>>>> # process name as listed by running the JDK 'jps' command.
> >>>>>> #
> >>>>>> # Usage: sh capture-diagnostics.sh <name of Java process>
> >>>>>> APP_NAME="$1"
> >>>>>> JDK_LOCATION="/usr/java/default"
> >>>>>> DUMP_FILE="$APP_NAME-diagnostics.txt"
> >>>>>>
> >>>>>> APP_PID="`$JDK_LOCATION/bin/jps|grep $APP_NAME 2> /dev/null|cut -d' ' -f1`"
> >>>>>> if [ "$APP_PID" = "" ]; then
> >>>>>>     echo "ERROR: Can't determine pid of Java process name specified \"$APP_NAME\""
> >>>>>>     echo "Usage: sh capture-diagnostics.sh <name of Java process as listed by jps command>"
> >>>>>>     exit 20
> >>>>>> fi
> >>>>>> echo "Capturing diagnostics for Java process \"$APP_NAME\" (pid $APP_PID)..."
> >>>>>> echo -e "Diagnostics for Java process \"$APP_NAME\" (pid $APP_PID) as at `date`:-" >> $DUMP_FILE
> >>>>>> echo -e "\nTop 30 memory-consuming classes:-" >> $DUMP_FILE
> >>>>>> $JDK_LOCATION/bin/jmap -histo:live $APP_PID |head -n33 >> $DUMP_FILE
> >>>>>> echo -e "\nThread stack traces:-" >> $DUMP_FILE
> >>>>>> $JDK_LOCATION/bin/jstack $APP_PID >> $DUMP_FILE
> >>>>>> echo -e "\n" >> $DUMP_FILE
> >>>>>> echo "Saved diagnostics for \"$APP_NAME\" to \"$DUMP_FILE\""
> >>>>>> -------
> >>>>>>
> >>>>>>
> >>>>>> On Wed, 2008-04-09 at 10:11 -0400, Joshua Szmajda wrote:
> >>>>>>> Hey all,
> >>>>>>>
> >>>>>>> I've got a JCS remote cache server running on a machine and every now
> >>>>>>> and then it will spiral out of control and lock the machine. I have no
> >>>>>>> idea yet what's causing this, I've just put some extra measures in place
> >>>>>>> to capture the logs from when it happens. My solution at this point is a
> >>>>>>> cron job that checks now and then for excessive cpu usage and restarts
> >>>>>>> the cache server. I'd like to be able to not worry about it, though :).
> >>>>>>>
> >>>>>>> Any suggestions?
> >>>>>>>
> >>>>>>> Thanks!
> >>>>>>> -Josh
> >>>>>>>
> >>>>>>> P.S. it's running on ubuntu-server (kernel 2.6.22-14-server).
> >>>>>>> I have up to 16 remote listeners connecting to any given region.
> >>>>>>> (probably 20 application instances in all).
> >>>>>>> Puts grow at a rate of about 400 per second.
> >>>>>>> I pass these options to java: "-Xms128m -Xmx2000m"
> >>>>>>>
> >>>>>>> And here's my simple remote.cache.ccf:
> >>>>>>>
> >> === message truncated ===
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: jcs-users-unsubscribe@jakarta.apache.org
> >> For additional commands, e-mail: jcs-users-help@jakarta.apache.org