From: "Bechauf, Michael"
Reply-To: esme-dev@incubator.apache.org
Subject: RE: Further analysis of the GC issue
Date: Thu, 26 Nov 2009 08:12:24 -0500
Message-ID: <06802A3D4BE62D449D366CE4A27D61400301167C@usphle16.phl.sap.corp>
In-Reply-To: <771905290911260103o49a051c4mbbfa82f5d408cafb@mail.gmail.com>

Thanks Markus. That certainly sounds much better. I was already confused
yesterday, because 23 GByte of memory would be a little difficult to
allocate when not even the operating system can handle that size. I should
have asked right away. Blame it on jetlag.

-Michael

-----Original Message-----
From: Markus Kohler [mailto:markus.kohler@gmail.com]
Sent: Thursday, Nov 26, 2009 1:04 AM
To: esme-dev@incubator.apache.org
Subject: Re: Further analysis of the GC issue

Hi Michael,

Good to see you here! "Memory Analyzer"? That's me ;-)

The 23 GByte are not "retained" at one point in time; they are the sum of
all temporarily allocated objects. Most of that memory (or all of it, since
there doesn't seem to be an obvious memory leak) is gone within a
millisecond. I'm confident that this value can be decreased to 90 MByte,
and further improved down to a few MByte (or even less). We already know
that the 90 MByte are mostly caused by an inefficient Textile parser.

I also used the Memory Analyzer to look at how much memory is retained,
i.e. still in use/referenced after the user interaction has finished. The
report is here:
http://cwiki.apache.org/confluence/display/ESME/Performance+test+-+2009-11-22

Although there's room for improvement, potentially caused by the same bug
that turned 90 MByte into 23 GByte, I don't see any major issues yet with
regard to memory usage.

This is also related to the stateless versus stateful discussion. ATM the
amount of state needed for one user is already very low (a few hundred
kByte), at least compared to what I'm used to with enterprise applications.
It is at least an order of magnitude lower, which can only partially be
explained by ESME being less complex than the typical enterprise app.

So far I don't see any major roadblock from the design perspective that
would stop us from scaling very well.

In my experience, it's quite normal that as soon as someone with a little
bit of experience in performance takes a closer look at a piece of
software, a few dramatic improvements can be made. That makes working as a
performance analysis expert so gratifying. You suggest a few improvements,
which have a dramatic impact, and then you walk away before it gets too
complicated ;-)

No, that's not my intention here :-)

Markus

"The best way to predict the future is to invent it" -- Alan Kay

On Thu, Nov 26, 2009 at 6:04 AM, Bechauf, Michael wrote:

> David,
>
> well, "dead wrong" is a strong expression; hopefully I'm still
> breathing. I don't want to judge without having looked at the code
> myself, but I have no idea how a massive multi-user system could
> possibly be designed with state where per-user information is kept in
> memory for a certain time. I mean, 23 GB allocated - that's tough for an
> SAP transaction server that is not multithreaded and where the memory
> management is highly optimized based on shared memory that the work
> processes can attach to, or rolled out to a file if unused for a while.
> It is, however, deadly for a VM that was never designed for such memory
> consumption and where a GC run can halt the server.
>
> Anyway, I'll study this a bit more, particularly the Scala
> architecture. I heard many good things about Scala, but in the end it's
> all translated to things a VM can understand, and I hope Scala does a
> good enough job managing this load in a transparent way.
>
> -Michael
>
>
> ----- Original Message -----
> From: David Pollak
> To: esme-dev@incubator.apache.org
> Sent: Wed Nov 25 23:00:20 2009
> Subject: Re: Further analysis of the GC issue
>
> On Wed, Nov 25, 2009 at 7:16 PM, Bechauf, Michael wrote:
>
> > Wasn't this exactly the kind of stuff that the Eclipse Memory Analyzer
> > - donated by SAP - was supposed to fix?
> > A heap of that size for a still moderate number of 300 users is crazy,
> > so either there is stuff like circular references that hog memory, or
> > the design model is fundamentally flawed. I don't understand why ESME
> > needs "sessions". How can a scalable server be created if each user
> > allocates memory until some timeout? In a world of stateless
> > browser-based UIs that's not going to work.
> >
>
> You're actually dead wrong about this. "Stateless" is not... it's just
> pushing state and cache someplace else (the RDBMS, memcached, etc.).
> "Stateless" will lead to radical performance problems. "Stateless"
> merely moves the caching decisions into code you don't control. I dealt
> with this issue first-hand while helping a popular micro-blogging site
> migrate from a "stateless" to a Scala-based backend. I'm dealing with
> this issue first-hand helping another popular site that's experiencing
> exponential growth migrate away from "push everything back to the RDBMS
> and hope for the best."
>
> My original design for ESME is stateful. It is based on lessons learned
> in this very space and was oriented to have things intelligently cached
> so that the caching is not based on RDBMS indexes. I'm not sure what
> happened to cause the particular issues, but it seems like folks are
> loading messages from the RDBMS rather than asking the message cache for
> them.
>
> >
> > Time for me to look at that code ...
> >
> > -Michael
> >
> >
> > ----- Original Message -----
> > From: Markus Kohler
> > To: esme-dev@incubator.apache.org
> > Sent: Wed Nov 25 12:14:58 2009
> > Subject: Further analysis of the GC issue
> >
> > Hi all,
> > the Garbage Collector issue I was talking about is reproducible.
> > I've uploaded an annotated GC graph to
> >
> > http://picasaweb.google.com/lh/photo/wB-RRtb0wIVfpxJkTJPNuw?authkey=Gv1sRgCOve7LThpfvXsQE&feat=directlink
> >
> > I think the "LOGON" phase, where I log on all 300 users, looks OK
> > (given that Textile formatting is probably involved), but the phase
> > where just one user sends one message is certainly not looking good.
> >
> > I took the profiler and the result is a bit shocking. For that one
> > message, 881,000,000 objects weighing 23.2 GByte were allocated (and
> > reclaimed afterwards). My former record was 2 GByte ;-)
> >
> > Fortunately I have a theory about what happens, without having looked
> > into the code yet, so take it with a grain of salt. It seems that the
> > public timeline for all users is re-rendered, because 99% of the
> > allocations come from org.apache.esme.comet.PublicTimeline.render(). I
> > guess all the actors for all the users are sitting there, not knowing
> > that the user has closed the browser, because the user session has not
> > yet expired.
> >
> > I wonder how we get around this with a real "push" model. If the
> > browser asked for updates, this rendering could be done lazily. Or can
> > we "ping" the browser and check whether it responds?
> > On the other hand, it should also not be necessary to re-render the
> > message again and again, because the result will be the same.
> >
> > I will send Richard some attachments. Not sure whether you will need
> > them; they look very similar to the ones we already have.
> >
> > BTW, we should definitely check the use of
> > scala.xml.XML$.loadString(java.lang.String).
> > It's creating a new Parser each time, which is a bit costly because it
> > allocates a new Buffer each time and also hits the disk when searching
> > for the name of the Java class.
> >
> > Greetings,
> > Markus
> >
> >
> > "The best way to predict the future is to invent it" -- Alan Kay
> >
>
>
> --
> Lift, the simply functional web framework http://liftweb.net
> Beginning Scala http://www.apress.com/book/view/1430219890
> Follow me: http://twitter.com/dpp
> Surf the harmonics
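
The observation in the original analysis, that re-rendering the same message for every user's timeline is unnecessary because the result is always the same, can be illustrated with a small memoizing cache. This is only a sketch under stated assumptions, not ESME's actual code: the RenderCache class, the numeric message ids, and the renderer function are all hypothetical, and the trivial lambda stands in for an expensive step such as Textile parsing. Java is used here so the sketch runs on a plain JDK; the same idea carries over to Scala.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

/** Hypothetical cache: keeps rendered markup per message id so each message is rendered once. */
final class RenderCache {
    private final ConcurrentHashMap<Long, String> rendered = new ConcurrentHashMap<>();
    private final Function<String, String> renderer;
    private final AtomicInteger renderCalls = new AtomicInteger();

    RenderCache(Function<String, String> renderer) {
        this.renderer = renderer;
    }

    /** Returns the cached markup for messageId, invoking the renderer only on a cache miss. */
    String render(long messageId, String rawText) {
        return rendered.computeIfAbsent(messageId, id -> {
            renderCalls.incrementAndGet();
            return renderer.apply(rawText);
        });
    }

    /** Number of times the underlying renderer actually ran. */
    int renderCalls() {
        return renderCalls.get();
    }
}

final class RenderCacheDemo {
    public static void main(String[] args) {
        // Stand-in for an expensive markup step (assumption, not the real Textile parser).
        RenderCache cache = new RenderCache(text -> "<p>" + text + "</p>");

        // Simulate 300 user timelines all displaying the same new message.
        for (int user = 0; user < 300; user++) {
            cache.render(42L, "hello");
        }
        System.out.println(cache.renderCalls() + " render call(s) for 300 timelines");
    }
}
```

With this shape, the cost of one new message is one render plus 300 map lookups, rather than 300 renders. A real cache would also need an eviction policy so that old messages do not accumulate; that is left out of the sketch.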