From: Simon Helsen <shelsen@ca.ibm.com>
To: jena-dev@incubator.apache.org
Date: Mon, 8 Aug 2011 12:34:37 -0400
Subject: Re: testing TDB-Tx

Andy,

thanks for the fix. With regard to the OME, that is a very strange one. I
first ran some tests using the IBM JRE 6 and the Eclipse Memory Analyzer
(see
http://wiki.eclipse.org/index.php/MemoryAnalyzer#System_Dumps_and_Heap_Dumps_from_IBM_Virtual_Machines
if you're interested). I noticed that my 2Gb heap dump contained only
about 110Mb of real data. My initial max heap was set to 1.5Gb and the
stack to 512Mb. I am not quite sure how that is possible, unless 1) the
problem sits in native memory, or 2) there is a sudden spike in allocation
which does not give the GC time to collect before the JVM bails.
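For anyone trying to capture the same dumps: on the Sun/Oracle JVM a heap
dump can be requested automatically at the moment the OME is thrown, and
the IBM JRE has an equivalent dump agent. The option lines below are
illustrative only; the dump path is a placeholder:

  # HotSpot (Sun/Oracle JRE 6)
  -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/dumps

  # IBM JRE 6 (J9 dump agent)
  -Xdump:heap:events=systhrow,filter=java/lang/OutOfMemoryError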
My next test was to use Oracle's JRE 6, which did not start at first
because the permanent generation wasn't large enough, so I increased
MaxPermSize to 256Mb (the default is 64Mb). Then, around the same point in
the run, I also ran into an OME, but I don't seem to get a useful heap
dump to analyze; I suspect it would show something similar to what I saw
with the IBM JRE. I also tried just increasing the heap size from 1.5Gb to
3Gb (I have 8 total on my machine), and although that delays the problem,
it does not actually avoid it. In fact, the moment the memory spikes (it
happens really fast), my entire machine freezes, as all my memory is more
or less gone.

A few things about the unit tests I am using: we have quite a few tests
which "disconnect" and "connect" to the Jena dataset by opening and
closing it (even moving the indices physically around, etc.). I suspect
the problem is related to this, but unfortunately I cannot pin it to one
particular test, since I can "move" the problem along by skipping tests.

I should add that this entire OME problem is a regression from our point
of view, since we did not observe anything like it in 0.8.7. I also have
to emphasize that this is still without using the transaction API. I may
give it a shot with the new API, but I doubt it will resolve the issue.

This brings me to a question: is calling reset on the StoreProvider
supposed to release all resources and allow me to remove the indices from
disk?

thanks,

Simon


From: Andy Seaborne
To: jena-dev@incubator.apache.org
Date: 08/08/2011 11:20 AM
Subject: Re: testing TDB-Tx

On 05/08/11 19:01, Simon Helsen wrote:
> I tested my stuff in mapped mode, which did not show the problem, so the
> issue I encountered is specific to direct mode. IMO the code below
> contains the problem and needs to be fixed with a call to
> blockMgrCache.getWrite (on the wrapped BlockMgr) whenever there is a
> cache miss.
>
> @Andy: could you fix this for the next build?

Done in SVN. We've lost the repo (AWS ate the server) [temporarily], so
for a bit it's "svn update; mvn clean package".

> I still hit the OME though. I will try to analyze the stack dumps to see
> if there is anything special. When I hit the OME, it comes very quickly,
> i.e. in a matter of seconds my entire heap space is exhausted, from
> regular heap usage before.

So if I read the thread right, calling BulkUpdateHandlerTDB.removeAll
still goes bad sometimes.

        Andy

> Simon
>
> From: Simon Helsen/Toronto/IBM
> To: jena-dev@incubator.apache.org
> Cc: jena-dev@incubator.apache.org
> Date: 08/05/2011 01:27 PM
> Subject: Re: testing TDB-Tx
>
> ------------------------------------------------------------------------
>
> Ok, so I looked at the code in BlockMgrCache and I notice that getWrite
> is implemented like this:
>
>     @Override
>     synchronized
>     public Block getWrite(long _id)
>     {
>         Long id = Long.valueOf(_id) ;
>         Block blk = null ;
>         if ( writeCache != null )
>             blk = writeCache.get(id) ;
>         if ( blk != null )
>         {
>             cacheWriteHits++ ;
>             log("Hit(w->w) : %d", id) ;
>             return blk ;
>         }
>
>         // blk is null.
>         // A requested block may be in the other cache. Promote it.
>         if ( readCache.containsKey(id) )
>         {
>             blk = readCache.get(id) ;
>             cacheReadHits++ ;
>             log("Hit(w->r) : %d", id) ;
>             blk = promote(blk) ;
>             return blk ;
>         }
>
>         // Did not find.
>         cacheMisses++ ;
>         log("Miss/w: %d", id) ;
>         if ( writeCache != null )
>             writeCache.put(id, blk) ;
>         return blk ;
>     }
>
> Now, in my particular case, the id that comes in is 0, but neither cache
> contains the value. In this case, it puts the entry {0 = null} in the
> write cache, which necessarily leads to the NPE in the caller. So I am
> not quite following the logic here. I would expect that, on a cache
> miss, the wrapped block manager would be used to obtain the block before
> it is written to the writeCache.
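>
> Something like the following is the shape I would expect the miss branch
> to take. This is only a sketch on my side: the name of the wrapped
> manager field and the exact error handling are my guesses, not the
> actual TDB code.
>
>         // Miss in both caches: fetch the block from the wrapped
>         // BlockMgr before caching it, so the caller never sees null.
>         cacheMisses++ ;
>         log("Miss/w: %d", id) ;
>         blk = blockMgr.getWrite(_id) ;   // 'blockMgr' = the wrapped manager (assumed field name)
>         if ( blk == null )
>             throw new BlockException("Block not found: " + id) ;   // assumed exception type
>         if ( writeCache != null )
>             writeCache.put(id, blk) ;
>         return blk ;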
>
> Simon
>
> From: Simon Helsen/Toronto/IBM@IBMCA
> To: jena-dev@incubator.apache.org
> Cc: jena-dev@incubator.apache.org
> Date: 08/05/2011 12:01 PM
> Subject: Re: testing TDB-Tx
>
> ------------------------------------------------------------------------
>
> Paolo,
>
> I don't know who wrote the code, but it would help if a first analysis
> were done with the stack trace I provided; perhaps other questions can
> also help identify the problem and a possible fix. Producing sharable
> code which reproduces the problem is not trivial and may not even be
> possible, since we run in a rather complex framework. If possible, I
> will try to debug it myself from within our framework, but obviously I
> have limited knowledge of the details of the PageBlockMgr.
>
> All the instances of this stack trace (and I am seeing quite a few of
> them) seem to come from BulkUpdateHandlerTDB.removeAll, but I know that
> removeAll initially works fine (until the NPE occurs the first time; it
> seems that after the first time, it keeps happening). I will also try to
> isolate the problem more, to see if there is anything specific that
> brings the store into this situation.
>
> thanks
>
> Simon
>
>
> From: Paolo Castagna
> To: jena-dev@incubator.apache.org
> Date: 08/05/2011 10:46 AM
> Subject: Re: testing TDB-Tx
>
> Hi Simon,
> I don't have an answer or a solution to your problem, but I want to
> thank you for reporting your experience (and the problems you found) on
> jena-dev.
>
> It would be extremely helpful if you could reproduce the problem with
> some sharable code we can run and debug. I know, I know... it's not
> always easy, or even possible.
>
> I hit a problem using TestTransSystem.java, which runs multiple threads
> and is not easy to replicate.
>
> Thanks again, and keep sharing on jena-dev; this way everybody can
> benefit.
>
> Cheers,
> Paolo
>
> Simon Helsen wrote:
> > Hi everyone,
> >
> > I am giving a first stab at integrating TDB-Tx into our framework. My
> > first goal is to test this new TDB *without* actually using the
> > transaction API, because we are coming from TDB 0.8.7. After some
> > minor problems on our end, I seem to run into the following NPE
> > (usually followed by a warning):
> >
> > 09:49:02,176 [jazz.jfs.suspending.indexer.internal.triple] ERROR
> >   com.ibm.team.jfs - CRJZS5663E Unable to persist tripe index
> > java.lang.NullPointerException
> >   at com.hp.hpl.jena.tdb.base.page.PageBlockMgr.getWrite(PageBlockMgr.java:50)
> >   at com.hp.hpl.jena.tdb.index.bplustree.BPTreeNode.getMgrWrite(BPTreeNode.java:162)
> >   at com.hp.hpl.jena.tdb.index.bplustree.BPTreeNode.get(BPTreeNode.java:145)
> >   at com.hp.hpl.jena.tdb.index.bplustree.BPTreeNode.delete(BPTreeNode.java:227)
> >   at com.hp.hpl.jena.tdb.index.bplustree.BPlusTree.deleteAndReturnOld(BPlusTree.java:324)
> >   at com.hp.hpl.jena.tdb.index.bplustree.BPlusTree.delete(BPlusTree.java:318)
> >   at com.hp.hpl.jena.tdb.index.TupleIndexRecord.performDelete(TupleIndexRecord.java:55)
> >   at com.hp.hpl.jena.tdb.index.TupleIndexBase.delete(TupleIndexBase.java:61)
> >   at com.hp.hpl.jena.tdb.index.TupleTable.delete(TupleTable.java:108)
> >   at com.hp.hpl.jena.tdb.graph.BulkUpdateHandlerTDB.removeWorker(BulkUpdateHandlerTDB.java:136)
> >   at com.hp.hpl.jena.tdb.graph.BulkUpdateHandlerTDB.removeAll(BulkUpdateHandlerTDB.java:90)
> >   at com.hp.hpl.jena.rdf.model.impl.ModelCom.removeAll(ModelCom.java:315)
> >   ...
> > 09:49:02,207 [jazz.jfs.suspending.indexer.internal.triple] WARN
> >   com.hp.hpl.jena.tdb.base.block.BlockMgrCache - Write cache: 0
> >   expelling entry that isn't there
> >
> > The exception sits all over my log, and I wonder if it is related to
> > the removeAll. Also, after a while, my memory spikes and I run into an
> > OME. I don't know yet if there is a relation, but possibly these
> > exceptions cause serious leaks.
> >
> > The version of TDB (and associated libs) I am using is
> > tx-tdb-0.9.0-20110802.083904-6
> >
> > thanks,
> >
> > Simon
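P.S. Paolo, regarding sharable code: outside our framework, the shape of
what our tests do is roughly the sketch below. Everything in it is
simplified; the sizes, paths and iteration counts are made up, and I have
not yet confirmed that this standalone version reproduces either the NPE
or the OME. Our tests also move index files around on disk between cycles,
which is omitted here.

import com.hp.hpl.jena.query.Dataset;
import com.hp.hpl.jena.rdf.model.Model;
import com.hp.hpl.jena.rdf.model.ResourceFactory;
import com.hp.hpl.jena.tdb.TDBFactory;

public class ConnectDisconnectTest {
    public static void main(String[] args) {
        String dir = "/tmp/tdb-test";   // placeholder location
        for (int cycle = 0; cycle < 1000; cycle++) {
            // "connect": open the dataset
            Dataset ds = TDBFactory.createDataset(dir);
            Model m = ds.getDefaultModel();
            // populate with some triples
            for (int i = 0; i < 10000; i++) {
                m.add(ResourceFactory.createResource("urn:x:s" + i),
                      ResourceFactory.createProperty("urn:x:p"),
                      ResourceFactory.createResource("urn:x:o" + i));
            }
            // bulk removal -- the path that shows the NPE in our logs
            m.removeAll();
            // "disconnect": close the dataset
            ds.close();
        }
    }
}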