Return-Path: Delivered-To: apmail-db-derby-user-archive@www.apache.org Received: (qmail 30792 invoked from network); 1 Apr 2008 18:01:06 -0000 Received: from hermes.apache.org (HELO mail.apache.org) (140.211.11.2) by minotaur.apache.org with SMTP; 1 Apr 2008 18:01:06 -0000 Received: (qmail 36286 invoked by uid 500); 1 Apr 2008 18:01:05 -0000 Delivered-To: apmail-db-derby-user-archive@db.apache.org Received: (qmail 36259 invoked by uid 500); 1 Apr 2008 18:01:05 -0000 Mailing-List: contact derby-user-help@db.apache.org; run by ezmlm Precedence: bulk list-help: list-unsubscribe: List-Post: List-Id: Reply-To: "Derby Discussion" Delivered-To: mailing list derby-user@db.apache.org Received: (qmail 36248 invoked by uid 99); 1 Apr 2008 18:01:05 -0000 Received: from athena.apache.org (HELO athena.apache.org) (140.211.11.136) by apache.org (qpsmtpd/0.29) with ESMTP; Tue, 01 Apr 2008 11:01:05 -0700 X-ASF-Spam-Status: No, hits=-2.8 required=10.0 tests=RCVD_IN_DNSWL_MED,SPF_NEUTRAL X-Spam-Check-By: apache.org Received-SPF: neutral (athena.apache.org: 32.97.182.146 is neither permitted nor denied by domain of Stan.Bradbury@gmail.com) Received: from [32.97.182.146] (HELO e6.ny.us.ibm.com) (32.97.182.146) by apache.org (qpsmtpd/0.29) with ESMTP; Tue, 01 Apr 2008 18:00:22 +0000 Received: from d01relay07.pok.ibm.com (d01relay07.pok.ibm.com [9.56.227.147]) by e6.ny.us.ibm.com (8.13.8/8.13.8) with ESMTP id m31I2a8R007344 for ; Tue, 1 Apr 2008 14:02:36 -0400 Received: from d03av01.boulder.ibm.com (d03av01.boulder.ibm.com [9.17.195.167]) by d01relay07.pok.ibm.com (8.13.8/8.13.8/NCO v8.7) with ESMTP id m31I0NmV1871900 for ; Tue, 1 Apr 2008 14:00:24 -0400 Received: from d03av01.boulder.ibm.com (loopback [127.0.0.1]) by d03av01.boulder.ibm.com (8.12.11.20060308/8.13.3) with ESMTP id m31I0NJ8000513 for ; Tue, 1 Apr 2008 12:00:23 -0600 Received: from [127.0.0.1] (sig-9-48-104-190.mts.ibm.com [9.48.104.190]) by d03av01.boulder.ibm.com (8.12.11.20060308/8.12.11) with ESMTP id m31I0Ll4032760 for ; Tue, 1 Apr 2008 12:00:22 -0600 Message-ID: <47F27834.4020701@gmail.com> Date: Tue, 01 Apr 2008 11:00:20 -0700 From: Stanley Bradbury User-Agent: Thunderbird 2.0.0.12 (Windows/20080213) MIME-Version: 1.0 To: Derby Discussion Subject: Re: ERROR XSDG2: Invalid checksum on Page Page(0,Container(0, 1313)) References: <47F02F3E.40807@nuix.com> <47F065D0.6010404@sun.com> <47F074A0.40309@nuix.com> <47F16AD7.6090101@gmail.com> <47F1AEF3.8080603@nuix.com> In-Reply-To: <47F1AEF3.8080603@nuix.com> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit X-Virus-Checked: Checked by ClamAV on apache.org David Sitsky wrote: > Hi Stanley, > >> The problem is happening on the Page 0 [the first page] of >> conglomerate 1313 (the conglomerateId) - you can see what table/index >> this corresponds to with the following query: >> >> select CONGLOMERATENUMBER, CONGLOMERATENAME >> from sys.sysconglomerates >> where conglomeratenumber = 1313; >> >> From the errors in your log I suspect this to be the table or one of >> the indexes of the failing query listed: >> INSERT INTO text_table (guidhigh, guid, data) > > I ran ij from the command-line, and connected to one of the > problematic databases after killing the load. Sure enough 1313 refers > to the primary key of text_table (there are no other indexes defined > on this table). This is what is output: > > CONGLOMERATENUMBER |CONGLOMERATENAME > > -------------------------------------------------------------------------------- > > --------------------------------------------------------------------- > 1313 |SQL080331145004520 > > This is the SQL we used to create this table: > > CREATE TABLE text_table (guidhigh BIGINT NOT NULL, > guid BIGINT NOT NULL, > data BLOB (1G) NOT NULL, > PRIMARY KEY (guidhigh, guid)) > > What is interesting is if I perform a simple query on text_table, I > get the following: > > ij> select count(*) from text_table; > ERROR XSDG2: Invalid checksum on Page Page(0,Container(0, 1313)), > expected=304,608,373, on-disk version=2,462,088,751, page dump > follows: Hex dump: > 00000000: 0076 0000 0001 0000 0000 0000 27ea 0000 .v.............. > 00000010: 0000 0006 0000 0000 0000 0000 0000 0000 ................ > 00000020: 0000 0000 0001 0000 0000 0000 0000 0000 ................ > .... > > Same as my application. I have also noticed the connection is > automatically closed, as is the case in my application if I try and > perform another operation. > > ij> select count(*) from text_table; > ERROR 08003: No current connection. > > Reconnecting to the database has the same result. > > So I guess it is not surprising now that conglomerate 1313 is always > returned since all worker processes create the same database on > startup. However the corruption is always happening on page 0. If we > were experiencing true disk problems, I'd expect the page number to be > random across the machines. > > Any ideas on what to try next? > Hi Dave - The good news is that it looks like you do not have to recover this database from backups because: 1) since you can boot the database using IJ the corruption is not in the transaction log (I assume you did not directly modify or delete any files in the database directory tree in order to bypass recovery - *yes, I have seen people do this*) 2) since the corruption is in an index you should be able to drop and recreate the index thus restoring the index without any data loss. I also have an idea that might get around it. Since the problem is with an index and during a data load (I am assuming the load is a one time database initialization streaming lots of data?) then I would try creating the index AFTER loading the data rather than declaring the index in the table Create statement. If the problem involves handling a large number of database pages and the index pages at the same time separating the activities will avoid it. Doing this also insures that index statistics are created (for more info on this see: http://wiki.apache.org/db-derby/CheckingForIndexStatistics) As for the cause of the problem I can't be sure but since this happens repeatedly and with different physical files it sounds like the problem is not with the physical disk surface (I think the RAID architecture helps protect against this type of failure anyway). I'd like to see this problem captured in an JIRA entry. You have provided a lot of good information here and the results of performing a load then creating the Primary Key will narrow it down more. My gut-level feel is that reproducing this will be hard without an environment similar to yours. With the multiple processors and the RAID array I imagine it is a pretty fast machine and this may be exposing a thread synchronization issue in Derby (or not). Having your information and stack traces may give the stores folks enough information to perform a code inspection for this or other issues. Would you file a JIRA describing your processing, the machine environment, the JVM and attach your derby log files?