db-derby-dev mailing list archives

From mike matrigali <mikema...@gmail.com>
Subject Re: Derby received an error "ERROR XSDG0: Page Page(1325564,Container(0, 30832)) could not be read from disk."
Date Sat, 05 Sep 2015 01:34:41 GMT
If it is not truncated and is still available when you get the db, a complete derby.log
containing the error would be interesting - assuming it does not have anything that cannot
be shared.

Derby does not handle I/O errors very well, and its shutdown mechanism is definitely not
clean.  Often what happens is that store encounters an I/O error and has no idea what to do
at that point.  Its default is to mark some key modules null, so that it knows no update
action can take place, fail the current transaction, and try to shut down the whole system.
Once the system is shut down, reboot recovery is counted on to fix anything that was
encountered, assuming nothing on disk was really corrupted.  This sounds like what happened
in your case, but I would always run a consistency check if there is a problem.
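
If you do get a window on a copy of the db, the usual approach is
SYSCS_UTIL.SYSCS_CHECK_TABLE over every user table - something like this rough sketch
(untested, and the connection URL is made up):

    import java.sql.*;

    // Rough sketch: run SYSCS_UTIL.SYSCS_CHECK_TABLE against every user
    // table ('T' in systables).  It returns 1 on success and raises an
    // error if it finds an inconsistency, so a clean run means ok.
    public class CheckAllTables {
        public static void main(String[] args) throws Exception {
            try (Connection conn =
                     DriverManager.getConnection("jdbc:derby:copyOfDb");
                 Statement s = conn.createStatement();
                 ResultSet rs = s.executeQuery(
                     "SELECT schemaname, tablename, "
                     + "SYSCS_UTIL.SYSCS_CHECK_TABLE(schemaname, tablename) "
                     + "FROM sys.sysschemas sc, sys.systables t "
                     + "WHERE sc.schemaid = t.schemaid AND t.tabletype = 'T'")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1) + "." + rs.getString(2)
                                       + " -> " + rs.getInt(3));
                }
            }
        }
    }

It reads every page of every conglomerate, so on a db your size expect it to run for quite
a while.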

/mikem

On 9/4/2015 7:56 AM, Bergquist, Brett wrote:
> Thanks for the input!
>
> There is no possibility of running the consistency check on the customer's database on
> their system, as it needs to be running 24x7 and cannot be taken down.  As far as I can tell
> at this point, the database came back up ok after the restart and is operating normally.
>
> I am able to get a copy of the database via a file system backup that occurs each night.
> Using ZFS allows us to do this by freezing the database (using the FREEZE derby call), doing
> a ZFS snapshot of the file system, unfreezing the database (using the UNFREEZE derby call),
> and then accessing the ZFS snapshot to make the file system backup (roughly the sketch
> below).  It takes me a couple of days to get all of the database transferred, but then I can
> stage it locally and run a consistency check on the local copy.
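>
> In code, the window is roughly this (the dataset name and db name are made up, and the
> zfs command goes through the shell):
>
>     import java.sql.*;
>
>     // Rough sketch of the nightly window: freeze derby, take the ZFS
>     // snapshot, unfreeze.  "tank/derby" and the db name are made up.
>     public class FreezeSnapshotBackup {
>         public static void main(String[] args) throws Exception {
>             try (Connection conn =
>                      DriverManager.getConnection("jdbc:derby:ourDb");
>                  Statement s = conn.createStatement()) {
>                 s.execute("CALL SYSCS_UTIL.SYSCS_FREEZE_DATABASE()");
>                 try {
>                     // the snapshot is near-instant, keeping the freeze short
>                     Process p = new ProcessBuilder("zfs", "snapshot",
>                             "tank/derby@nightly").inheritIO().start();
>                     if (p.waitFor() != 0)
>                         throw new RuntimeException("zfs snapshot failed");
>                 } finally {
>                     s.execute("CALL SYSCS_UTIL.SYSCS_UNFREEZE_DATABASE()");
>                 }
>             }
>         }
>     }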
>
> I will open a JIRA on the NullPointerExceptions that were reported after Derby did its
> shutdown, as Bryan suggested.
>
> For some background, the database is used in a telecommunications environment, being
> the persistent storage for the configuration of about 90K pieces of network equipment, and
> it receives about 10M monitoring updates per day, 24x7.  The database has been around for
> about 8 years, continually growing, with Derby being upgraded along the way; it is currently
> at 10.10.2.0.  We also do a poor man's partitioning: we have 53 database tables, one for
> each week of the year; our 10M inserts are directed to the correct table for the week of
> the year; and queries are built on those weekly tables as well, with a VIEW created as a
> UNION query across all 53 tables when needed to process queries that span weeks (the shape
> of the scheme is sketched below).  We needed to do this because there was no practical way
> of deleting older data while simultaneously inserting data into the table at the rate of
> 10M/day without running into database performance issues and locking contention, while also
> getting the deletions done in a reasonable amount of time and recovering and reusing the
> freed database space.  Now we simply truncate the tables that are to be purged, which is
> nearly instantaneous.  At some point I may investigate, and contact the group here about,
> how one might implement a real partitioning scheme that would be more efficient, especially
> for queries, and add this capability back into Derby - so if anyone has any ideas on this,
> I am all ears.
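>
> The shape of the scheme, cut down to 3 weekly tables instead of 53 (all names here are
> made up):
>
>     import java.sql.*;
>
>     // Sketch of the poor man's partitioning, cut down to 3 weekly tables.
>     // Inserts go to the current week's table; cross-week queries use the
>     // UNION ALL view; purging a week is a near-instant TRUNCATE.
>     public class WeeklyPartitions {
>         public static void main(String[] args) throws Exception {
>             try (Connection conn =
>                      DriverManager.getConnection("jdbc:derby:ourDb");
>                  Statement s = conn.createStatement()) {
>                 StringBuilder union = new StringBuilder();
>                 for (int wk = 1; wk <= 3; wk++) {
>                     s.execute("CREATE TABLE MONITOR_WK" + wk
>                             + " (EQUIP_ID INT, TS TIMESTAMP, VAL DOUBLE)");
>                     if (wk > 1) union.append(" UNION ALL ");
>                     union.append("SELECT * FROM MONITOR_WK").append(wk);
>                 }
>                 s.execute("CREATE VIEW MONITOR_ALL AS " + union);
>                 s.execute("TRUNCATE TABLE MONITOR_WK1");  // purge a week
>             }
>         }
>     }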
>
> Brett
>
> -----Original Message-----
> From: mike matrigali [mailto:mikemapp1@gmail.com]
> Sent: Friday, September 04, 2015 12:37 AM
> To: derby-dev@db.apache.org
> Subject: Re: Derby received an error "ERROR XSDG0: Page Page(1325564,Container(0, 30832))
> could not be read from disk."
>
> I agree with all of bryan's suggestions.  If you can't get access to the actual db, there
> is not much to be done.  My usual customer support answer to this situation would be to tell
> you to shut down the db and run a consistency check on it, which would read every page of
> the table and would certainly run into the error you got eventually, if there is a
> persistent problem.
> Given the size of the db, and the fact that derby has no optimizations for dbs of this
> size, that is likely to take some time.
>
>  From the stack I can tell you that the problem is in a base page, not an index - which
> is much harder to fix if it is persistent.  In derby dbs, the output Container(0, 30832)
> says the container is in segment 0 (the seg0 directory) and the container id is 30832
> (I am impressed by the number of containers that db has gone through).  You will also see
> the system catalogs talk about conglomerate numbers; in derby there is currently always a
> 1-1 mapping of conglomerate number to container number.
> Ancient history: in cloudscape we thought we might need the abstraction, but it was a
> pain to do the mapping at the lowest level, so when we redid the arch we took the
> opportunity to make it 1-1 for "now" while still allowing a map if anyone wanted to add
> one in the future.
> And here is a note from bryan, from 6 years back, on how to go from that number in the
> error to a file name and table name:
> http://bryanpendleton.blogspot.com/2009/09/whats-in-those-files-in-my-derby-db.html
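>
> If I remember the naming right, the file is just the container id in hex with a 'c'
> prefix and a '.dat' suffix, and sysconglomerates maps the number to the owning table.
> A rough sketch (untested, db name made up):
>
>     import java.sql.*;
>
>     // Rough sketch: map Container(0, 30832) to its file under seg0 and
>     // to the owning table.  File naming is from memory - check bryan's
>     // post above.  30832 decimal is 0x7870, so the file is seg0/c7870.dat.
>     public class FindConglomerate {
>         public static void main(String[] args) throws Exception {
>             long conglomId = 30832;
>             System.out.println("file: seg0/c"
>                     + Long.toHexString(conglomId) + ".dat");
>             try (Connection conn =
>                      DriverManager.getConnection("jdbc:derby:theDb");
>                  PreparedStatement ps = conn.prepareStatement(
>                      "SELECT s.schemaname, t.tablename, c.isindex "
>                      + "FROM sys.sysconglomerates c, sys.systables t, "
>                      + "sys.sysschemas s "
>                      + "WHERE c.tableid = t.tableid "
>                      + "AND t.schemaid = s.schemaid "
>                      + "AND c.conglomeratenumber = ?")) {
>                 ps.setLong(1, conglomId);
>                 try (ResultSet rs = ps.executeQuery()) {
>                     while (rs.next())
>                         System.out.println(rs.getString(1) + "."
>                                 + rs.getString(2)
>                                 + (rs.getBoolean(3) ? " (index)"
>                                                     : " (base table)"));
>                 }
>             }
>         }
>     }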
>
> A quick check, if you could get an ls -l of the seg0 directory, would be to look at the
> size of the associated file and do the math bryan mentioned to see whether the file now
> holds the full page; the arithmetic is sketched below.  Including the page size, if you
> can figure it out, would also help, as derby page size vs file system page size can be
> an issue - but usually only on machine crashes.
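>
> The arithmetic: for page N to be fully on disk, the file must be at least
> (N + 1) * pageSize bytes long.  A sketch (the page size here is a guess - derby's
> default is 4096, but large tables are often created with 32768):
>
>     // Sketch: check whether page 1325564 is fully contained in the file.
>     // Page size is a guess; substitute the table's actual page size.
>     public class PageMath {
>         public static void main(String[] args) {
>             long pageNumber = 1325564;   // from Page(1325564, ...)
>             long pageSize = 32768;       // guess - check the real value
>             long needed = (pageNumber + 1) * pageSize;
>             long actual = new java.io.File("seg0/c7870.dat").length();
>             System.out.println("need >= " + needed + " bytes, file has "
>                     + actual + (actual >= needed ? " : full page present"
>                                                  : " : file is short"));
>         }
>     }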
>
> I would suggest filing a JIRA for this.  If it really is the case that you got the I/O
> error for a non-persistent problem, it may be that derby can be improved to avoid it.
> Before the code was changed to use FileChannels, derby often had retry loops on I/O
> errors - especially on reads of pages from disk.  In the long past this just avoided some
> intermittent I/O problems that were in most cases network related (even though we likely
> did not support network disks officially).  Not sure if the old retry code is still around
> in the trunk, as it was there for running on older JVMs.
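>
> The retry idea was roughly the shape below - not derby's actual code, just a sketch of
> the general pattern:
>
>     import java.io.IOException;
>     import java.nio.ByteBuffer;
>     import java.nio.channels.FileChannel;
>
>     // Sketch of a read-with-retry loop, not derby's actual code: retry a
>     // few times on an IOException before giving up, since some I/O errors
>     // (especially over network storage) are transient.
>     public class RetryRead {
>         static void readPage(FileChannel ch, ByteBuffer page, long offset)
>                 throws IOException, InterruptedException {
>             final int maxRetries = 3;
>             for (int attempt = 1; ; attempt++) {
>                 try {
>                     page.clear();
>                     while (page.hasRemaining()) {
>                         if (ch.read(page, offset + page.position()) < 0)
>                             throw new IOException(
>                                     "EOF while reading a whole page");
>                     }
>                     return;                    // got the full page
>                 } catch (IOException ioe) {
>                     if (attempt >= maxRetries)
>                         throw ioe;             // looks persistent - give up
>                     Thread.sleep(100L * attempt);  // brief backoff, retry
>                 }
>             }
>         }
>     }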
>
> Also, I have seen weird timing errors from multiple processes accessing the same file
> (like backup/virus/... vs the server), but mostly on windows OS vs unix based ones.
>
> Getting a partial page read is a very weird error for derby, as it goes out of its way
> to write only full pages.
> On 9/3/2015 5:39 PM, Bryan Pendleton wrote:
>> On 9/3/2015 3:35 PM, Bergquist, Brett wrote:
>>> Reached end of file while attempting to read a whole page
>> You should probably take a close read through all the discussion on
>> this slightly old Derby JIRA Issue:
>>
>> https://issues.apache.org/jira/browse/DERBY-5234
>>
>> There are some suggestions about how to diagnose the conglomerate in
>> question in more detail, and also some observations about possible
>> causes and possible courses of action you can take subsequently.
>>
>> thanks,
>>
>> bryan
>>
>>
>
> --
> email:    Mike Matrigali - mikemapp1@gmail.com
> linkedin: https://www.linkedin.com/in/MikeMatrigali
>
>
> Canoga Perkins
> 20600 Prairie Street
> Chatsworth, CA 91311
> (818) 718-6300
>
> This e-mail and any attached document(s) is confidential and is intended only for the
> review of the party to whom it is addressed. If you have received this transmission in
> error, please notify the sender immediately and discard the original message and any
> attachment(s).
>


-- 
email:    Mike Matrigali - mikemapp1@gmail.com
linkedin: https://www.linkedin.com/in/MikeMatrigali


