hadoop-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Elmar Grote" <elmar.gr...@optivo.de>
Subject fsimage.ckpt are not deleted - Exception in doCheckpoint
Date Fri, 01 Mar 2013 08:36:47 GMT
Hi,

we are writing our fsimage and edits file on the namenode and secondary namenode and additional
on a nfs share.

In these folders we found a a lot of fsimage.ckpt_000000000........
. files, the oldest is from 9. Aug 2012. 
As far a i know these files should be deleted after the secondary namenodes creates the new
fsimage file. 
I looked in our log files from the namenode and secondary namenode to see what happen at that
time.

As example i searched for this file:
20. Feb 04:02 fsimage.ckpt_0000000000726216952

In the namenode log i found this:
2013-02-20 04:02:51,404 ERROR org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException
as:hdfs (auth:SIMPLE) cause:java.io.IOException: Input/output error
2013-02-20   04:02:51,409 WARN org.mortbay.log: /getimage: java.io.IOException:   GetImage
failed. java.io.IOException: Input/output error

In the secondary namenode i think this is the relevant part:
2013-02-20 04:01:16,554 INFO org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: Image
has not changed. Will not download image.
2013-02-20 04:01:16,554 INFO org.apache.hadoop.hdfs.server.namenode.TransferFsImage: Opening
connection to http://s_namenode.domain.local:50070/getimage?getedit=1&startTxId=726172233&endTxId=726216952&storageInfo=-40:1814856193:1341996094997:CID-064c4e47-387d-454d-aa1e-27cec1e816e4
2013-02-20 04:01:16,750 INFO org.apache.hadoop.hdfs.server.namenode.TransferFsImage: Downloaded
file edits_0000000000726172233-0000000000726216952 size 6881797 bytes.
2013-02-20 04:01:16,750 INFO org.apache.hadoop.hdfs.server.namenode.Checkpointer: Checkpointer
about to load edits from 1 stream(s).
2013-02-20 04:01:16,750 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: Reading /var/lib/hdfs_namenode/meta/dfs/namesecondary/current/edits_0000000000726172233-0000000000726216952
expecting start txid #726172233
2013-02-20 04:01:16,987 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: Edits file /var/lib/hdfs_namenode/meta/dfs/namesecondary/current/edits_0000000000726172233-0000000000726216952
of size 6881797 edits # 44720 loaded in 0 seconds.
2013-02-20 04:01:18,023 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: Saving image
file /var/lib/hdfs_namenode/meta/dfs/namesecondary/current/fsimage.ckpt_0000000000726216952
using no compression
2013-02-20 04:01:18,031 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: Saving image
file /var/lib/hdfs_nfs_share/dfs/namesecondary/current/fsimage.ckpt_0000000000726216952 using
no compression
2013-02-20 04:01:40,854 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: Image file of
size 1211973003 saved in 22 seconds.
2013-02-20 04:01:50,762 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: Image file of
size 1211973003 saved in 32 seconds.
2013-02-20 04:01:50,770 INFO org.apache.hadoop.hdfs.server.namenode.NNStorageRetentionManager:
Going to retain 2 images with txid >= 726172232
2013-02-20 04:01:50,770 INFO org.apache.hadoop.hdfs.server.namenode.NNStorageRetentionManager:
Purging old image FSImageFile(file=/var/lib/hdfs_namenode/meta/dfs/namesecondary/current/fsimage_0000000000726121750,
cpktTxId=0000000000726121750)
2013-02-20 04:01:51,000 INFO org.apache.hadoop.hdfs.server.namenode.NNStorageRetentionManager:
Purging old image FSImageFile(file=/var/lib/hdfs_nfs_share/dfs/namesecondary/current/fsimage_0000000000726121750,
cpktTxId=0000000000726121750)
2013-02-20 04:01:51,379 INFO org.apache.hadoop.hdfs.server.namenode.FileJournalManager: Purging
logs older than 725172233
2013-02-20 04:01:51,381 INFO org.apache.hadoop.hdfs.server.namenode.FileJournalManager: Purging
logs older than 725172233
2013-02-20 04:01:51,400 INFO org.apache.hadoop.hdfs.server.namenode.TransferFsImage: Opening
connection to http://s_namenode.domain.local:50070/getimage?putimage=1&txid=726216952&port=50090&storageInfo=-40:1814856193:1341996094997:CID-064c4e47-387d-454d-aa1e-27cec1e816e4
2013-02-20 04:02:51,411 ERROR org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: Exception
in doCheckpoint
org.apache.hadoop.hdfs.server.namenode.TransferFsImage$HttpGetFailedException: Image transfer
servlet at http://s_namenode.domain.local:50070/getimage?putimage=1&txid=726216952&port=50090&storageInfo=-40:1814856193:1341996094997:CID-064c4e47-387d-454d-aa1e-27cec1e816e4
failed with status code 410
Response message:
GetImage failed. java.io.IOException: Input/output error  at sun.nio.ch.FileChannelImpl.force0(Native
Method)  at sun.nio.ch.FileChannelImpl.force(FileChannelImpl.java:358)  at org.apache.hadoop.hdfs.server.namenode.TransferFsImage.getFileClient(TransferFsImage.java:303)
 at org.apache.hadoop.hdfs.server.namenode.TransferFsImage.downloadImageToStorage(TransferFsImage.java:75)
 at org.apache.hadoop.hdfs.server.namenode.GetImageServlet$1.run(GetImageServlet.java:169)
 at org.apache.hadoop.hdfs.server.namenode.GetImageServlet$1.run(GetImageServlet.java:111)
 at java.security.AccessController.doPrivileged(Native Method)  at javax.security.auth.Subject.doAs(Subject.java:396)
 at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1232) 
at org.apache.hadoop.hdfs.server.namenode.GetImageServlet.doGet(GetImageServlet.java:111)
 at javax.servlet.http.HttpServlet.service(HttpServlet.java:707)  at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
 at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:511)  at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1221)
 at org.apache.hadoop.http.HttpServer$QuotingInputFilter.doFilter(HttpServer.java:947)  at
org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)  at
org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)  at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
 at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)  at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
 at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)  at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
 at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)  at org.mortbay.jetty.Server.handle(Server.java:326)
 at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)  at org.mortbay.jetty.Htt
    at org.apache.hadoop.hdfs.server.namenode.TransferFsImage.getFileClient(TransferFsImage.java:216)
    at org.apache.hadoop.hdfs.server.namenode.TransferFsImage.uploadImageFromStorage(TransferFsImage.java:126)
    at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.doCheckpoint(SecondaryNameNode.java:478)
    at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.doWork(SecondaryNameNode.java:334)
    at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode$2.run(SecondaryNameNode.java:301)
    at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:438)
    at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.run(SecondaryNameNode.java:297)
    at java.lang.Thread.run(Thread.java:619)
2013-02-20 04:04:52,592 INFO org.apache.hadoop.hdfs.server.namenode.TransferFsImage: Opening
connection to http://s_namenode.domain.local:50070/getimage?getimage=1&txid=726172232&storageInfo=-40:1814856193:1341996094997:CID-064c4e47-387d-454d-aa1e-27cec1e816e4
2013-02-20 04:05:36,976 INFO org.apache.hadoop.hdfs.server.namenode.TransferFsImage: Downloaded
file fsimage.ckpt_0000000000726172232 size 1212016242 bytes.
2013-02-20 04:05:37,595 INFO org.apache.hadoop.hdfs.server.namenode.TransferFsImage:   Skipping
download of remote edit log [726172233,726216952] since it   already is stored locally at
/var/lib/hdfs_namenode/meta/dfs/namesecondary/current/edits_0000000000726172233-0000000000726216952
2013-02-20 04:05:37,595 INFO org.apache.hadoop.hdfs.server.namenode.TransferFsImage: Opening
connection to http://s_namenode.domain.local:50070/getimage?getedit=1&startTxId=726216953&endTxId=726262269&storageInfo=-40:1814856193:1341996094997:CID-064c4e47-387d-454d-aa1e-27cec1e816e4
2013-02-20 04:05:38,339 INFO org.apache.hadoop.hdfs.server.namenode.TransferFsImage: Downloaded
file edits_0000000000726216953-0000000000726262269 size 7013503 bytes.
2013-02-20 04:05:38,339 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: Loading image
file /var/lib/hdfs_namenode/meta/dfs/namesecondary/current/fsimage_0000000000726172232 using
no compression
2013-02-20 04:05:38,339 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: Number of files
= 8795086
2013-02-20 04:06:13,678 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: Number of files
under construction = 32
2013-02-20 04:06:13,679 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: Image file of
size 1212016242 loaded in 35 seconds.
2013-02-20 04:06:13,679 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: Loaded image
for txid 726172232 from /var/lib/hdfs_namenode/meta/dfs/namesecondary/current/fsimage_0000000000726172232
2013-02-20 04:06:13,679 INFO org.apache.hadoop.hdfs.server.namenode.Checkpointer: Checkpointer
about to load edits from 2 stream(s).
2013-02-20 04:06:13,679 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: Reading /var/lib/hdfs_namenode/meta/dfs/namesecondary/current/edits_0000000000726172233-0000000000726216952
expecting start txid #726172233
2013-02-20 04:06:14,038 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: Edits file /var/lib/hdfs_namenode/meta/dfs/namesecondary/current/edits_0000000000726172233-0000000000726216952
of size 6881797 edits # 44720 loaded in 0 seconds.
2013-02-20 04:06:14,038 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: Reading /var/lib/hdfs_namenode/meta/dfs/namesecondary/current/edits_0000000000726216953-0000000000726262269
expecting start txid #726216953
2013-02-20 04:06:14,372 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: Edits file /var/lib/hdfs_namenode/meta/dfs/namesecondary/current/edits_0000000000726216953-0000000000726262269
of size 7013503 edits # 45317 loaded in 0 seconds.
2013-02-20 04:06:15,285 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: Saving image
file /var/lib/hdfs_namenode/meta/dfs/namesecondary/current/fsimage.ckpt_0000000000726262269
using no compression
2013-02-20 04:06:15,289 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: Saving image
file /var/lib/hdfs_nfs_share/dfs/namesecondary/current/fsimage.ckpt_0000000000726262269 using
no compression
2013-02-20 04:06:38,530 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: Image file of
size 1212107279 saved in 23 seconds.
2013-02-20 04:06:45,380 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: Image file of
size 1212107279 saved in 30 seconds.
2013-02-20 04:06:45,406 INFO org.apache.hadoop.hdfs.server.namenode.NNStorageRetentionManager:
Going to retain 2 images with txid >= 726216952
2013-02-20 04:06:45,406 INFO org.apache.hadoop.hdfs.server.namenode.NNStorageRetentionManager:
Purging old image FSImageFile(file=/var/lib/hdfs_namenode/meta/dfs/namesecondary/current/fsimage_0000000000726172232,
cpktTxId=0000000000726172232)
2013-02-20 04:06:45,646 INFO org.apache.hadoop.hdfs.server.namenode.NNStorageRetentionManager:
Purging old image FSImageFile(file=/var/lib/hdfs_nfs_share/dfs/namesecondary/current/fsimage_0000000000726172232,
cpktTxId=0000000000726172232)
2013-02-20 04:06:46,115 INFO org.apache.hadoop.hdfs.server.namenode.FileJournalManager: Purging
logs older than 725216953
2013-02-20 04:06:46,118 INFO org.apache.hadoop.hdfs.server.namenode.FileJournalManager: Purging
logs older than 725216953
2013-02-20 04:06:46,145 INFO org.apache.hadoop.hdfs.server.namenode.TransferFsImage: Opening
connection to http://s_namenode.domain.local:50070/getimage?putimage=1&txid=726262269&port=50090&storageInfo=-40:1814856193:1341996094997:CID-064c4e47-387d-454d-aa1e-27cec1e816e4
2013-02-20 04:07:31,010 INFO org.apache.hadoop.hdfs.server.namenode.TransferFsImage: Uploaded
image with txid 726262269 to namenode at s_namenode.domain.local:50070
2013-02-20 04:07:31,011 WARN org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: Checkpoint
done. New Image Size: 1212107279
2013-02-20 04:07:31,013 INFO org.apache.hadoop.hdfs.server.namenode.NNStorageRetentionManager:
Going to retain 2 images with txid >= 726216952
2013-02-20 04:07:31,013 INFO org.apache.hadoop.hdfs.server.namenode.FileJournalManager: Purging
logs older than 725216953
2013-02-20 04:07:31,013 INFO org.apache.hadoop.hdfs.server.namenode.FileJournalManager: Purging
logs older than 725216953

At that time we are copying our whole data from this cluster to a second cluster. So we are
reading a lot in this cluster.
In   our monitoring graphs i can not see any peaks at the time this happens.   Only the secondary
namenode takes more memory than usual, about 1G   more. 

What happen here? Was the creation of the new fsimage successful? If so why where the old
fsimag.ckpt not deleted?
Or did we lose some data?

Regards Elmar
Mime
View raw message