hadoop-user mailing list archives

From "Elmar Grote" <elmar.gr...@optivo.de>
Subject Re: fsimage.ckpt are not deleted - Exception in doCheckpoint
Date Fri, 08 Mar 2013 13:43:29 GMT
Hi Yifan,

thank you for the answer.

But as far as I understand it, the SNN downloads the fsimage and edits files from the NN,
builds the new fsimage and uploads it to the NN.

So here the upload didn't work. The next time the checkpoint starts, the old fsimage
is still on the NN.
But what about the edits files? Are the old ones still there? Or were they deleted
during the failed upload of the fsimage? If they were deleted, they are missing and
there should be a loss or inconsistency of data.

Or am I wrong?

When are the edits files deleted? After a successful upload or before?
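Whether anything is missing can usually be checked from the storage directory itself: with the standard fsimage_&lt;txid&gt; and edits_&lt;starttxid&gt;-&lt;endtxid&gt; naming, the NN can restart cleanly as long as the edit segments cover every transaction after the newest finalized image. A minimal sketch, assuming that naming scheme (the example path is from this thread; adjust to your own dfs.namenode.name.dir):

```shell
# check_coverage DIR: print the newest finalized fsimage txid in DIR and the
# edits segments that end past it, i.e. the segments a restart still replays.
# If every txid after the image is covered by some segment, no data is missing.
check_coverage() {
  dir=$1
  # Highest txid among the finalized images (fsimage_<txid>, zero-padded).
  last_img=$(ls "$dir" | sed -n 's/^fsimage_\([0-9]*\)$/\1/p' | sort -n | tail -1)
  echo "newest image txid: $last_img"
  # Segments whose end txid is beyond the newest image are still needed.
  ls "$dir" | sed -n 's/^edits_\([0-9]*\)-\([0-9]*\)$/\1 \2/p' |
  while read start end; do
    if [ "$end" -gt "$last_img" ]; then
      echo "needed: edits_${start}-${end}"
    fi
  done
}

# Example (directory name taken from the logs in this thread):
# check_coverage /var/lib/hdfs_namenode/meta/dfs/name/current
```

This only inspects file names; it does not prove the segment contents are intact, but it rules out the "edits deleted before the image landed" scenario at a glance.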

Regards Elmar

  _____  

From: Yifan Du [mailto:duyifan23@gmail.com]
To: user@hadoop.apache.org
Sent: Fri, 08 Mar 2013 11:08:09 +0100
Subject: Re: fsimage.ckpt are not deleted - Exception in doCheckpoint

I have met this exception too.
  The new fsimage produced by the SNN could not be transferred to the NN.
  My HDFS version is 2.0.0.
  Does anyone know how to fix it?

  @Regards Elmar
  The new fsimage was created successfully, but it could not be
  transferred to the NN, so the old fsimage.ckpt was not deleted.
  I have tried the new fsimage: I started the cluster with the new fsimage
  and the new edits in progress. It started successfully and no data was lost.
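When falling back to a leftover checkpoint image like this, it can be worth inspecting it offline first. The sketch below only picks the newest fsimage.ckpt_* in a directory; the `hdfs oiv` (Offline Image Viewer) invocation in the comment exists in Hadoop 2.x but its flags can vary by version, and the directory is taken from this thread, so treat both as assumptions:

```shell
# newest_ckpt DIR: print the path of the newest leftover checkpoint image.
newest_ckpt() {
  # Txids are zero-padded to fixed width, so lexicographic sort is numeric.
  ls "$1"/fsimage.ckpt_* 2>/dev/null | sort | tail -1
}

# Example (path from the logs in this thread):
# ckpt=$(newest_ckpt /var/lib/hdfs_namenode/meta/dfs/namesecondary/current)
# Dump it to text without touching the running NameNode (Hadoop 2.x OIV):
# hdfs oiv -i "$ckpt" -o /tmp/ckpt_dump.txt
```

If the dump completes and the file counts look plausible, the image itself is readable; corruption during the failed transfer would typically show up here before a risky restart does.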
  
  
  2013/3/6, Elmar Grote <elmar.grote@optivo.de>:
  > Hi,
  >
  > we are writing our fsimage and edits files on the namenode and secondary
  > namenode and additionally on an NFS share.
  >
  > In these folders we found a lot of fsimage.ckpt_000000000........
  > files, the oldest from 9 Aug 2012.
  > As far as I know these files should be deleted after the secondary namenode
  > creates the new fsimage file.
  > I looked in our log files from the namenode and secondary namenode to see
  > what happened at that time.
  >
  > As an example I searched for this file:
  > 20. Feb 04:02 fsimage.ckpt_0000000000726216952
  >
  > In the namenode log I found this:
  > 2013-02-20 04:02:51,404 ERROR
  > org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException
  > as:hdfs (auth:SIMPLE) cause:java.io.IOException: Input/output error
  > 2013-02-20 04:02:51,409 WARN org.mortbay.log: /getimage:
  > java.io.IOException: GetImage failed. java.io.IOException: Input/output
  > error
  >
  > In the secondary namenode log I think this is the relevant part:
  > 2013-02-20 04:01:16,554 INFO
  > org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: Image has not
  > changed. Will not download image.
  > 2013-02-20 04:01:16,554 INFO
  > org.apache.hadoop.hdfs.server.namenode.TransferFsImage: Opening connection
  > to
  > http://s_namenode.domain.local:50070/getimage?getedit=1&startTxId=726172233&endTxId=726216952&storageInfo=-40:1814856193:1341996094997:CID-064c4e47-387d-454d-aa1e-27cec1e816e4
  > 2013-02-20 04:01:16,750 INFO
  > org.apache.hadoop.hdfs.server.namenode.TransferFsImage: Downloaded file
  > edits_0000000000726172233-0000000000726216952 size 6881797 bytes.
  > 2013-02-20 04:01:16,750 INFO
  > org.apache.hadoop.hdfs.server.namenode.Checkpointer: Checkpointer about to
  > load edits from 1 stream(s).
  > 2013-02-20 04:01:16,750 INFO org.apache.hadoop.hdfs.server.namenode.FSImage:
  > Reading
  > /var/lib/hdfs_namenode/meta/dfs/namesecondary/current/edits_0000000000726172233-0000000000726216952
  > expecting start txid #726172233
  > 2013-02-20 04:01:16,987 INFO org.apache.hadoop.hdfs.server.namenode.FSImage:
  > Edits file
  > /var/lib/hdfs_namenode/meta/dfs/namesecondary/current/edits_0000000000726172233-0000000000726216952
  > of size 6881797 edits # 44720 loaded in 0 seconds.
  > 2013-02-20 04:01:18,023 INFO org.apache.hadoop.hdfs.server.namenode.FSImage:
  > Saving image file
  > /var/lib/hdfs_namenode/meta/dfs/namesecondary/current/fsimage.ckpt_0000000000726216952
  > using no compression
  > 2013-02-20 04:01:18,031 INFO org.apache.hadoop.hdfs.server.namenode.FSImage:
  > Saving image file
  > /var/lib/hdfs_nfs_share/dfs/namesecondary/current/fsimage.ckpt_0000000000726216952
  > using no compression
  > 2013-02-20 04:01:40,854 INFO org.apache.hadoop.hdfs.server.namenode.FSImage:
  > Image file of size 1211973003 saved in 22 seconds.
  > 2013-02-20 04:01:50,762 INFO org.apache.hadoop.hdfs.server.namenode.FSImage:
  > Image file of size 1211973003 saved in 32 seconds.
  > 2013-02-20 04:01:50,770 INFO
  > org.apache.hadoop.hdfs.server.namenode.NNStorageRetentionManager: Going to
  > retain 2 images with txid >= 726172232
  > 2013-02-20 04:01:50,770 INFO
  > org.apache.hadoop.hdfs.server.namenode.NNStorageRetentionManager: Purging
  > old image
  > FSImageFile(file=/var/lib/hdfs_namenode/meta/dfs/namesecondary/current/fsimage_0000000000726121750,
  > cpktTxId=0000000000726121750)
  > 2013-02-20 04:01:51,000 INFO
  > org.apache.hadoop.hdfs.server.namenode.NNStorageRetentionManager: Purging
  > old image
  > FSImageFile(file=/var/lib/hdfs_nfs_share/dfs/namesecondary/current/fsimage_0000000000726121750,
  > cpktTxId=0000000000726121750)
  > 2013-02-20 04:01:51,379 INFO
  > org.apache.hadoop.hdfs.server.namenode.FileJournalManager: Purging logs
  > older than 725172233
  > 2013-02-20 04:01:51,381 INFO
  > org.apache.hadoop.hdfs.server.namenode.FileJournalManager: Purging logs
  > older than 725172233
  > 2013-02-20 04:01:51,400 INFO
  > org.apache.hadoop.hdfs.server.namenode.TransferFsImage: Opening connection
  > to
  > http://s_namenode.domain.local:50070/getimage?putimage=1&txid=726216952&port=50090&storageInfo=-40:1814856193:1341996094997:CID-064c4e47-387d-454d-aa1e-27cec1e816e4
  > 2013-02-20 04:02:51,411 ERROR
  > org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: Exception in
  > doCheckpoint
  > org.apache.hadoop.hdfs.server.namenode.TransferFsImage$HttpGetFailedException:
  > Image transfer servlet at
  > http://s_namenode.domain.local:50070/getimage?putimage=1&txid=726216952&port=50090&storageInfo=-40:1814856193:1341996094997:CID-064c4e47-387d-454d-aa1e-27cec1e816e4
  > failed with status code 410
  > Response message:
  > GetImage failed. java.io.IOException: Input/output error  at
  > sun.nio.ch.FileChannelImpl.force0(Native Method)  at
  > sun.nio.ch.FileChannelImpl.force(FileChannelImpl.java:358)  at
  > org.apache.hadoop.hdfs.server.namenode.TransferFsImage.getFileClient(TransferFsImage.java:303)
  >  at
  > org.apache.hadoop.hdfs.server.namenode.TransferFsImage.downloadImageToStorage(TransferFsImage.java:75)
  >  at
  > org.apache.hadoop.hdfs.server.namenode.GetImageServlet$1.run(GetImageServlet.java:169)
  >  at
  > org.apache.hadoop.hdfs.server.namenode.GetImageServlet$1.run(GetImageServlet.java:111)
  >  at java.security.AccessController.doPrivileged(Native Method)  at
  > javax.security.auth.Subject.doAs(Subject.java:396)  at
  > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1232)
  >  at
  > org.apache.hadoop.hdfs.server.namenode.GetImageServlet.doGet(GetImageServlet.java:111)
  >  at javax.servlet.http.HttpServlet.service(HttpServlet.java:707)  at
  > javax.servlet.http.HttpServlet.service(HttpServlet.java:820)  at
  > org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:511)  at
  > org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1221)
  >  at
  > org.apache.hadoop.http.HttpServer$QuotingInputFilter.doFilter(HttpServer.java:947)
  >  at
  > org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
  >  at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
  >  at
  > org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
  > at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
  > at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
  > at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)  at
  > org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
  >  at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
  >  at org.mortbay.jetty.Server.handle(Server.java:326)  at
  > org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)  at
  > org.mortbay.jetty.Htt
  >     at
  > org.apache.hadoop.hdfs.server.namenode.TransferFsImage.getFileClient(TransferFsImage.java:216)
  >     at
  > org.apache.hadoop.hdfs.server.namenode.TransferFsImage.uploadImageFromStorage(TransferFsImage.java:126)
  >     at
  > org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.doCheckpoint(SecondaryNameNode.java:478)
  >     at
  > org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.doWork(SecondaryNameNode.java:334)
  >     at
  > org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode$2.run(SecondaryNameNode.java:301)
  >     at
  > org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:438)
  >     at
  > org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.run(SecondaryNameNode.java:297)
  >     at java.lang.Thread.run(Thread.java:619)
  > 2013-02-20 04:04:52,592 INFO
  > org.apache.hadoop.hdfs.server.namenode.TransferFsImage: Opening connection
  > to
  > http://s_namenode.domain.local:50070/getimage?getimage=1&txid=726172232&storageInfo=-40:1814856193:1341996094997:CID-064c4e47-387d-454d-aa1e-27cec1e816e4
  > 2013-02-20 04:05:36,976 INFO
  > org.apache.hadoop.hdfs.server.namenode.TransferFsImage: Downloaded file
  > fsimage.ckpt_0000000000726172232 size 1212016242 bytes.
  > 2013-02-20 04:05:37,595 INFO
  > org.apache.hadoop.hdfs.server.namenode.TransferFsImage: Skipping download
  > of remote edit log [726172233,726216952] since it already is stored
  > locally at
  > /var/lib/hdfs_namenode/meta/dfs/namesecondary/current/edits_0000000000726172233-0000000000726216952
  > 2013-02-20 04:05:37,595 INFO
  > org.apache.hadoop.hdfs.server.namenode.TransferFsImage: Opening connection
  > to
  > http://s_namenode.domain.local:50070/getimage?getedit=1&startTxId=726216953&endTxId=726262269&storageInfo=-40:1814856193:1341996094997:CID-064c4e47-387d-454d-aa1e-27cec1e816e4
  > 2013-02-20 04:05:38,339 INFO
  > org.apache.hadoop.hdfs.server.namenode.TransferFsImage: Downloaded file
  > edits_0000000000726216953-0000000000726262269 size 7013503 bytes.
  > 2013-02-20 04:05:38,339 INFO org.apache.hadoop.hdfs.server.namenode.FSImage:
  > Loading image file
  > /var/lib/hdfs_namenode/meta/dfs/namesecondary/current/fsimage_0000000000726172232
  > using no compression
  > 2013-02-20 04:05:38,339 INFO org.apache.hadoop.hdfs.server.namenode.FSImage:
  > Number of files = 8795086
  > 2013-02-20 04:06:13,678 INFO org.apache.hadoop.hdfs.server.namenode.FSImage:
  > Number of files under construction = 32
  > 2013-02-20 04:06:13,679 INFO org.apache.hadoop.hdfs.server.namenode.FSImage:
  > Image file of size 1212016242 loaded in 35 seconds.
  > 2013-02-20 04:06:13,679 INFO org.apache.hadoop.hdfs.server.namenode.FSImage:
  > Loaded image for txid 726172232 from
  > /var/lib/hdfs_namenode/meta/dfs/namesecondary/current/fsimage_0000000000726172232
  > 2013-02-20 04:06:13,679 INFO
  > org.apache.hadoop.hdfs.server.namenode.Checkpointer: Checkpointer about to
  > load edits from 2 stream(s).
  > 2013-02-20 04:06:13,679 INFO org.apache.hadoop.hdfs.server.namenode.FSImage:
  > Reading
  > /var/lib/hdfs_namenode/meta/dfs/namesecondary/current/edits_0000000000726172233-0000000000726216952
  > expecting start txid #726172233
  > 2013-02-20 04:06:14,038 INFO org.apache.hadoop.hdfs.server.namenode.FSImage:
  > Edits file
  > /var/lib/hdfs_namenode/meta/dfs/namesecondary/current/edits_0000000000726172233-0000000000726216952
  > of size 6881797 edits # 44720 loaded in 0 seconds.
  > 2013-02-20 04:06:14,038 INFO org.apache.hadoop.hdfs.server.namenode.FSImage:
  > Reading
  > /var/lib/hdfs_namenode/meta/dfs/namesecondary/current/edits_0000000000726216953-0000000000726262269
  > expecting start txid #726216953
  > 2013-02-20 04:06:14,372 INFO org.apache.hadoop.hdfs.server.namenode.FSImage:
  > Edits file
  > /var/lib/hdfs_namenode/meta/dfs/namesecondary/current/edits_0000000000726216953-0000000000726262269
  > of size 7013503 edits # 45317 loaded in 0 seconds.
  > 2013-02-20 04:06:15,285 INFO org.apache.hadoop.hdfs.server.namenode.FSImage:
  > Saving image file
  > /var/lib/hdfs_namenode/meta/dfs/namesecondary/current/fsimage.ckpt_0000000000726262269
  > using no compression
  > 2013-02-20 04:06:15,289 INFO org.apache.hadoop.hdfs.server.namenode.FSImage:
  > Saving image file
  > /var/lib/hdfs_nfs_share/dfs/namesecondary/current/fsimage.ckpt_0000000000726262269
  > using no compression
  > 2013-02-20 04:06:38,530 INFO org.apache.hadoop.hdfs.server.namenode.FSImage:
  > Image file of size 1212107279 saved in 23 seconds.
  > 2013-02-20 04:06:45,380 INFO org.apache.hadoop.hdfs.server.namenode.FSImage:
  > Image file of size 1212107279 saved in 30 seconds.
  > 2013-02-20 04:06:45,406 INFO
  > org.apache.hadoop.hdfs.server.namenode.NNStorageRetentionManager: Going to
  > retain 2 images with txid >= 726216952
  > 2013-02-20 04:06:45,406 INFO
  > org.apache.hadoop.hdfs.server.namenode.NNStorageRetentionManager: Purging
  > old image
  > FSImageFile(file=/var/lib/hdfs_namenode/meta/dfs/namesecondary/current/fsimage_0000000000726172232,
  > cpktTxId=0000000000726172232)
  > 2013-02-20 04:06:45,646 INFO
  > org.apache.hadoop.hdfs.server.namenode.NNStorageRetentionManager: Purging
  > old image
  > FSImageFile(file=/var/lib/hdfs_nfs_share/dfs/namesecondary/current/fsimage_0000000000726172232,
  > cpktTxId=0000000000726172232)
  > 2013-02-20 04:06:46,115 INFO
  > org.apache.hadoop.hdfs.server.namenode.FileJournalManager: Purging logs
  > older than 725216953
  > 2013-02-20 04:06:46,118 INFO
  > org.apache.hadoop.hdfs.server.namenode.FileJournalManager: Purging logs
  > older than 725216953
  > 2013-02-20 04:06:46,145 INFO
  > org.apache.hadoop.hdfs.server.namenode.TransferFsImage: Opening connection
  > to
  > http://s_namenode.domain.local:50070/getimage?putimage=1&txid=726262269&port=50090&storageInfo=-40:1814856193:1341996094997:CID-064c4e47-387d-454d-aa1e-27cec1e816e4
  > 2013-02-20 04:07:31,010 INFO
  > org.apache.hadoop.hdfs.server.namenode.TransferFsImage: Uploaded image with
  > txid 726262269 to namenode at s_namenode.domain.local:50070
  > 2013-02-20 04:07:31,011 WARN
  > org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: Checkpoint done.
  > New Image Size: 1212107279
  > 2013-02-20 04:07:31,013 INFO
  > org.apache.hadoop.hdfs.server.namenode.NNStorageRetentionManager: Going to
  > retain 2 images with txid >= 726216952
  > 2013-02-20 04:07:31,013 INFO
  > org.apache.hadoop.hdfs.server.namenode.FileJournalManager: Purging logs
  > older than 725216953
  > 2013-02-20 04:07:31,013 INFO
  > org.apache.hadoop.hdfs.server.namenode.FileJournalManager: Purging logs
  > older than 725216953
  >
  > At that time we were copying our whole data set from this cluster to a second
  > cluster, so there was a lot of reading on this cluster.
  > In our monitoring graphs I can not see any peaks at the time this happens.
  > Only the secondary namenode takes more memory than usual, about 1 GB more.
  >
  >
  > What happened here? Was the creation of the new fsimage successful? If so, why
  > were the old fsimage.ckpt files not deleted?
  >
  > Or did we lose some data?
  >
  > Regards Elmar
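For the original symptom here, fsimage.ckpt_* files piling up since Aug 2012, a sketch like the following identifies checkpoint images that were superseded by a later finalized fsimage in the same directory; the example path comes from the logs in this thread, and anything the function prints should be double-checked against the retained images before deleting:

```shell
# stale_ckpts DIR: print leftover fsimage.ckpt_* files whose txid is at or
# below the newest finalized fsimage in DIR; those checkpoints were
# superseded and are safe candidates for cleanup after manual review.
stale_ckpts() {
  dir=$1
  last=$(ls "$dir" | sed -n 's/^fsimage_\([0-9]*\)$/\1/p' | sort -n | tail -1)
  ls "$dir" | sed -n 's/^fsimage\.ckpt_\([0-9]*\)$/\1/p' |
  while read txid; do
    if [ -n "$last" ] && [ "$txid" -le "$last" ]; then
      echo "$dir/fsimage.ckpt_$txid"
    fi
  done
}

# Review the list first, then remove (path from the logs in this thread):
# stale_ckpts /var/lib/hdfs_namenode/meta/dfs/namesecondary/current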