hadoop-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Harsh J <ha...@cloudera.com>
Subject Re: fsimage.ckpt are not deleted - Exception in doCheckpoint
Date Wed, 20 Mar 2013 07:41:46 GMT
I don't think there's a data loss here. However, I think you may have been
affected by https://issues.apache.org/jira/browse/HDFS-4301 due to a large
fsimage size perhaps, which you can workaround by increasing the default
timeout via property dfs.image.transfer.timeout (default 60000ms, i.e. 1
minute) to > 10 minutes or so in ms value.

On Fri, Mar 1, 2013 at 2:06 PM, Elmar Grote <elmar.grote@optivo.de> wrote:

> **
> Hi,
>
> we are writing our fsimage and edits file on the namenode and secondary
> namenode and additional on a nfs share.
>
> In these folders we found a a lot of fsimage.ckpt_000000000........
> **. files, the oldest is from 9. Aug 2012.
> As far a i know these files should be deleted after the secondary
> namenodes creates the new fsimage file.
> I looked in our log files from the namenode and secondary namenode to see
> what happen at that time.
>
> As example i searched for this file:
> 20. Feb 04:02 fsimage.ckpt_**0000000000726216952
>
> In the namenode log i found this:
> 2013-02-20 04:02:51,404 ERROR org.apache.hadoop.security.**UserGroupInformation:
> PriviledgedActionException as:hdfs (auth:SIMPLE) cause:java.io.IOException:
> Input/output error
> 2013-02-20 04:02:51,409 WARN org.mortbay.log: /getimage:
> java.io.IOException: GetImage failed. java.io.IOException: Input/output
> error
>
> In the secondary namenode i think this is the relevant part:
> 2013-02-20 04:01:16,554 INFO org.apache.hadoop.hdfs.server.**namenode.SecondaryNameNode:
> Image has not changed. Will not download image.
> 2013-02-20 04:01:16,554 INFO org.apache.hadoop.hdfs.server.**namenode.TransferFsImage:
> Opening connection to http://s_namenode.domain.**
> local:50070/getimage?getedit=**1&startTxId=726172233&endTxId=**
> 726216952&storageInfo=-40:**1814856193:1341996094997:CID-**
> 064c4e47-387d-454d-aa1e-**27cec1e816e4<http://s_namenode.domain.local:50070/getimage?getedit=1&startTxId=726172233&endTxId=726216952&storageInfo=-40:1814856193:1341996094997:CID-064c4e47-387d-454d-aa1e-27cec1e816e4>
> 2013-02-20 04:01:16,750 INFO org.apache.hadoop.hdfs.server.**namenode.TransferFsImage:
> Downloaded file edits_0000000000726172233-**0000000000726216952 size
> 6881797 bytes.
> 2013-02-20 04:01:16,750 INFO org.apache.hadoop.hdfs.server.**namenode.Checkpointer:
> Checkpointer about to load edits from 1 stream(s).
> 2013-02-20 04:01:16,750 INFO org.apache.hadoop.hdfs.server.**namenode.FSImage:
> Reading /var/lib/hdfs_namenode/meta/**dfs/namesecondary/current/**
> edits_0000000000726172233-**0000000000726216952 expecting start txid
> #726172233
> 2013-02-20 04:01:16,987 INFO org.apache.hadoop.hdfs.server.**namenode.FSImage:
> Edits file /var/lib/hdfs_namenode/meta/**dfs/namesecondary/current/**
> edits_0000000000726172233-**0000000000726216952 of size 6881797 edits #
> 44720 loaded in 0 seconds.
> 2013-02-20 04:01:18,023 INFO org.apache.hadoop.hdfs.server.**namenode.FSImage:
> Saving image file /var/lib/hdfs_namenode/meta/**dfs/namesecondary/current/
> **fsimage.ckpt_**0000000000726216952 using no compression
> 2013-02-20 04:01:18,031 INFO org.apache.hadoop.hdfs.server.**namenode.FSImage:
> Saving image file /var/lib/hdfs_nfs_share/dfs/**
> namesecondary/current/fsimage.**ckpt_0000000000726216952 using no
> compression
> 2013-02-20 04:01:40,854 INFO org.apache.hadoop.hdfs.server.**namenode.FSImage:
> Image file of size 1211973003 saved in 22 seconds.
> 2013-02-20 04:01:50,762 INFO org.apache.hadoop.hdfs.server.**namenode.FSImage:
> Image file of size 1211973003 saved in 32 seconds.
> 2013-02-20 04:01:50,770 INFO org.apache.hadoop.hdfs.server.**namenode.**NNStorageRetentionManager:
> Going to retain 2 images with txid >= 726172232
> 2013-02-20 04:01:50,770 INFO org.apache.hadoop.hdfs.server.**namenode.**NNStorageRetentionManager:
> Purging old image FSImageFile(file=/var/lib/**hdfs_namenode/meta/dfs/**
> namesecondary/current/fsimage_**0000000000726121750,
> cpktTxId=0000000000726121750)
> 2013-02-20 04:01:51,000 INFO org.apache.hadoop.hdfs.server.**namenode.**NNStorageRetentionManager:
> Purging old image FSImageFile(file=/var/lib/**hdfs_nfs_share/dfs/**
> namesecondary/current/fsimage_**0000000000726121750,
> cpktTxId=0000000000726121750)
> 2013-02-20 04:01:51,379 INFO org.apache.hadoop.hdfs.server.**namenode.FileJournalManager:
> Purging logs older than 725172233
> 2013-02-20 04:01:51,381 INFO org.apache.hadoop.hdfs.server.**namenode.FileJournalManager:
> Purging logs older than 725172233
> 2013-02-20 04:01:51,400 INFO org.apache.hadoop.hdfs.server.**namenode.TransferFsImage:
> Opening connection to http://s_namenode.domain.**
> local:50070/getimage?putimage=**1&txid=726216952&port=50090&**
> storageInfo=-40:1814856193:**1341996094997:CID-064c4e47-**
> 387d-454d-aa1e-27cec1e816e4<http://s_namenode.domain.local:50070/getimage?putimage=1&txid=726216952&port=50090&storageInfo=-40:1814856193:1341996094997:CID-064c4e47-387d-454d-aa1e-27cec1e816e4>
> 2013-02-20 04:02:51,411 ERROR org.apache.hadoop.hdfs.server.**namenode.SecondaryNameNode:
> Exception in doCheckpoint
> org.apache.hadoop.hdfs.server.**namenode.TransferFsImage$**HttpGetFailedException:
> Image transfer servlet at http://s_namenode.domain.**
> local:50070/getimage?putimage=**1&txid=726216952&port=50090&**
> storageInfo=-40:1814856193:**1341996094997:CID-064c4e47-**
> 387d-454d-aa1e-27cec1e816e4<http://s_namenode.domain.local:50070/getimage?putimage=1&txid=726216952&port=50090&storageInfo=-40:1814856193:1341996094997:CID-064c4e47-387d-454d-aa1e-27cec1e816e4>failed
with status code 410
> Response message:
> GetImage failed. java.io.IOException: Input/output error  at
> sun.nio.ch.FileChannelImpl.**force0(Native Method)  at
> sun.nio.ch.FileChannelImpl.**force(FileChannelImpl.java:**358)  at
> org.apache.hadoop.hdfs.server.**namenode.TransferFsImage.**
> getFileClient(TransferFsImage.**java:303)  at
> org.apache.hadoop.hdfs.server.**namenode.TransferFsImage.**
> downloadImageToStorage(**TransferFsImage.java:75)  at
> org.apache.hadoop.hdfs.server.**namenode.GetImageServlet$1.**run(GetImageServlet.java:169)
> at org.apache.hadoop.hdfs.server.**namenode.GetImageServlet$1.**run(GetImageServlet.java:111)
> at java.security.**AccessController.doPrivileged(**Native Method)  at
> javax.security.auth.Subject.**doAs(Subject.java:396)  at
> org.apache.hadoop.security.**UserGroupInformation.doAs(**
> UserGroupInformation.java:**1232)  at org.apache.hadoop.hdfs.server.**
> namenode.GetImageServlet.**doGet(GetImageServlet.java:**111)  at
> javax.servlet.http.**HttpServlet.service(**HttpServlet.java:707)  at
> javax.servlet.http.**HttpServlet.service(**HttpServlet.java:820)  at
> org.mortbay.jetty.servlet.**ServletHolder.handle(**ServletHolder.java:511)
> at org.mortbay.jetty.servlet.**ServletHandler$CachedChain.**
> doFilter(ServletHandler.java:**1221)  at org.apache.hadoop.http.**
> HttpServer$QuotingInputFilter.**doFilter(HttpServer.java:947)  at
> org.mortbay.jetty.servlet.**ServletHandler$CachedChain.**
> doFilter(ServletHandler.java:**1212)  at org.mortbay.jetty.servlet.**
> ServletHandler.handle(**ServletHandler.java:399)  at
> org.mortbay.jetty.security.**SecurityHandler.handle(**SecurityHandler.java:216)
> at org.mortbay.jetty.servlet.**SessionHandler.handle(**SessionHandler.java:182)
> at org.mortbay.jetty.handler.**ContextHandler.handle(**ContextHandler.java:766)
> at org.mortbay.jetty.webapp.**WebAppContext.handle(**WebAppContext.java:450)
> at org.mortbay.jetty.handler.**ContextHandlerCollection.**handle(**
> ContextHandlerCollection.java:**230)  at org.mortbay.jetty.handler.**
> HandlerWrapper.handle(**HandlerWrapper.java:152)  at
> org.mortbay.jetty.Server.**handle(Server.java:326)  at org.mortbay.jetty.*
> *HttpConnection.handleRequest(**HttpConnection.java:542)  at
> org.mortbay.jetty.Htt
>     at org.apache.hadoop.hdfs.server.**namenode.TransferFsImage.**
> getFileClient(TransferFsImage.**java:216)
>     at org.apache.hadoop.hdfs.server.**namenode.TransferFsImage.**
> uploadImageFromStorage(**TransferFsImage.java:126)
>     at org.apache.hadoop.hdfs.server.**namenode.SecondaryNameNode.**
> doCheckpoint(**SecondaryNameNode.java:478)
>     at org.apache.hadoop.hdfs.server.**namenode.SecondaryNameNode.**
> doWork(SecondaryNameNode.java:**334)
>     at org.apache.hadoop.hdfs.server.**namenode.SecondaryNameNode$2.**
> run(SecondaryNameNode.java:**301)
>     at org.apache.hadoop.security.**SecurityUtil.**doAsLoginUserOrFatal(**
> SecurityUtil.java:438)
>     at org.apache.hadoop.hdfs.server.**namenode.SecondaryNameNode.**
> run(SecondaryNameNode.java:**297)
>     at java.lang.Thread.run(Thread.**java:619)
> 2013-02-20 04:04:52,592 INFO org.apache.hadoop.hdfs.server.**namenode.TransferFsImage:
> Opening connection to http://s_namenode.domain.**
> local:50070/getimage?getimage=**1&txid=726172232&storageInfo=-**
> 40:1814856193:1341996094997:**CID-064c4e47-387d-454d-aa1e-**27cec1e816e4<http://s_namenode.domain.local:50070/getimage?getimage=1&txid=726172232&storageInfo=-40:1814856193:1341996094997:CID-064c4e47-387d-454d-aa1e-27cec1e816e4>
> 2013-02-20 04:05:36,976 INFO org.apache.hadoop.hdfs.server.**namenode.TransferFsImage:
> Downloaded file fsimage.ckpt_**0000000000726172232 size 1212016242 bytes.
> 2013-02-20 04:05:37,595 INFO org.apache.hadoop.hdfs.server.**namenode.TransferFsImage:
> Skipping download of remote edit log [726172233,726216952] since it already
> is stored locally at /var/lib/hdfs_namenode/meta/**
> dfs/namesecondary/current/**edits_0000000000726172233-**
> 0000000000726216952
> 2013-02-20 04:05:37,595 INFO org.apache.hadoop.hdfs.server.**namenode.TransferFsImage:
> Opening connection to http://s_namenode.domain.**
> local:50070/getimage?getedit=**1&startTxId=726216953&endTxId=**
> 726262269&storageInfo=-40:**1814856193:1341996094997:CID-**
> 064c4e47-387d-454d-aa1e-**27cec1e816e4<http://s_namenode.domain.local:50070/getimage?getedit=1&startTxId=726216953&endTxId=726262269&storageInfo=-40:1814856193:1341996094997:CID-064c4e47-387d-454d-aa1e-27cec1e816e4>
> 2013-02-20 04:05:38,339 INFO org.apache.hadoop.hdfs.server.**namenode.TransferFsImage:
> Downloaded file edits_0000000000726216953-**0000000000726262269 size
> 7013503 bytes.
> 2013-02-20 04:05:38,339 INFO org.apache.hadoop.hdfs.server.**namenode.FSImage:
> Loading image file /var/lib/hdfs_namenode/meta/**
> dfs/namesecondary/current/**fsimage_0000000000726172232 using no
> compression
> 2013-02-20 04:05:38,339 INFO org.apache.hadoop.hdfs.server.**namenode.FSImage:
> Number of files = 8795086
> 2013-02-20 04:06:13,678 INFO org.apache.hadoop.hdfs.server.**namenode.FSImage:
> Number of files under construction = 32
> 2013-02-20 04:06:13,679 INFO org.apache.hadoop.hdfs.server.**namenode.FSImage:
> Image file of size 1212016242 loaded in 35 seconds.
> 2013-02-20 04:06:13,679 INFO org.apache.hadoop.hdfs.server.**namenode.FSImage:
> Loaded image for txid 726172232 from /var/lib/hdfs_namenode/meta/**
> dfs/namesecondary/current/**fsimage_0000000000726172232
> 2013-02-20 04:06:13,679 INFO org.apache.hadoop.hdfs.server.**namenode.Checkpointer:
> Checkpointer about to load edits from 2 stream(s).
> 2013-02-20 04:06:13,679 INFO org.apache.hadoop.hdfs.server.**namenode.FSImage:
> Reading /var/lib/hdfs_namenode/meta/**dfs/namesecondary/current/**
> edits_0000000000726172233-**0000000000726216952 expecting start txid
> #726172233
> 2013-02-20 04:06:14,038 INFO org.apache.hadoop.hdfs.server.**namenode.FSImage:
> Edits file /var/lib/hdfs_namenode/meta/**dfs/namesecondary/current/**
> edits_0000000000726172233-**0000000000726216952 of size 6881797 edits #
> 44720 loaded in 0 seconds.
> 2013-02-20 04:06:14,038 INFO org.apache.hadoop.hdfs.server.**namenode.FSImage:
> Reading /var/lib/hdfs_namenode/meta/**dfs/namesecondary/current/**
> edits_0000000000726216953-**0000000000726262269 expecting start txid
> #726216953
> 2013-02-20 04:06:14,372 INFO org.apache.hadoop.hdfs.server.**namenode.FSImage:
> Edits file /var/lib/hdfs_namenode/meta/**dfs/namesecondary/current/**
> edits_0000000000726216953-**0000000000726262269 of size 7013503 edits #
> 45317 loaded in 0 seconds.
> 2013-02-20 04:06:15,285 INFO org.apache.hadoop.hdfs.server.**namenode.FSImage:
> Saving image file /var/lib/hdfs_namenode/meta/**dfs/namesecondary/current/
> **fsimage.ckpt_**0000000000726262269 using no compression
> 2013-02-20 04:06:15,289 INFO org.apache.hadoop.hdfs.server.**namenode.FSImage:
> Saving image file /var/lib/hdfs_nfs_share/dfs/**
> namesecondary/current/fsimage.**ckpt_0000000000726262269 using no
> compression
> 2013-02-20 04:06:38,530 INFO org.apache.hadoop.hdfs.server.**namenode.FSImage:
> Image file of size 1212107279 saved in 23 seconds.
> 2013-02-20 04:06:45,380 INFO org.apache.hadoop.hdfs.server.**namenode.FSImage:
> Image file of size 1212107279 saved in 30 seconds.
> 2013-02-20 04:06:45,406 INFO org.apache.hadoop.hdfs.server.**namenode.**NNStorageRetentionManager:
> Going to retain 2 images with txid >= 726216952
> 2013-02-20 04:06:45,406 INFO org.apache.hadoop.hdfs.server.**namenode.**NNStorageRetentionManager:
> Purging old image FSImageFile(file=/var/lib/**hdfs_namenode/meta/dfs/**
> namesecondary/current/fsimage_**0000000000726172232,
> cpktTxId=0000000000726172232)
> 2013-02-20 04:06:45,646 INFO org.apache.hadoop.hdfs.server.**namenode.**NNStorageRetentionManager:
> Purging old image FSImageFile(file=/var/lib/**hdfs_nfs_share/dfs/**
> namesecondary/current/fsimage_**0000000000726172232,
> cpktTxId=0000000000726172232)
> 2013-02-20 04:06:46,115 INFO org.apache.hadoop.hdfs.server.**namenode.FileJournalManager:
> Purging logs older than 725216953
> 2013-02-20 04:06:46,118 INFO org.apache.hadoop.hdfs.server.**namenode.FileJournalManager:
> Purging logs older than 725216953
> 2013-02-20 04:06:46,145 INFO org.apache.hadoop.hdfs.server.**namenode.TransferFsImage:
> Opening connection to http://s_namenode.domain.**
> local:50070/getimage?putimage=**1&txid=726262269&port=50090&**
> storageInfo=-40:1814856193:**1341996094997:CID-064c4e47-**
> 387d-454d-aa1e-27cec1e816e4<http://s_namenode.domain.local:50070/getimage?putimage=1&txid=726262269&port=50090&storageInfo=-40:1814856193:1341996094997:CID-064c4e47-387d-454d-aa1e-27cec1e816e4>
> 2013-02-20 04:07:31,010 INFO org.apache.hadoop.hdfs.server.**namenode.TransferFsImage:
> Uploaded image with txid 726262269 to namenode at
> s_namenode.domain.local:50070
> 2013-02-20 04:07:31,011 WARN org.apache.hadoop.hdfs.server.**namenode.SecondaryNameNode:
> Checkpoint done. New Image Size: 1212107279
> 2013-02-20 04:07:31,013 INFO org.apache.hadoop.hdfs.server.**namenode.**NNStorageRetentionManager:
> Going to retain 2 images with txid >= 726216952
> 2013-02-20 04:07:31,013 INFO org.apache.hadoop.hdfs.server.**namenode.FileJournalManager:
> Purging logs older than 725216953
> 2013-02-20 04:07:31,013 INFO org.apache.hadoop.hdfs.server.**namenode.FileJournalManager:
> Purging logs older than 725216953
>
> At that time we are copying our whole data from this cluster to a second
> cluster. So we are reading a lot in this cluster.
> In our monitoring graphs i can not see any peaks at the time this happens.
> Only the secondary namenode takes more memory than usual, about 1G more.
>
> What happen here? Was the creation of the new fsimage successful? If so
> why where the old fsimag.ckpt not deleted?
> Or did we lose some data?
>
> Regards Elmar
>



-- 
Harsh J

Mime
View raw message