hadoop-hdfs-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Jason Venner <jason.had...@gmail.com>
Subject Re: Problem with the secondary namenode being unable to getImage?putImage due to edit file time differences, hadoop 0.19.0
Date Thu, 15 Oct 2009 04:28:20 GMT
I have some more detail now:
I had to hack the Transfer class to actually log the error message when the
http requests failed, a good thing to do.

Also I am going ot put something into the dfshealth jsp that indicates the
age of the fsimage and the size of the log files and flags it if the fsimage
should have been updated.

I destroyed an hdfs cluster this week where the secondary had fallen over 2
months ago and no one notices.
I interrupted the namenode when it was restarting after a crash due to an
operational issue, and the next restart did not recover the edit logs, and
by the time I realized that the cluster had rolled back the fsimage, two
months, it was to late.

-rw-rw-rw- 1 --core --core    515704 2009-10-14 02:54:25.000000000 +0000
d0/dfs/name/current/edits
-rw-rw-rw- 1 --core --core 159229404 2009-10-14 23:53:09.000000000 +0000
d0/dfs/name/current/edits.new
-rw-rw-rw- 1 --core --core    515704 2009-10-14 02:54:25.000000000 +0000
d1/dfs/name/current/edits
-rw-rw-rw- 1 --core --core 159229404 2009-10-14 23:53:09.000000000 +0000
d1/dfs/name/current/edits.new
-rw-rw-rw- 1 --core --core    515704 2009-10-14 02:54:25.000000000 +0000
/local/data/hadoop/namenode/current/edits
-rw-rw-rw- 1 --core --core 159229404 2009-10-14 23:53:09.187993000 +0000
/local/data/hadoop/namenode/current/edits.new



-rw-rw-rw- 1 --core --core    515704 2009-10-14 02:54:25.000000000 +0000
d0/dfs/name/current/edits
-rw-rw-rw- 1 --core --core 159229404 2009-10-14 23:53:09.000000000 +0000
d0/dfs/name/current/edits.new
-rw-rw-rw- 1 --core --core    515704 2009-10-14 02:54:25.000000000 +0000
d1/dfs/name/current/edits
-rw-rw-rw- 1 --core --core 159229404 2009-10-14 23:53:09.000000000 +0000
d1/dfs/name/current/edits.new
-rw-rw-rw- 1 --core --core    515704 2009-10-14 02:54:25.000000000 +0000
/local/data/hadoop/namenode/current/edits
-rw-rw-rw- 1 --core --core 159229404 2009-10-14 23:53:09.187993000 +0000
/local/data/hadoop/namenode/current/edits.new

2009-10-14 23:56:10,156 INFO
org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Number of transactions:
0 Total time for transactions(ms): 0 Number of syncs: 0 SyncTimes(ms): 0
2009-10-14 23:56:10,176 INFO org.apache.hadoop.hdfs.server.common.Storage:
Completing previous checkpoint for storage directory
/local/data/hadoop/d0/tmp/dfs/namesecondary.
2009-10-14 23:56:52,870 INFO
org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: Downloaded file
fsimage size 700677172 bytes.
2009-10-14 23:56:52,936 INFO
org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: Downloaded file
edits size 515704 bytes.
2009-10-14 23:56:53,296 INFO
org.apache.hadoop.hdfs.server.namenode.FSNamesystem:
fsOwner=--core,--core,openapp,clsdapp
2009-10-14 23:56:53,297 INFO
org.apache.hadoop.hdfs.server.namenode.FSNamesystem: supergroup=supergroup
2009-10-14 23:56:53,297 INFO
org.apache.hadoop.hdfs.server.namenode.FSNamesystem:
isPermissionEnabled=true
2009-10-14 23:56:53,305 INFO org.apache.hadoop.hdfs.server.common.Storage:
Number of files = 4976865
^Z
[1]+  Stopped                 /usr/local/bin/gtail -F
launcher-secondarynamenode.log
~--core/log/hadoop---core-z10299b.ZZZZops.com.log  (wd:
/local/home/--core/deploy/current)
(wd now: /local/data/hadoop)
eng@hadoop.dsp.--o:/local/data/hadoop:!find
find . -name edit'*' -ls
78913  131 -rw-rw-rw-   1 --core   --core          4 Oct 14 23:56
./d0/tmp/dfs/namesecondary/lastcheckpoint.tmp/edits
78919  515 -rw-rw-rw-   1 --core   --core     515704 Oct 14 23:56
./d0/tmp/dfs/namesecondary/current/edits
78905  131 -rw-rw-rw-   1 --core   --core          4 Oct 14 23:37
./d0/tmp/dfs/namesecondary/previous.checkpoint/edits
23797    1 -rw-rw-rw-   1 --core   --core          4 Feb 21  2009
./d0/tmp/dfs/namesecondary/edits.tmp

2009-10-14 23:58:28,876 INFO org.apache.hadoop.hdfs.server.common.Storage:
Number of files under construction = 682
2009-10-14 23:58:34,702 INFO org.apache.hadoop.hdfs.server.common.Storage:
Edits file /local/data/hadoop/d0/tmp/dfs/namesecondary/current/edits of size
515704 edits # 3337 loaded in 5 seconds.
2009-10-14 23:58:44,401 INFO org.apache.hadoop.hdfs.server.common.Storage:
Image file of size 700774610 saved in 8 seconds.
2009-10-14 23:58:51,038 INFO
org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Number of transactions:
0 Total time for transactions(ms): 0 Number of syncs: 0 SyncTimes(ms): 0
2009-10-14 23:58:53,238 INFO
org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: Posted URL
namenode.ZZZZinc.com:50070
putimage=1&port=50090&machine=XX.16.YY.119&token=-18:47734869:1235266738838:1255564569000:1255488167530
2009-10-14 23:58:53,264 ERROR
org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: Exception in
doCheckpoint:
2009-10-14 23:58:53,264 ERROR
org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode:
java.io.IOException: Unable to connect to
http://namenode.ZZZZinc.com:50070/getimage?putimage=1&port=50090&machine=XX.16.YY.119&token=-18:47734869:1235266738838:1255564569000:1255488167530:
410 -
GetImage+failed%2E+java%2Eio%2EIOException%3A+Namenode+has+an+edit+log+with+timestamp+of+2009%2D10%2D14+23%3A58%3A53+but+new+checkpoint+was+created+using+editlog++with+timestamp+2009%2D10%2D14+23%3A56%3A09%2E+Checkpoint+Aborted%2E%0A%09at+org%2Eapache%2Ehadoop%2Ehdfs%2Eserver%2Enamenode%2EFSImage%2EvalidateCheckpointUpload%28FSImage%2Ejava%3A1362%29%0A%09at+org%2Eapache%2Ehadoop%2Ehdfs%2Eserver%2Enamenode%2EGetImageServlet%2EdoGet%28GetImageServlet%2Ejava%3A57%29%0A%09at+javax%2Eservlet%2Ehttp%2EHttpServlet%2Eservice%28HttpServlet%2Ejava%3A689%29%0A%09at+javax%2Eservlet%2Ehttp%2EHttpServlet%2Eservice%28HttpServlet%2Ejava%3A802%29%0A%09at+org%2Emortbay%2Ejetty%2Eservlet%2EServletHolder%2Ehandle%28ServletHolder%2Ejava%3A427%29%0A%09at+org%2Emortbay%2Ejetty%2Eservlet%2EWebApplicationHandler%2Edispatch%28WebApplicationHandler%2Ejava%3A475%29%0A%09at+org%2Emortbay%2Ejetty%2Eservlet%2EServletHandler%2Ehandle%28ServletHandler%2Ejava%3A567%29%0A%09at+org%2Emortbay%2Ehttp%2EHttpContext%2Ehandle%28HttpContext%2Ejava%3A1565%29%0A%09at+org%2Emortbay%2Ejetty%2Eservlet%2EWebApplicationContext%2Ehandle%28WebApplicationContext%2Ejava%3A635%29%0A%09at+org%2Emortbay%2Ehttp%2EHttpContext%2Ehandle%28HttpContext%2Ejava%3A1517%29%0A%09at+org%2Emortbay%2Ehttp%2EHttpServer%2Eservice%28HttpServer%2Ejava%3A954%29%0A%09at+org%2Emortbay%2Ehttp%2EHttpConnection%2Eservice%28HttpConnection%2Ejava%3A814%29%0A%09at+org%2Emortbay%2Ehttp%2EHttpConnection%2EhandleNext%28HttpConnection%2Ejava%3A981%29%0A%09at+org%2Emortbay%2Ehttp%2EHttpConnection%2Ehandle%28HttpConnection%2Ejava%3A831%29%0A%09at+org%2Emortbay%2Ehttp%2ESocketListener%2EhandleConnection%28SocketListener%2Ejava%3A244%29%0A%09at+org%2Emortbay%2Eutil%2EThreadedServer%2Ehandle%28ThreadedServer%2Ejava%3A357%29%0A%09at+org%2Emortbay%2Eutil%2EThreadPool%24PoolThread%2Erun%28ThreadPool%2Ejava%3A534%29%0A

        at
org.apache.hadoop.hdfs.server.namenode.TransferFsImage.getFileClient(TransferFsImage.java:154)
        at
org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.putFSImage(SecondaryNameNode.java:280)
        at
org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.doCheckpoint(SecondaryNameNode.java:320)
        at
org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.run(SecondaryNameNode.java:225)
        at java.lang.Thread.run(Thread.java:619)

http://namenode.ZZZZinc.com:50070/getimage?putimage=1&port=50090&machine=XX.16.YY.119&token=-18:47734869:1235266738838:1255564569000:1255488167530:
410 -
GetImage+failed.+java.io.IOException:+Namenode+has+an+edit+log+with+timestamp+of+2009-10-14+23:58:53+but+new+checkpoint+was+created+using+editlog++with+timestamp+2009-10-14+23:56:09.+Checkpoint+Aborted.

at+org.apache.hadoop.hdfs.server.namenode.FSImage.validateCheckpointUpload(FSImage.java:1362)

at+org.apache.hadoop.hdfs.server.namenode.GetImageServlet.doGet(GetImageServlet.java:57)
    at+javax.servlet.http.HttpServlet.service(HttpServlet.java:689)
    at+javax.servlet.http.HttpServlet.service(HttpServlet.java:802)

at+org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:427)

at+org.mortbay.jetty.servlet.WebApplicationHandler.dispatch(WebApplicationHandler.java:475)

at+org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:567)
    at+org.mortbay.http.HttpContext.handle(HttpContext.java:1565)

at+org.mortbay.jetty.servlet.WebApplicationContext.handle(WebApplicationContext.java:635)
    at+org.mortbay.http.HttpContext.handle(HttpContext.java:1517)
    at+org.mortbay.http.HttpServer.service(HttpServer.java:954)
    at+org.mortbay.http.HttpConnection.service(HttpConnection.java:814)
    at+org.mortbay.http.HttpConnection.handleNext(HttpConnection.java:981)
    at+org.mortbay.http.HttpConnection.handle(HttpConnection.java:831)

at+org.mortbay.http.SocketListener.handleConnection(SocketListener.java:244)
    at+org.mortbay.util.ThreadedServer.handle(ThreadedServer.java:357)
    at+org.mortbay.util.ThreadPool$PoolThread.run(ThreadPool.java:534)


On Tue, Oct 13, 2009 at 11:09 PM, Jason Venner <jason.hadoop@gmail.com>
wrote:

    I might attach to the secondary in the debugger so I can catch more
details tomorrow.


    On Tue, Oct 13, 2009 at 11:08 PM, Jason Venner <jason.hadoop@gmail.com>
wrote:

        That is the entire log message,
        The only way to get the actual error is via tcpdump and packet
reassembly

        I can't find a getImage log messages on the namenode, they would be
in the jetty error log, but I think that is set to /dev/null.




        On Tue, Oct 13, 2009 at 10:57 PM, Raghu Angadi <rangadi@apache.org>
wrote:


            It should not be very sensitive to drift. I think timestamp is
meant mostly as an increasing number. implementation might have a bug. will
check.

            Could paste a larger portion of the log?

            Raghu.

            On Tue, Oct 13, 2009 at 10:47 PM, Jason Venner <
jason.hadoop@gmail.com> wrote:

                The actual body of the error contains a statement to the
effect that the

                     Namenode has an edit log with timestamp of ..... but
new checkpoint was created using editlog with timestamp ....

                This is from FSImage.validateCheckpointUpload. The drift is
about the difft in time in the edit logs.

                THe unique situation in my experience here is that one of
the dfs.name.dir directories is on an nfs mounted file system, and that
particular file has a mode time that is often 0.5+/-0.5 seconds different.

                Any simple fixes? Even with the best ntp, we are going to be
+/- a few msec and the test in validateCheckpointUpoad is exact equals
comparison.
                The other thing that confuses me is why, the time stamp is
differenet, the code from a reading, appears to always pick the same file
for modtime comparisons.
                --
                Pro Hadoop, a book to guide you from beginner to hadoop
mastery,
                http://www.amazon.com/dp/1430219424?tag=jewlerymall
                www.prohadoopbook.com a community for Hadoop Professionals





        --
        Pro Hadoop, a book to guide you from beginner to hadoop mastery,
        http://www.amazon.com/dp/1430219424?tag=jewlerymall
        www.prohadoopbook.com a community for Hadoop Professionals




    --
    Pro Hadoop, a book to guide you from beginner to hadoop mastery,
    http://www.amazon.com/dp/1430219424?tag=jewlerymall
    www.prohadoopbook.com a community for Hadoop Professionals

Mime
View raw message