From: schubert zhang
To: hbase-user@hadoop.apache.org
Date: Wed, 25 Mar 2009 22:40:56 +0800
Subject: Re: Data lost during intensive writes

The following is what I sent to J-D in another email thread. I will check more of the logs from 3.24-25.

2009-03-23 10:07:57,465 INFO org.apache.hadoop.ipc.Server: IPC Server handler 2 on 9000, call addBlock(/hbase/log_10.24.1.18_1237686636736_60020/hlog.dat.1237774027436, DFSClient_629567488) from 10.24.1.18:59685: error: org.apache.hadoop.hdfs.server.namenode.NotReplicatedYetException: Not replicated yet:/hbase/log_10.24.1.18_1237686636736_60020/hlog.dat.1237774027436
org.apache.hadoop.hdfs.server.namenode.NotReplicatedYetException: Not replicated yet:/hbase/log_10.24.1.18_1237686636736_60020/hlog.dat.1237774027436
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(Unknown Source)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(Unknown Source)
    at sun.reflect.GeneratedMethodAccessor11.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.ipc.RPC$Server.call(Unknown Source)
    at org.apache.hadoop.ipc.Server$Handler.run(Unknown Source)
2009-03-23 10:07:57,552 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated:10.24.1.12:50010 is added to blk_8246919716767617786_109126 size 1048576
2009-03-23 10:07:57,552 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated:10.24.1.12:50010 is added to blk_8246919716767617786_109126 size 1048576
2009-03-23 10:07:57,554 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.allocateBlock: /hbase/log_10.24.1.16_1237686658208_60020/hlog.dat.1237774044443. blk_45871727940505900_109126
2009-03-23 10:07:57,688 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated:10.24.1.12:50010 is added to blk_2378060095065607252_109126 size 1048576
2009-03-23 10:07:57,688 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated:10.24.1.14:50010 is added to blk_2378060095065607252_109126 size 1048576
2009-03-23 10:07:57,689 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.allocateBlock: /hbase/log_10.24.1.14_1237686648061_60020/hlog.dat.1237774036841. blk_8448212226292209521_109126
2009-03-23 10:07:57,869 INFO org.apache.hadoop.ipc.Server: IPC Server handler 1 on 9000, call addBlock(/hbase/log_10.24.1.18_1237686636736_60020/hlog.dat.1237774027436, DFSClient_629567488) from 10.24.1.18:59685: error: org.apache.hadoop.hdfs.server.namenode.NotReplicatedYetException: Not replicated yet:/hbase/log_10.24.1.18_1237686636736_60020/hlog.dat.1237774027436
org.apache.hadoop.hdfs.server.namenode.NotReplicatedYetException: Not replicated yet:/hbase/log_10.24.1.18_1237686636736_60020/hlog.dat.1237774027436
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(Unknown Source)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(Unknown Source)
    at sun.reflect.GeneratedMethodAccessor11.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.ipc.RPC$Server.call(Unknown Source)
    at org.apache.hadoop.ipc.Server$Handler.run(Unknown Source)
2009-03-23 10:07:57,944 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated:10.24.1.18:50010 is added to blk_1270075611008480481_109121 size 1048576

I cannot find any useful information in the datanode logs at that point in time.
But I did find something else, for example:

2009-03-23 10:08:09,321 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(10.24.1.20:50010, storageID=DS-2136798339-10.24.1.20-50010-1237686444430, infoPort=50075, ipcPort=50020):Failed to transfer blk_-4099352067684877111_109151 to 10.24.1.18:50010 got java.net.SocketException: Original Exception : java.io.IOException: Connection reset by peer
    at sun.nio.ch.FileChannelImpl.transferTo0(Native Method)
    at sun.nio.ch.FileChannelImpl.transferToDirectly(FileChannelImpl.java:418)
    at sun.nio.ch.FileChannelImpl.transferTo(FileChannelImpl.java:519)
    at org.apache.hadoop.net.SocketOutputStream.transferToFully(Unknown Source)
    at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendChunks(Unknown Source)
    at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(Unknown Source)
    at org.apache.hadoop.hdfs.server.datanode.DataNode$DataTransfer.run(Unknown Source)
    at java.lang.Thread.run(Thread.java:619)
Caused by: java.io.IOException: Connection reset by peer
    ... 8 more

and:

2009-03-23 10:10:17,313 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(10.24.1.20:50010, storageID=DS-2136798339-10.24.1.20-50010-1237686444430, infoPort=50075, ipcPort=50020):DataXceiver
org.apache.hadoop.hdfs.server.datanode.BlockAlreadyExistsException: Block blk_-6347382571494739349_109326 is valid, and cannot be written to.
    at org.apache.hadoop.hdfs.server.datanode.FSDataset.writeToBlock(Unknown Source)
    at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.<init>(Unknown Source)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(Unknown Source)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(Unknown Source)
    at java.lang.Thread.run(Thread.java:619)

On Wed, Mar 25, 2009 at 9:36 PM, stack wrote:

> On Wed, Mar 25, 2009 at 2:01 AM, schubert zhang wrote:
>
> > But the two exceptions start to happen earlier.
>
> Which two exceptions Schubert?
>
> > hadoop-0.19
> > hbase-0.19.1 (with patch https://issues.apache.org/jira/browse/HBASE-1008).
> >
> > I want to try to set dfs.datanode.socket.write.timeout=0 and watch it
> > later.
>
> Later you ask, 'if set "dfs.datanode.socket.write.timeout=0", hadoop will
> always create new socket, is it ok?' I traced write.timeout and looks like
> it becomes the socket timeout -- no other special handling seems to be
> done. Perhaps I am missing something? To what are you referring?
>
> Thanks,
> St.Ack
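
For checking more of the logs, a small script can tally how often each exception class appears, which may help show which exception started first and on which node. This is only a minimal sketch: it assumes the log layout seen in the excerpts above, the helper name tally_exceptions is hypothetical, and the sample lines are abbreviated versions of the lines quoted in this message (file paths replaced by "...").

```python
import re
from collections import Counter

# Matches fully qualified Java exception class names, for example
# org.apache.hadoop.hdfs.server.namenode.NotReplicatedYetException.
EXCEPTION_RE = re.compile(r"\b((?:[A-Za-z_]\w*\.)+[A-Z]\w*Exception)\b")


def tally_exceptions(lines):
    """Count the log lines that mention each exception class (hypothetical helper)."""
    counts = Counter()
    for line in lines:
        # Use a set so a class repeated within one line is counted once.
        for name in set(EXCEPTION_RE.findall(line)):
            # Keep only the simple class name, without the package prefix.
            counts[name.rsplit(".", 1)[-1]] += 1
    return counts


# Abbreviated sample lines taken from the excerpts in this message.
sample = [
    "2009-03-23 10:07:57,465 INFO org.apache.hadoop.ipc.Server: call addBlock(...) "
    "error: org.apache.hadoop.hdfs.server.namenode.NotReplicatedYetException: Not replicated yet:...",
    "2009-03-23 10:07:57,869 INFO org.apache.hadoop.ipc.Server: call addBlock(...) "
    "error: org.apache.hadoop.hdfs.server.namenode.NotReplicatedYetException: Not replicated yet:...",
    "2009-03-23 10:08:09,321 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: "
    "Failed to transfer blk_-4099352067684877111_109151 to 10.24.1.18:50010 got "
    "java.net.SocketException: Original Exception : java.io.IOException: Connection reset by peer",
    "org.apache.hadoop.hdfs.server.datanode.BlockAlreadyExistsException: "
    "Block blk_-6347382571494739349_109326 is valid, and cannot be written to.",
]

if __name__ == "__main__":
    for name, count in tally_exceptions(sample).most_common():
        print(name, count)
```

To run it against a real log, pass the open file object instead of the sample list, since the function only iterates over lines.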