From: Adam Phelps <amp@opendns.com>
Date: Thu, 23 Jun 2011 16:40:42 -0700
To: CDH Users, user@hbase.apache.org
Subject: LoadIncrementalHFiles bug when regionserver fails to access file?

(As a note, this is with CDH3u0, which is based on HBase 0.90.1.)

We've been seeing intermittent failures of calls to LoadIncrementalHFiles. When this happens, the node that made the call sees a socket timeout followed by a FileNotFoundException, such as this:

2011-06-23 15:47:34.379566500 java.net.SocketTimeoutException: Call to s8.XXX/67.215.90.38:60020 failed on socket timeout exception: java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/67.215.90.51:51605 remote=s8.XXX/67.215.90.38:60020]
2011-06-23 15:47:34.379570500 java.io.FileNotFoundException: java.io.FileNotFoundException: File does not exist: /hfiles/2011/06/23/14/domainsranked/TopDomainsRank.r3v5PRvK/handling/3557032074765091256
2011-06-23 15:47:34.379573500         at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.openInfo(DFSClient.java:1602)
2011-06-23 15:47:34.379573500         at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.<init>(DFSClient.java:1593)

Over on the regionserver that was doing the load, we see that it attempted to validate the hfile and hit a 60-second timeout:

2011-06-23 15:45:54,634 INFO org.apache.hadoop.hbase.regionserver.Store: Validating hfile at hdfs://namenode.XXX/hfiles/2011/06/23/14/domainsranked/TopDomainsRank.r3v5PRvK/handling/3557032074765091256 for inclusion in store handling region domainsranked,368449:2011/05/03/23:category:ffffffff:com.zynga.static.fishville.facebook,1305890318961.d4925aca7852bed32613a509215d42b8.
...
2011-06-23 15:46:54,639 INFO org.apache.hadoop.hdfs.DFSClient: Failed to connect to /67.215.90.38:50010, add to deadNodes and continue
java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/67.215.90.38:42199 remote=/67.215.90.38:50010]
        at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128)
        at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:237)
        at java.io.DataInputStream.readShort(DataInputStream.java:295)

We suspect the timeout itself is a resource contention issue on our side. However, despite the failed validation, the regionserver proceeds to rename the file into the region directory:

2011-06-23 15:46:54,657 INFO org.apache.hadoop.hbase.regionserver.Store: Renaming bulk load file hdfs://namenode.XXX/hfiles/2011/06/23/14/domainsranked/TopDomainsRank.r3v5PRvK/handling/3557032074765091256 to hdfs://namenode.XXX:8020/hbase/domainsranked/d4925aca7852bed32613a509215d42b8/handling/3615917062821145533

LoadIncrementalHFiles then retries the load against the original path, which no longer exists:

2011-06-23 15:46:55,684 INFO org.apache.hadoop.hbase.regionserver.Store: Validating hfile at hdfs://namenode.XXX/hfiles/2011/06/23/14/domainsranked/TopDomainsRank.r3v5PRvK/handling/3557032074765091256 for inclusion in store handling region domainsranked,368449:2011/05/03/23:category:ffffffff:com.zynga.static.fishville.facebook,1305890318961.d4925aca7852bed32613a509215d42b8.
2011-06-23 15:46:55,685 DEBUG org.apache.hadoop.ipc.HBaseServer: IPC Server handler 147 on 60020, call bulkLoadHFile(hdfs://namenode.XXX/hfiles/2011/06/23/14/domainsranked/TopDomainsRank.r3v5PRvK/handling/3557032074765091256, [B@4224508b, [B@5e23f799) from 67.215.90.51:51856: error: java.io.FileNotFoundException: File does not exist: /hfiles/2011/06/23/14/domainsranked/TopDomainsRank.r3v5PRvK/handling/3557032074765091256

This eventually causes the load command to fail. It looks like a bug to me that HBase goes ahead with the rename despite the failure, and that LoadIncrementalHFiles then retries with the stale path. I figured I'd ask here before opening a ticket for this.

- Adam
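
P.S. While this is open, we're considering wrapping our loader's retry logic in a guard roughly like the sketch below (the class and method names are hypothetical, not HBase API; the FileSystem calls are the standard Hadoop API). The idea: if the source hfile has already vanished from its staging path, the regionserver has presumably renamed it away, so re-driving bulkLoadHFile with the same path can only reproduce the FileNotFoundException, and we'd rather fail fast.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical client-side guard around bulk-load retries; not part of
// the HBase API. It only checks whether a retry could plausibly succeed.
public class BulkLoadRetryGuard {

    private final Configuration conf;

    public BulkLoadRetryGuard(Configuration conf) {
        this.conf = conf;
    }

    // Returns true only if the hfile is still at its staging path, i.e. a
    // retry of the bulk load against that path has a chance of working.
    public boolean retryLooksSafe(Path stagingHfile) throws IOException {
        FileSystem fs = stagingHfile.getFileSystem(conf);
        // If the file is gone, the regionserver most likely already renamed
        // it into the region directory; retrying the old path would just
        // hit FileNotFoundException again, so bail out and investigate.
        return fs.exists(stagingHfile);
    }
}

Obviously the real fix would be on the server side (don't rename until the hfile has actually been validated and opened), but a check like this would at least turn the retry loop into an immediate, diagnosable failure.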