From: Adam Phelps <amp@opendns.com>
Date: Thu, 23 Jun 2011 16:40:42 -0700
To: CDH Users, user@hbase.apache.org
Subject: LoadIncrementalHFiles bug when regionserver fails to access file?

(As a note, this is with CDH3u0, which is based on HBase 0.90.1.)

We've been seeing intermittent failures of calls to LoadIncrementalHFiles. When this happens, the node that made the call sees a socket timeout followed by a FileNotFoundException, such as this:

2011-06-23 15:47:34.379566500 java.net.SocketTimeoutException: Call to s8.XXX/67.215.90.38:60020 failed on socket timeout exception: java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/67.215.90.51:51605 remote=s8.XXX/67.215.90.38:60020]
2011-06-23 15:47:34.379570500 java.io.FileNotFoundException: java.io.FileNotFoundException: File does not exist: /hfiles/2011/06/23/14/domainsranked/TopDomainsRank.r3v5PRvK/handling/3557032074765091256
2011-06-23 15:47:34.379573500         at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.openInfo(DFSClient.java:1602)
2011-06-23 15:47:34.379573500         at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.<init>(DFSClient.java:1593)

Over on the regionserver that was doing the load, we see that it attempted to validate the hfile and hit a 60-second timeout:

2011-06-23 15:45:54,634 INFO org.apache.hadoop.hbase.regionserver.Store: Validating hfile at hdfs://namenode.XXX/hfiles/2011/06/23/14/domainsranked/TopDomainsRank.r3v5PRvK/handling/3557032074765091256 for inclusion in store handling region domainsranked,368449:2011/05/03/23:category:ffffffff:com.zynga.static.fishville.facebook,1305890318961.d4925aca7852bed32613a509215d42b8.
...
2011-06-23 15:46:54,639 INFO org.apache.hadoop.hdfs.DFSClient: Failed to connect to /67.215.90.38:50010, add to deadNodes and continue
java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/67.215.90.38:42199 remote=/67.215.90.38:50010]
        at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128)
        at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:237)
        at java.io.DataInputStream.readShort(DataInputStream.java:295)

We suspect the timeout itself is a resource contention issue on our side. However, despite the failed validation, the regionserver proceeds to rename the file into the region directory:

2011-06-23 15:46:54,657 INFO org.apache.hadoop.hbase.regionserver.Store: Renaming bulk load file hdfs://namenode.XXX/hfiles/2011/06/23/14/domainsranked/TopDomainsRank.r3v5PRvK/handling/3557032074765091256 to hdfs://namenode.XXX:8020/hbase/domainsranked/d4925aca7852bed32613a509215d42b8/handling/3615917062821145533

LoadIncrementalHFiles then retries the load against the original path, which no longer exists:

2011-06-23 15:46:55,684 INFO org.apache.hadoop.hbase.regionserver.Store: Validating hfile at hdfs://namenode.XXX/hfiles/2011/06/23/14/domainsranked/TopDomainsRank.r3v5PRvK/handling/3557032074765091256 for inclusion in store handling region domainsranked,368449:2011/05/03/23:category:ffffffff:com.zynga.static.fishville.facebook,1305890318961.d4925aca7852bed32613a509215d42b8.
2011-06-23 15:46:55,685 DEBUG org.apache.hadoop.ipc.HBaseServer: IPC Server handler 147 on 60020, call bulkLoadHFile(hdfs://namenode.XXX/hfiles/2011/06/23/14/domainsranked/TopDomainsRank.r3v5PRvK/handling/3557032074765091256, [B@4224508b, [B@5e23f799) from 67.215.90.51:51856: error: java.io.FileNotFoundException: File does not exist: /hfiles/2011/06/23/14/domainsranked/TopDomainsRank.r3v5PRvK/handling/3557032074765091256

This eventually causes the load command to fail. It looks like a bug to me that HBase goes ahead with the rename despite the failure, and that LoadIncrementalHFiles then retries with the stale path. I figured I'd ask here before opening a ticket for this.

- Adam
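
P.S. While this is open, we're considering wrapping our loader's retry logic in a guard roughly like the sketch below (the class and method names are hypothetical, not HBase API; the FileSystem calls are the standard Hadoop API). The idea: if the source hfile has already vanished from its staging path, the regionserver has presumably renamed it away, so re-driving bulkLoadHFile with the same path can only reproduce the FileNotFoundException, and we'd rather fail fast.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical client-side guard around bulk-load retries; not part of
// the HBase API. It only checks whether a retry could plausibly succeed.
public class BulkLoadRetryGuard {

    private final Configuration conf;

    public BulkLoadRetryGuard(Configuration conf) {
        this.conf = conf;
    }

    // Returns true only if the hfile is still at its staging path, i.e. a
    // retry of the bulk load against that path has a chance of working.
    public boolean retryLooksSafe(Path stagingHfile) throws IOException {
        FileSystem fs = stagingHfile.getFileSystem(conf);
        // If the file is gone, the regionserver most likely already renamed
        // it into the region directory; retrying the old path would just
        // hit FileNotFoundException again, so bail out and investigate.
        return fs.exists(stagingHfile);
    }
}

Obviously the real fix would be on the server side (don't rename until the hfile has actually been validated and opened), but a check like this would at least turn the retry loop into an immediate, diagnosable failure.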