hbase-user mailing list archives

From: James Estes <james.es...@gmail.com>
Subject: Re: Missing region data.
Date: Mon, 09 Jan 2012 21:57:26 GMT
Should we file a ticket for this issue?  FWIW, we got this fixed (not
sure if we actually lost any data, though). We had to bounce the region
server non-gracefully. The region server seemed to be holding stale
file handles into HDFS: open input streams to files that had long
since been deleted in HDFS.  Any compaction, or anything else that
touched the region, would fail because it choked on the stale handles.
Even a graceful shutdown would get stuck on it.  Shutting it down hard
worked, presumably because on restart it re-opens the store files and
the stale handles go away.
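To make the stale-handle theory concrete, here is a minimal sketch
against the plain HDFS client API (the path and offset are made up,
not from our cluster): a stream opened before the file is deleted can
keep reading from cached block locations for a while, but once it has
to go back to the namenode for fresh locations the read blows up,
which looks a lot like the DFSClient warnings in the quoted logs below.

  import java.io.IOException;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FSDataInputStream;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class StaleHandleSketch {
    public static void main(String[] args) throws IOException {
      // Assumes core-site.xml/hdfs-site.xml for the cluster are on the classpath.
      Configuration conf = new Configuration();
      FileSystem fs = FileSystem.get(conf);

      // Hypothetical store file path -- stands in for one of the region's HFiles.
      Path hfile = new Path("/hbase/sometable/someregion/data/somestorefile");

      // The region server keeps long-lived streams like this one open.
      FSDataInputStream in = fs.open(hfile);

      // The file is removed out from under the open reader.
      fs.delete(hfile, false);

      // Reads served from already-cached block locations may still succeed, but
      // once the client needs fresh locations from the namenode it can fail
      // ("Could not obtain block ...", FileNotFoundException) -- the stale handles.
      byte[] buf = new byte[4096];
      in.seek(0L);
      in.read(buf);
      in.close();
    }
  }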

So, should we file a ticket for this?  I'm not sure how we got into
this state, but perhaps there could be some way for the code to
recover if it occurs.  We did try to repro by deleting a file straight
out of HDFS, but that didn't seem to trigger the issue (though we
tried the repro on cdh3u2, while the original problem was on cdh3u1).
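For the record, the repro attempt was essentially the following (the
table name and store-file path are placeholders, and it assumes the
0.90 HBaseAdmin client API): delete one of the region's HFiles
straight out of HDFS, then force a major compaction and watch the
region server log for the FileNotFoundException. On cdh3u2 this did
not reproduce the failure for us.

  import java.io.IOException;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.HBaseAdmin;

  public class DeletedHFileRepro {
    public static void main(String[] args) throws IOException, InterruptedException {
      Configuration conf = HBaseConfiguration.create();

      // Placeholder: point this at a real store file of the region under test.
      Path storeFile = new Path("/hbase/testtable/<region-hash>/data/<storefile-id>");

      // Delete the HFile directly out of HDFS, behind the region server's back.
      FileSystem fs = FileSystem.get(conf);
      fs.delete(storeFile, false);

      // Force a major compaction (a table name or region name both work here),
      // then watch the region server log for the "File does not exist" exception.
      HBaseAdmin admin = new HBaseAdmin(conf);
      admin.majorCompact("testtable");
    }
  }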

Thanks,
James

On Thu, Dec 22, 2011 at 2:34 PM, James Estes <james.estes@gmail.com> wrote:
> We have a 6-node 0.90.3-cdh3u1 cluster with 8092 regions.  I realize
> we have too many regions and too few nodes…we're addressing that.  We
> currently have an issue where we seem to have lost region data.  When
> data is requested from a couple of our regions, we get errors like the
> following on the client:
>
> org.apache.hadoop.hbase.client.RetriesExhaustedWithDetailsException:
> Failed 1 action: IOException: 1 time, servers with issues:
> node13host:60020
> …
> java.io.IOException: java.io.IOException: Could not seek
> StoreFileScanner[HFileScanner for reader
> reader=hdfs://namenodehost:54310/hbase/article/4cbc7c9264820a7b30ddd5755d77ab07/data/6810866521278698568,
> compression=none, inMemory=false,
> firstKey=95ac7c7894f86d4455885294582370e30a68fdf1/data:acquireDate/1321151006961/Put,
> lastKey=95b47d337ff72da0670d0f3803443dd3634681ec/data:text/1323129675986/Put,
> avgKeyLen=65, avgValueLen=24, entries=6753283, length=667536405,
> cur=null]
> …
> Caused by: java.io.FileNotFoundException: File does not exist:
> /hbase/article/4cbc7c9264820a7b30ddd5755d77ab07/data/6810866521278698568
>
> On node13host, we see similar exceptions:
>
> 2011-12-22 02:25:27,509 WARN org.apache.hadoop.hdfs.DFSClient: Failed
> to connect to /node13host:50010 for file
> /hbase/article/4cbc7c9264820a7b30ddd5755d77ab07/data/6810866521278698568
> for block -7065741853936038270:java.io.IOException: Got error in
> response to OP_READ_BLOCK self=/node13host:37847, remote=
> /node13host:50010 for file
> /hbase/article/4cbc7c9264820a7b30ddd5755d77ab07/data/6810866521278698568
> for block -7065741853936038270_15820239
>
> 2011-12-22 02:25:27,511 WARN org.apache.hadoop.hdfs.DFSClient: Failed
> to connect to /node08host:50010 for file
> /hbase/article/4cbc7c9264820a7b30ddd5755d77ab07/data/6810866521278698568
> for block -7065741853936038270:java.io.IOException: Got error in
> response to OP_READ_BLOCK self=/node13host:44290, remote=
> /node08host:50010 for file
> /hbase/article/4cbc7c9264820a7b30ddd5755d77ab07/data/6810866521278698568
> for block -7065741853936038270_15820239
>
> 2011-12-22 02:25:27,512 WARN org.apache.hadoop.hdfs.DFSClient: Failed
> to connect to /node10host:50010 for file
> /hbase/article/4cbc7c9264820a7b30ddd5755d77ab07/data/6810866521278698568
> for block -7065741853936038270:java.io.IOException: Got error in
> response to OP_READ_BLOCK self=/node13host:52113, remote=
> /node10host:50010 for file
> /hbase/article/4cbc7c9264820a7b30ddd5755d77ab07/data/6810866521278698568
> for block -7065741853936038270_15820239
>
> 2011-12-22 02:25:27,513 INFO org.apache.hadoop.hdfs.DFSClient: Could
> not obtain block blk_-7065741853936038270_15820239 from any node:
> java.io.IOException: No live nodes contain current block. Will get new
> block locations from namenode and retry...
> 2011-12-22 02:25:30,515 ERROR
> org.apache.hadoop.hbase.regionserver.HRegionServer:
> java.io.IOException: Could not seek StoreFileScanner[HFileScanner for
> reader reader=hdfs://namenodehost:54310/hbase/article/4cbc7c9264820a7b30ddd5755d77ab07/data/6810866521278698568,
> compression=none, inMemory=false,
> firstKey=95ac7c7894f86d4455885294582370e30a68fdf1/data:acquireDate/1321151006961/Put,
> lastKey=95b47d337ff72da0670d0f3803443dd3634681ec/data:text/1323129675986/Put,
> avgKeyLen=65, avgValueLen=24, entries=6753283, length=667536405,
> cur=null]
> …
> Caused by: java.io.FileNotFoundException: File does not exist:
> /hbase/article/4cbc7c9264820a7b30ddd5755d77ab07/data/6810866521278698568
>
>
> The file referenced is indeed not in HDFS.  Grepping further back in
> the logs reveals that the problem has been occurring for over a week
> (likely longer, but the logs have rolled off).  There are a lot of
> files in /hbase/article/4cbc7c9264820a7b30ddd5755d77ab07/data/ (270 of
> them).  Unsure why they weren't compacting, I looked further in the
> logs and found similar exceptions when trying to do a major
> compaction, ultimately failing because of:
> Caused by: java.io.FileNotFoundException: File does not exist:
> /hbase/article/4cbc7c9264820a7b30ddd5755d77ab07/data/6810866521278698568
>
> Any help on how to recover?  hbck did identify some inconsistencies
> and we went ahead with a -fix, but the issue remains (how we verified
> the state is sketched below this message).
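
For anyone who hits the same thing, this is roughly how we confirmed
the state described in the quoted message before restarting the region
server (the path is the one from the logs above; it assumes the
cluster's Hadoop config is on the classpath):

  import java.io.IOException;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileStatus;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class CheckRegionStoreFiles {
    public static void main(String[] args) throws IOException {
      Configuration conf = new Configuration();
      FileSystem fs = FileSystem.get(conf);

      // The region's store directory and the file named in the exceptions above.
      Path storeDir = new Path("/hbase/article/4cbc7c9264820a7b30ddd5755d77ab07/data");
      Path missing = new Path(storeDir, "6810866521278698568");

      // The file the region server keeps asking for really is gone from HDFS ...
      System.out.println("missing store file exists: " + fs.exists(missing));

      // ... while the store directory has piled up uncompacted files (~270 for us),
      // because every compaction dies on the same missing-file exception.
      FileStatus[] files = fs.listStatus(storeDir);
      System.out.println("store files in region: " + files.length);
    }
  }

If the file is gone and compactions keep failing on it, hbck -fix
alone did not clear it for us; the hard restart of the affected region
server is what finally reset the handles.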
