hbase-user mailing list archives

From Rod Cope <rod.c...@openlogic.com>
Subject Re: Failures after a few hours of heavy load
Date Sat, 20 Feb 2010 21:05:17 GMT
Just did this on every box and got pretty much the same answers on each:

[hadoop@dd08 logs]$ ulimit -n; /usr/sbin/lsof | wc -l
32768
3589
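
For anyone following along, a loop like the following would collect the same two numbers from all nine boxes in one pass. This is just a sketch: it assumes passwordless ssh as the hadoop user and guesses at a dd01..dd09 hostname pattern (only dd01 and dd08 actually appear in the logs).

```shell
#!/bin/sh
# Sketch only: assumes passwordless ssh as the hadoop user and that the
# nine cluster boxes are named dd01..dd09 (the hostname pattern is a guess).
for i in $(seq 1 9); do
  host="dd0${i}"
  echo "--- ${host} ---"
  # Print the per-process open-file limit, then the current open-file count
  ssh "hadoop@${host}" 'ulimit -n; /usr/sbin/lsof | wc -l'
done
```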

Any other thoughts?

Rod

On Saturday, February 20, 2010 at 1:58 PM, "Dan Washusen"
<dan@reactive.org> wrote:

> Sorry I can't be more helpful but just to double check it's not a file
> limits issue could you run the following on each of the hosts:
> 
> $ ulimit -a
> $ lsof | wc -l
> 
> The first command will show you (among other things) the file limits; it
> should be above the default 1024.  The second will tell you how many files
> are currently open...
> 
> Cheers,
> Dan
> 
> On 21 February 2010 03:14, Rod Cope <rod.cope@openlogic.com> wrote:
> 
>> I've been loading some large data sets over the last week or so, but keep
>> running into failures between 4 and 15 hours into the process.  I've wiped
>> HBase and/or HDFS a few times hoping that would help, but it hasn't.
>> 
>> I've implemented all the recommendations for increasing file limits and the
>> like on the troubleshooting wiki page.  There's plenty of free disk space
>> and memory with no swap being used on any of the 9 machines in the cluster.
>> All 9 boxes run a managed ZK, regionserver, datanode, and MR jobs loading
>> data from HDFS and NFS-mounted disk into HBase.  Doing a zk_dump shows an
>> average of 1 for all machines with the highest max being 621.  The
>> regionserver having trouble varies from load to load, so the problem
>> doesn't appear to be machine-specific.
>> 
>> You can see in the logs below that a compaction is started which leads to a
>> LeaseExpiredException: File does not exist (I've done a hadoop -get and
>> it's really not there).  Then an Error Recovery for a block,
>> compaction/split fail, "Premeture EOF from inputStream", "No live nodes
>> contain current block", and finally "Cannot open filename".  At this
>> point, there's a meltdown where the vast majority of the rest of the log
>> is filled with exceptions like these back to back.  The regionserver
>> doesn't go down, however.
>> 
>> I'm on the released HBase 0.20.3 with Hadoop 0.20.2 as of yesterday (RC4).
>> I upgraded Hadoop from 0.20.1 hoping that would help some of the problems
>> I've been having, but it only seemed to change the details of the
>> exceptions and not the results.  Once I upgraded to Hadoop 0.20.2, I
>> replaced HBase's hadoop-0.20.1-hdfs127-core.jar in lib with the new
>> hadoop-0.20.2-core.jar.
>> 
>> Any ideas?  I'm really under the gun to get this data loaded, so any
>> workarounds or other recommendations are much appreciated.
>> 
>> Thanks,
>> Rod
>> 
>> ----
>> 
>> Here's a link to the logs below in case they're not easy to read:
>> http://pastebin.com/d7907bca
>> 
>> 
>> 2010-02-19 21:59:24,950 DEBUG
>> org.apache.hadoop.hbase.regionserver.CompactSplitThread: Compaction
>> requested for region
>> files,nerdpass\x7Chttp://nerdpass.googlecode.com/svn/trunk
>> \x7Csrc/svn/n/ne/n
>> 
>> erdpass/application/library/Zend/Server/Reflection/Method.php,1266641963606/
>> 25429292 because: Region has references on open
>> 2010-02-19 21:59:24,950 INFO org.apache.hadoop.hbase.regionserver.HRegion:
>> Starting compaction on region
>> files,nerdpass\x7Chttp://nerdpass.googlecode.com/svn/trunk
>> \x7Csrc/svn/n/ne/n
>> erdpass/application/library/Zend/Server/Reflection/Method.php,1266641963606
>> 2010-02-19 21:59:24,953 DEBUG org.apache.hadoop.hbase.regionserver.Store:
>> Started compaction of 4 file(s), hasReferences=true, into
>> /hbase/files/compaction.dir/25429292, seqid=2811972
>> 2010-02-19 21:59:27,992 WARN org.apache.hadoop.hdfs.DFSClient: DataStreamer
>> Exception: org.apache.hadoop.ipc.RemoteException:
>> org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: No lease on
>> /hbase/files/compaction.dir/25429292/2021896477663224037 File does not
>> exist. [Lease.  Holder: DFSClient_-1386101021, pendingcreates: 1]
>>        at
>> 
>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.
>> java:1332)
>>      (...rest of stack trace...)
>> 2010-02-19 21:59:27,992 WARN org.apache.hadoop.hdfs.DFSClient: Error
>> Recovery for block blk_2006633705539782284_253567 bad datanode[0] nodes ==
>> null
>> 2010-02-19 21:59:27,992 WARN org.apache.hadoop.hdfs.DFSClient: Could not
>> get
>> block locations. Source file
>> "/hbase/files/compaction.dir/25429292/2021896477663224037" - Aborting...
>> 2010-02-19 21:59:27,997 ERROR
>> org.apache.hadoop.hbase.regionserver.CompactSplitThread: Compaction/Split
>> failed for region
>> files,nerdpass\x7Chttp://nerdpass.googlecode.com/svn/trunk
>> \x7Csrc/svn/n/ne/n
>> erdpass/application/library/Zend/Server/Reflection/Method.php,1266641963606
>> org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException:
>> org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: No lease on
>> /hbase/files/compaction.dir/25429292/2021896477663224037 File does not
>> exist. [Lease.  Holder: DFSClient_-1386101021, pendingcreates: 1]
>>        at
>> 
>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.
>> java:1332)
>>      (...rest of stack trace...)
>> 2010-02-19 22:00:23,627 DEBUG
>> org.apache.hadoop.hbase.io.hfile.LruBlockCache: Cache Stats: Sizes:
>> Total=624.38275MB (654712760), Free=172.29224MB (180661512), Max=796.675MB
>> (835374272), Counts: Blocks=9977, Access=3726192, Hit=2782447, Miss=943745,
>> Evictions=67, Evicted=85131, Ratios: Hit Ratio=74.67266917228699%, Miss
>> Ratio=25.327330827713013%, Evicted/Run=1270.6119384765625
>> 2010-02-19 22:00:41,978 INFO org.apache.hadoop.hdfs.DFSClient: Could not
>> obtain block blk_-5162944092610390422_253522 from any node:
>> java.io.IOException: No live nodes contain current block
>> 2010-02-19 22:00:44,990 INFO org.apache.hadoop.hdfs.DFSClient: Could not
>> obtain block blk_-5162944092610390422_253522 from any node:
>> java.io.IOException: No live nodes contain current block
>> 2010-02-19 22:00:47,994 WARN org.apache.hadoop.hdfs.DFSClient: DFS Read:
>> java.io.IOException: Cannot open filename
>> /hbase/files/929080390/metadata/6217150884710004337
>>        at
>> 
>> org.apache.hadoop.hdfs.DFSClient$DFSInputStream.openInfo(DFSClient.java:1497
>> )
>>      (...rest of stack trace...)
>> 2010-02-19 22:00:47,994 ERROR
>> org.apache.hadoop.hbase.regionserver.HRegionServer: java.io.IOException:
>> Premeture EOF from inputStream
>>        at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:102)
>>      (...rest of stack trace...)
>> 2010-02-19 22:00:47,995 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server
>> handler 76 on 60020, call get([B@3a73f53,
>> row=netbeans|
>> https://olex.openlogic.com/packages/netbeans|src/archive/n/ne/n
>> etbeans/5.0/netbeans-5.0-src/apisupport/l10n.list, maxVersions=1,
>> timeRange=[0,9223372036854775807), families={(family=metadata,
>> columns={updated_at}}) from 192.168.60.106:45445: error:
>> java.io.IOException: Premeture EOF from inputStream
>> java.io.IOException: Premeture EOF from inputStream
>>      (...rest of stack trace...)
>> 2010-02-19 22:00:49,009 INFO org.apache.hadoop.hdfs.DFSClient: Could not
>> obtain block blk_-5162944092610390422_253522 from any node:
>> java.io.IOException: No live nodes contain current block
>> 2010-02-19 22:00:52,019 INFO org.apache.hadoop.hdfs.DFSClient: Could not
>> obtain block blk_-5162944092610390422_253522 from any node:
>> java.io.IOException: No live nodes contain current block
>> 2010-02-19 22:00:54,514 DEBUG org.apache.hadoop.hbase.regionserver.HRegion:
>> Flush requested on
>> files,python\x7Chttps://olex.openlogic.com/packages/python
>> \x7Csrc/archive/p/
>> py/python/2.4.6/python-2.4.6-src/Modules/_csv.c,1266641716429
>> 2010-02-19 22:00:54,520 DEBUG org.apache.hadoop.hbase.regionserver.HRegion:
>> Started memstore flush for region
>> files,python\x7Chttps://olex.openlogic.com/packages/python
>> \x7Csrc/archive/p/
>> py/python/2.4.6/python-2.4.6-src/Modules/_csv.c,1266641716429. Current
>> region memstore size 64.1m
>> 2010-02-19 22:00:54,911 DEBUG org.apache.hadoop.hbase.regionserver.Store:
>> Added hdfs://dd01:54310/hbase/files/1086732894/content/9096973985255757264,
>> entries=4486, sequenceid=2812095, memsize=29.5m, filesize=10.8m to
>> files,python\x7Chttps://olex.openlogic.com/packages/python
>> \x7Csrc/archive/p/
>> py/python/2.4.6/python-2.4.6-src/Modules/_csv.c,1266641716429
>> 2010-02-19 22:00:54,987 DEBUG org.apache.hadoop.hbase.regionserver.Store:
>> Added
>> hdfs://dd01:54310/hbase/files/1086732894/metadata/3183633054937023200,
>> entries=28453, sequenceid=2812095, memsize=8.2m, filesize=638.5k to
>> files,python\x7Chttps://olex.openlogic.com/packages/python
>> \x7Csrc/archive/p/
>> py/python/2.4.6/python-2.4.6-src/Modules/_csv.c,1266641716429
>> 2010-02-19 22:00:55,022 WARN org.apache.hadoop.hdfs.DFSClient: DFS Read:
>> java.io.IOException: Cannot open filename
>> /hbase/files/929080390/metadata/6217150884710004337
>>        at
>> 
>> org.apache.hadoop.hdfs.DFSClient$DFSInputStream.openInfo(DFSClient.java:1497
>> )
>>      (...rest of stack trace...)
>> 


-- 

Rod Cope | CTO and Founder
rod.cope@openlogic.com
Follow me on Twitter @RodCope

720 240 4501    |  phone
720 240 4557    |  fax
1 888 OpenLogic    |  toll free
www.openlogic.com 
Follow OpenLogic on Twitter @openlogic
