hbase-user mailing list archives

From Lars George <lars.geo...@gmail.com>
Subject Re: Constant error when putting large data into HBase
Date Thu, 01 Dec 2011 09:16:33 GMT
Hi Ed,

You need to be more precise, I am afraid. First of all, what does "some node always dies" mean?
Is the process gone? Which process is gone?
And the "error" you pasted is a WARN level log that *might* indicate some trouble, but is
*not* the reason the "node has died". Please elaborate.
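
If it helps to narrow things down, you can also ask the master which region servers it currently considers live or dead. Here is a minimal sketch, assuming the 0.90-era HBaseAdmin/ClusterStatus client API (the return types of these accessors have shifted between versions, so treat it as an outline rather than copy-and-paste code):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.ClusterStatus;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HBaseAdmin;

    public class ClusterCheck {
      public static void main(String[] args) throws Exception {
        // Picks up hbase-site.xml from the classpath.
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);
        ClusterStatus status = admin.getClusterStatus();
        // Live and dead region servers as the master sees them right now.
        System.out.println("live servers: " + status.getServers());
        System.out.println("dead servers: " + status.getDeadServers());
        System.out.println("dead server names: " + status.getDeadServerNames());
      }
    }

If a server shows up as dead here, its own region server log is the place to start digging; if everything is still listed as live, then the server process itself has probably not died and the task failures are a different story.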

Also consider posting the last few hundred lines of the process logs to pastebin so that someone
can have a look at them.

Thanks,
Lars


On Dec 1, 2011, at 9:48 AM, edward choi wrote:

> Hi,
> I've had a problem that has been killing me for some days now.
> I am using the CDH3 Update 2 version of Hadoop and HBase.
> When I do a large amount of bulk loading into HBase, some node always dies.
> It's not just one particular node, but each time one of the many nodes
> eventually fails to serve.
> 
> I set 4 GB of heap space for the master and the region servers. I monitored the
> processes, and when a node fails it has not used up all of its heap yet,
> so it is not a heap space problem.
> 
> Below is what I get when I perform a bulk put using MapReduce.
> 
> ------------------------------------------------------------------------
> 11/12/01 17:17:20 INFO mapred.JobClient:  map 100% reduce 100%
> 11/12/01 17:18:31 INFO mapred.JobClient: Task Id :
> attempt_201111302113_0034_r_000013_0, Status : FAILED
> org.apache.hadoop.hbase.client.RetriesExhaustedWithDetailsException: Failed
> 1 action: servers with issues: lp171.etri.re.kr:60020,
>        at
> org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.processBatch(HConnectionManager.java:1239)
>        at
> org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.processBatchOfPuts(HConnectionManager.java:1253)
>        at
> org.apache.hadoop.hbase.client.HTable.flushCommits(HTable.java:828)
>        at org.apache.hadoop.hbase.client.HTable.doPut(HTable.java:684)
>        at org.apache.hadoop.hbase.client.HTable.put(HTable.java:669)
>        at
> org.apache.hadoop.hbase.mapreduce.TableOutputFormat$TableRecordWriter.write(TableOutputFormat.java:127)
>        at
> org.apache.hadoop.hbase.mapreduce.TableOutputFormat$TableRecordWriter.write(TableOutputFormat.java:82)
>        at
> org.apache.hadoop.mapred.ReduceTask$NewTrackingRecordWriter.write(ReduceTask.java:514)
>        at
> org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80)
>        at etri.qa.mapreduce.PostProcess$PostPro
> attempt_201111302113_0034_r_000013_0: 20111122
> 11/12/01 17:18:36 INFO mapred.JobClient: Task Id :
> attempt_201111302113_0034_r_000013_1, Status : FAILED
> org.apache.hadoop.hbase.client.RetriesExhaustedWithDetailsException: Failed
> 1 action: servers with issues: lp171.etri.re.kr:60020,
>        at
> org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.processBatch(HConnectionManager.java:1239)
>        at
> org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.processBatchOfPuts(HConnectionManager.java:1253)
>        at
> org.apache.hadoop.hbase.client.HTable.flushCommits(HTable.java:828)
>        at org.apache.hadoop.hbase.client.HTable.doPut(HTable.java:684)
>        at org.apache.hadoop.hbase.client.HTable.put(HTable.java:669)
>        at
> org.apache.hadoop.hbase.mapreduce.TableOutputFormat$TableRecordWriter.write(TableOutputFormat.java:127)
>        at
> org.apache.hadoop.hbase.mapreduce.TableOutputFormat$TableRecordWriter.write(TableOutputFormat.java:82)
>        at
> org.apache.hadoop.mapred.ReduceTask$NewTrackingRecordWriter.write(ReduceTask.java:514)
>        at
> org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80)
>        at etri.qa.mapreduce.PostProcess$PostPro
> attempt_201111302113_0034_r_000013_1: 20111122
> 11/12/01 17:18:37 INFO mapred.JobClient:  map 100% reduce 95%
> 11/12/01 17:18:44 INFO mapred.JobClient:  map 100% reduce 96%
> 11/12/01 17:18:47 INFO mapred.JobClient:  map 100% reduce 98%
> 11/12/01 17:18:50 INFO mapred.JobClient:  map 100% reduce 99%
> 11/12/01 17:18:53 INFO mapred.JobClient:  map 100% reduce 100%
> 11/12/01 17:20:07 INFO mapred.JobClient: Task Id :
> attempt_201111302113_0034_r_000013_3, Status : FAILED
> org.apache.hadoop.hbase.client.RetriesExhaustedWithDetailsException: Failed
> 1 action: servers with issues: lp171.etri.re.kr:60020,
>        at
> org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.processBatch(HConnectionManager.java:1239)
>        at
> org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.processBatchOfPuts(HConnectionManager.java:1253)
>        at
> org.apache.hadoop.hbase.client.HTable.flushCommits(HTable.java:828)
>        at org.apache.hadoop.hbase.client.HTable.doPut(HTable.java:684)
>        at org.apache.hadoop.hbase.client.HTable.put(HTable.java:669)
>        at
> org.apache.hadoop.hbase.mapreduce.TableOutputFormat$TableRecordWriter.write(TableOutputFormat.java:127)
>        at
> org.apache.hadoop.hbase.mapreduce.TableOutputFormat$TableRecordWriter.write(TableOutputFormat.java:82)
>        at
> org.apache.hadoop.mapred.ReduceTask$NewTrackingRecordWriter.write(ReduceTask.java:514)
>        at
> org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80)
>        at etri.qa.mapreduce.PostProcess$PostPro
> attempt_201111302113_0034_r_000013_3: 20111122
> 11/12/01 17:20:09 INFO mapred.JobClient:  map 100% reduce 95%
> 11/12/01 17:20:09 INFO mapred.JobClient: Job complete: job_201111302113_0034
> 11/12/01 17:20:09 INFO mapred.JobClient: Counters: 23
> 11/12/01 17:20:09 INFO mapred.JobClient:   Job Counters
> 11/12/01 17:20:09 INFO mapred.JobClient:     Launched reduce tasks=34
> 11/12/01 17:20:09 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=4701750
> 11/12/01 17:20:09 INFO mapred.JobClient:     Total time spent by all
> reduces waiting after reserving slots (ms)=0
> 11/12/01 17:20:09 INFO mapred.JobClient:     Total time spent by all maps
> waiting after reserving slots (ms)=0
> 11/12/01 17:20:09 INFO mapred.JobClient:     Rack-local map tasks=54
> 11/12/01 17:20:09 INFO mapred.JobClient:     Launched map tasks=310
> 11/12/01 17:20:09 INFO mapred.JobClient:     Data-local map tasks=256
> 11/12/01 17:20:09 INFO mapred.JobClient:     Failed reduce tasks=1
> 11/12/01 17:20:09 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=3043956
> 11/12/01 17:20:09 INFO mapred.JobClient:   FileSystemCounters
> 11/12/01 17:20:09 INFO mapred.JobClient:     FILE_BYTES_READ=1120
> 11/12/01 17:20:09 INFO mapred.JobClient:     HDFS_BYTES_READ=36392
> 11/12/01 17:20:09 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=69076396
> 11/12/01 17:20:09 INFO mapred.JobClient:   Map-Reduce Framework
> 11/12/01 17:20:09 INFO mapred.JobClient:     Reduce input groups=0
> 11/12/01 17:20:09 INFO mapred.JobClient:     Combine output records=0
> 11/12/01 17:20:09 INFO mapred.JobClient:     Map input records=2187386
> 11/12/01 17:20:09 INFO mapred.JobClient:     Reduce shuffle bytes=289536
> 11/12/01 17:20:09 INFO mapred.JobClient:     Reduce output records=0
> 11/12/01 17:20:09 INFO mapred.JobClient:     Spilled Records=2187386
> 11/12/01 17:20:09 INFO mapred.JobClient:     Map output bytes=91870212
> 11/12/01 17:20:09 INFO mapred.JobClient:     Combine input records=0
> 11/12/01 17:20:09 INFO mapred.JobClient:     Map output records=2187386
> 11/12/01 17:20:09 INFO mapred.JobClient:     SPLIT_RAW_BYTES=36392
> 11/12/01 17:20:09 INFO mapred.JobClient:     Reduce input records=0
> ------------------------------------------------------------------------
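
The RetriesExhaustedWithDetailsException above means the client gave up on lp171.etri.re.kr after its configured number of retries; it does not by itself tell you why that server stopped answering. For reference, the retry behaviour is governed mainly by two client-side settings. A minimal sketch of where they live, assuming the 0.90 client and the property names/defaults from its hbase-default.xml (so please double-check against your CDH3 build):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;

    public class BulkPutConf {
      /** Client configuration with more generous retry settings for the MR job. */
      public static Configuration create() {
        Configuration conf = HBaseConfiguration.create();
        // How often the client retries a failed batch before throwing
        // RetriesExhaustedWithDetailsException (default 10 in 0.90, I believe).
        conf.setInt("hbase.client.retries.number", 20);
        // Pause between retries, in milliseconds (default 1000).
        conf.setLong("hbase.client.pause", 2000);
        return conf;
      }
    }

Raising these only buys the job more patience while a region server is busy or briefly unresponsive; it does not fix whatever made the server stop responding, so the server-side logs remain the interesting part.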
> 
> So I looked up the log of the node that failed on me.
> Below is the cause of the failure.
> ------------------------------------------------------------------------
> 2011-12-01 17:07:37,968 DEBUG
> org.apache.hadoop.hbase.io.hfile.LruBlockCache: Block cache LRU eviction
> completed; freed=79.73 MB, total=597.89 MB, single=372.17 MB, multi=298.71
> MB, memory=0 KB
> 2011-12-01 17:07:39,721 DEBUG
> org.apache.hadoop.hbase.io.hfile.LruBlockCache: Block cache LRU eviction
> started; Attempting to free 79.7 MB of total=677.2 MB
> 2011-12-01 17:07:39,737 DEBUG
> org.apache.hadoop.hbase.io.hfile.LruBlockCache: Block cache LRU eviction
> completed; freed=79.71 MB, total=598.06 MB, single=372.2 MB, multi=298.71
> MB, memory=0 KB
> 2011-12-01 17:07:41,414 DEBUG
> org.apache.hadoop.hbase.io.hfile.LruBlockCache: Block cache LRU eviction
> started; Attempting to free 79.71 MB of total=677.21 MB
> 2011-12-01 17:07:41,418 DEBUG
> org.apache.hadoop.hbase.io.hfile.LruBlockCache: Block cache LRU eviction
> completed; freed=79.72 MB, total=597.82 MB, single=372.16 MB, multi=298.71
> MB, memory=0 KB
> 2011-12-01 17:07:41,849 WARN org.apache.hadoop.ipc.HBaseServer: IPC Server
> listener on 60020: readAndProcess threw exception java.io.IOException:
> Connection reset by peer. Count of bytes read: 0
> java.io.IOException: Connection reset by peer
>        at sun.nio.ch.FileDispatcher.read0(Native Method)
>        at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21)
>        at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:202)
>        at sun.nio.ch.IOUtil.read(IOUtil.java:175)
>        at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:243)
>        at
> org.apache.hadoop.hbase.ipc.HBaseServer.channelRead(HBaseServer.java:1359)
>        at
> org.apache.hadoop.hbase.ipc.HBaseServer$Connection.readAndProcess(HBaseServer.java:900)
>        at
> org.apache.hadoop.hbase.ipc.HBaseServer$Listener.doRead(HBaseServer.java:522)
>        at
> org.apache.hadoop.hbase.ipc.HBaseServer$Listener$Reader.run(HBaseServer.java:316)
>        at
> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>        at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>        at java.lang.Thread.run(Thread.java:662)
> 2011-12-01 17:10:41,509 INFO
> org.apache.hadoop.hbase.regionserver.HRegionServer: Scanner
> 8383367466793958870 lease expired
> 2011-12-01 17:11:06,816 DEBUG
> org.apache.hadoop.hbase.io.hfile.LruBlockCache: LRU Stats: total=601.74 MB,
> free=194.94 MB, max=796.67 MB, blocks=9281, accesses=3415277, hits=106627,
> hitRatio=3.12%%, cachingAccesses=2734012, cachingHits=88768,
> cachingHitsRatio=3.24%%, evictions=2227, evicted=2635963,
> evictedPerRun=1183.6385498046875
> 2011-12-01 17:13:33,457 DEBUG
> org.apache.hadoop.hbase.io.hfile.LruBlockCache: Block cache LRU eviction
> started; Attempting to free 79.72 MB of total=677.22 MB
> 2011-12-01 17:13:33,461 DEBUG
> org.apache.hadoop.hbase.io.hfile.LruBlockCache: Block cache LRU eviction
> completed; freed=79.75 MB, total=598.11 MB, single=372.35 MB, multi=298.71
> MB, memory=0 KB
> 2011-12-01 17:13:34,169 DEBUG
> org.apache.hadoop.hbase.io.hfile.LruBlockCache: Block cache LRU eviction
> started; Attempting to free 79.71 MB of total=677.22 MB
> ------------------------------------------------------------------------
> 
> I searched Google, and the most prominent answer I found was that the number
> of ZooKeeper connections allowed is too small. But since I am using HBase
> 0.90.4, I thought that was not my case.
> Could anyone please enlighten me on this problem?
> 
> Regards,
> Ed
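
Regarding the ZooKeeper theory: as far as I remember, the later 0.90.x releases did raise the default connection limit considerably, but the effective value is whatever your hbase-site.xml says. A quick way to see what a client classpath resolves it to (the property name is real, the class name is just for illustration):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;

    public class ZkCnxnsCheck {
      public static void main(String[] args) {
        Configuration conf = HBaseConfiguration.create();
        // Per-host connection limit for the HBase-managed ZooKeeper quorum.
        // This only shows what the local hbase-site.xml/defaults resolve to,
        // which may differ from what the servers are actually running with.
        System.out.println("hbase.zookeeper.property.maxClientCnxns = "
            + conf.get("hbase.zookeeper.property.maxClientCnxns"));
      }
    }

If that limit were being hit, ZooKeeper itself would be logging refused connections, which is not what the region server log above shows, so that theory does look unlikely here.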

