hbase-user mailing list archives

From Lars George <lars.geo...@gmail.com>
Subject Re: Constant error when putting large data into HBase
Date Thu, 01 Dec 2011 13:38:04 GMT
Hi Ed,

Without having looked at the logs, this sounds like the common case of overloading a single
region due to your sequential row keys. Either hash the keys, or salt them - but the best
bet here is to use the bulk loading feature of HBase (http://hbase.apache.org/bulk-loads.html).
That bypasses this problem and lets you continue to use sequential keys.
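
To make the salting idea concrete, here is a minimal sketch (the bucket count and helper class are made up for illustration, not an HBase API):

  import org.apache.hadoop.hbase.client.Put;
  import org.apache.hadoop.hbase.util.Bytes;

  // Sketch: prefix a salt derived from a hash of the natural key so that
  // sequential keys like "20111122" spread across several regions.
  public class SaltedKeyExample {
    static final int NUM_BUCKETS = 16;  // assumption: pick something near your region count

    static byte[] saltedRow(String naturalKey) {
      int bucket = (naturalKey.hashCode() & Integer.MAX_VALUE) % NUM_BUCKETS;
      return Bytes.toBytes(String.format("%02d-%s", bucket, naturalKey));
    }

    public static void main(String[] args) {
      // row key becomes "<bucket>-20111122"; reads for one date must fan out over all buckets
      Put put = new Put(saltedRow("20111122"));
      System.out.println(Bytes.toString(put.getRow()));
    }
  }

The trade-off is that every read for a given date then has to check all NUM_BUCKETS prefixes, which is why bulk loading is usually the nicer option if you want to keep sequential keys.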

Lars


On Dec 1, 2011, at 12:21 PM, edward choi wrote:

> Hi Lars,
> 
> Okay here goes some details.
> There are 21 tasktrackers/datanodes/regionservers,
> one jobtracker/namenode/master,
> and three ZooKeeper nodes.
> 
> There are about 200 million tweets in HBase.
> My MapReduce job aggregates tweets by the date they were generated.
> In the map stage, I write out the tweet date as the key and the document id as
> the value (the document id is generated by a hash algorithm).
> In the reduce stage, I put the data into a table: the key (which is the
> tweet date) becomes the table rowid, and the values (which are document ids) become the
> column values.
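> 
> (Roughly, the reduce side looks like the sketch below; the class, column family,
> and qualifier names are just illustrative, not my exact code.)
> 
>   import java.io.IOException;
>   import org.apache.hadoop.hbase.client.Put;
>   import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
>   import org.apache.hadoop.hbase.mapreduce.TableReducer;
>   import org.apache.hadoop.hbase.util.Bytes;
>   import org.apache.hadoop.io.Text;
> 
>   public class TweetDateReducer extends TableReducer<Text, Text, ImmutableBytesWritable> {
>     @Override
>     protected void reduce(Text date, Iterable<Text> docIds, Context context)
>         throws IOException, InterruptedException {
>       byte[] row = Bytes.toBytes(date.toString());    // rowid = tweet date, e.g. "20111122"
>       Put put = new Put(row);
>       StringBuilder batch = new StringBuilder();
>       int count = 0, column = 0;
>       for (Text docId : docIds) {
>         batch.append(docId.toString());               // 32-byte document ids
>         if (++count % 1000 == 0) {                    // about 1000 ids per column value
>           put.add(Bytes.toBytes("d"), Bytes.toBytes("ids" + column++),
>                   Bytes.toBytes(batch.toString()));
>           batch.setLength(0);
>         }
>       }
>       if (batch.length() > 0) {
>         put.add(Bytes.toBytes("d"), Bytes.toBytes("ids" + column),
>                 Bytes.toBytes(batch.toString()));
>       }
>       // TableOutputFormat turns this into HTable.put(), which is where the
>       // RetriesExhaustedWithDetailsException surfaces.
>       context.write(new ImmutableBytesWritable(row), put);
>     }
>   }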
> 
> Now, the map stage is fine; I get to 100% map. But during the reduce stage, one of
> my regionservers fails.
> I don't know what the exact symptom is. I just get:
> 
>   org.apache.hadoop.hbase.client.RetriesExhaustedWithDetailsException: Failed
>   1 action: servers with issues: lp171.etri.re.kr:60020,
> 
> About "some node always die" <== scratch this.
> 
> To be precise,
> I narrowed down the range of data that I wanted to process.
> I tried to put only the tweets that were generated on 2011/11/22.
> Now the reduce code will produce a row with "20111122" as the rowid, and a
> bunch of document ids as the column values. (I use a 32-byte string as the
> document id, and I append 1000 document ids to a single column value.)
> So the region that my data will be inserted into is the one with "20111122" between
> its start key and end key.
> The regionserver that holds that specific region is the one that fails. That is the
> point: if I move that region to another regionserver using the hbase shell (sketched
> below), then that regionserver fails instead, with the same log output.
> After 4 failed attempts, the job is killed and the put operation is not
> completed.
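> 
> (For reference, this is roughly the shell command I used to move the region; the
> encoded region name and server name below are placeholders, not my real values,
> and the exact arguments are from memory.)
> 
>   hbase> move 'ENCODED_REGION_NAME', 'DEST_HOST,60020,STARTCODE'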
> 
> Now, even with the failure, the regionserver is still online. It is not
> dead (sorry for my earlier use of the word 'die').
> 
> I have pasted the JobTracker log, the TaskTracker log (from the one that failed),
> and the RegionServer log (from the one that failed) on Pastebin.
> The job started at 2011-12-01 17:14:43 and was killed at 2011-12-01
> 17:20:07.
> 
> JobTracker log: http://pastebin.com/n6sp8Fyi
> 
> TaskTracker log: http://pastebin.com/RMFc41D5
> 
> RegionServer log: http://pastebin.com/UpKF8HwN
> 
> And finally, in the logs I pasted I also see lines at DEBUG
> and INFO, so I thought those were okay.
> Is there a way to change the WARN-level logging to some other level? If you
> let me know, I will paste another set of logs.
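> 
> (I am guessing the level is controlled in conf/log4j.properties with a line along
> these lines, but please correct me if the logger name is wrong:)
> 
>   # raise or lower HBase logging; DEBUG/INFO/WARN/ERROR are the usual levels
>   log4j.logger.org.apache.hadoop.hbase=DEBUG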
> 
> Thanks,
> Ed
> 
> 2011/12/1 Lars George <lars.george@gmail.com>
> 
>> Hi Ed,
>> 
>> You need to be more precise, I am afraid. First of all, what does "some node
>> always dies" mean? Is the process gone? Which process is gone?
>> And the "error" you pasted is a WARN-level log message that *might* indicate some
>> trouble, but is *not* the reason the "node has died". Please elaborate.
>> 
>> Also consider posting the last few hundred lines of the process logs to
>> pastebin so that someone can look at them.
>> 
>> Thanks,
>> Lars
>> 
>> 
>> On Dec 1, 2011, at 9:48 AM, edward choi wrote:
>> 
>>> Hi,
>>> I've had a problem that has been killing me for some days now.
>>> I am using the CDH3 update 2 versions of Hadoop and HBase.
>>> When I do a large amount of bulk loading into HBase, some node always
>>> dies.
>>> It's not just one particular node,
>>> but one of the many nodes eventually fails to serve.
>>> 
>>> I set 4 GB of heap space for the master and the regionservers. I monitored the
>>> processes, and when a node fails it has not used up its heap yet,
>>> so it is not a heap space problem.
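>>> 
>>> (For reference, I believe the equivalent setting in conf/hbase-env.sh is
>>> something like the line below.)
>>> 
>>>   export HBASE_HEAPSIZE=4096   # in MB; picked up by master and regionservers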
>>> 
>>> Below is what I get when I perform a bulk put using MapReduce.
>>> 
>>> 
>>> ------------------------------------------------------------------------
>>> 11/12/01 17:17:20 INFO mapred.JobClient:  map 100% reduce 100%
>>> 11/12/01 17:18:31 INFO mapred.JobClient: Task Id : attempt_201111302113_0034_r_000013_0, Status : FAILED
>>> org.apache.hadoop.hbase.client.RetriesExhaustedWithDetailsException: Failed 1 action: servers with issues: lp171.etri.re.kr:60020,
>>>        at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.processBatch(HConnectionManager.java:1239)
>>>        at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.processBatchOfPuts(HConnectionManager.java:1253)
>>>        at org.apache.hadoop.hbase.client.HTable.flushCommits(HTable.java:828)
>>>        at org.apache.hadoop.hbase.client.HTable.doPut(HTable.java:684)
>>>        at org.apache.hadoop.hbase.client.HTable.put(HTable.java:669)
>>>        at org.apache.hadoop.hbase.mapreduce.TableOutputFormat$TableRecordWriter.write(TableOutputFormat.java:127)
>>>        at org.apache.hadoop.hbase.mapreduce.TableOutputFormat$TableRecordWriter.write(TableOutputFormat.java:82)
>>>        at org.apache.hadoop.mapred.ReduceTask$NewTrackingRecordWriter.write(ReduceTask.java:514)
>>>        at org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80)
>>>        at etri.qa.mapreduce.PostProcess$PostPro
>>> attempt_201111302113_0034_r_000013_0: 20111122
>>> 11/12/01 17:18:36 INFO mapred.JobClient: Task Id : attempt_201111302113_0034_r_000013_1, Status : FAILED
>>> org.apache.hadoop.hbase.client.RetriesExhaustedWithDetailsException: Failed 1 action: servers with issues: lp171.etri.re.kr:60020,
>>>        at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.processBatch(HConnectionManager.java:1239)
>>>        at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.processBatchOfPuts(HConnectionManager.java:1253)
>>>        at org.apache.hadoop.hbase.client.HTable.flushCommits(HTable.java:828)
>>>        at org.apache.hadoop.hbase.client.HTable.doPut(HTable.java:684)
>>>        at org.apache.hadoop.hbase.client.HTable.put(HTable.java:669)
>>>        at org.apache.hadoop.hbase.mapreduce.TableOutputFormat$TableRecordWriter.write(TableOutputFormat.java:127)
>>>        at org.apache.hadoop.hbase.mapreduce.TableOutputFormat$TableRecordWriter.write(TableOutputFormat.java:82)
>>>        at org.apache.hadoop.mapred.ReduceTask$NewTrackingRecordWriter.write(ReduceTask.java:514)
>>>        at org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80)
>>>        at etri.qa.mapreduce.PostProcess$PostPro
>>> attempt_201111302113_0034_r_000013_1: 20111122
>>> 11/12/01 17:18:37 INFO mapred.JobClient:  map 100% reduce 95%
>>> 11/12/01 17:18:44 INFO mapred.JobClient:  map 100% reduce 96%
>>> 11/12/01 17:18:47 INFO mapred.JobClient:  map 100% reduce 98%
>>> 11/12/01 17:18:50 INFO mapred.JobClient:  map 100% reduce 99%
>>> 11/12/01 17:18:53 INFO mapred.JobClient:  map 100% reduce 100%
>>> 11/12/01 17:20:07 INFO mapred.JobClient: Task Id : attempt_201111302113_0034_r_000013_3, Status : FAILED
>>> org.apache.hadoop.hbase.client.RetriesExhaustedWithDetailsException: Failed 1 action: servers with issues: lp171.etri.re.kr:60020,
>>>        at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.processBatch(HConnectionManager.java:1239)
>>>        at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.processBatchOfPuts(HConnectionManager.java:1253)
>>>        at org.apache.hadoop.hbase.client.HTable.flushCommits(HTable.java:828)
>>>        at org.apache.hadoop.hbase.client.HTable.doPut(HTable.java:684)
>>>        at org.apache.hadoop.hbase.client.HTable.put(HTable.java:669)
>>>        at org.apache.hadoop.hbase.mapreduce.TableOutputFormat$TableRecordWriter.write(TableOutputFormat.java:127)
>>>        at org.apache.hadoop.hbase.mapreduce.TableOutputFormat$TableRecordWriter.write(TableOutputFormat.java:82)
>>>        at org.apache.hadoop.mapred.ReduceTask$NewTrackingRecordWriter.write(ReduceTask.java:514)
>>>        at org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80)
>>>        at etri.qa.mapreduce.PostProcess$PostPro
>>> attempt_201111302113_0034_r_000013_3: 20111122
>>> 11/12/01 17:20:09 INFO mapred.JobClient:  map 100% reduce 95%
>>> 11/12/01 17:20:09 INFO mapred.JobClient: Job complete: job_201111302113_0034
>>> 11/12/01 17:20:09 INFO mapred.JobClient: Counters: 23
>>> 11/12/01 17:20:09 INFO mapred.JobClient:   Job Counters
>>> 11/12/01 17:20:09 INFO mapred.JobClient:     Launched reduce tasks=34
>>> 11/12/01 17:20:09 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=4701750
>>> 11/12/01 17:20:09 INFO mapred.JobClient:     Total time spent by all
>>> reduces waiting after reserving slots (ms)=0
>>> 11/12/01 17:20:09 INFO mapred.JobClient:     Total time spent by all maps
>>> waiting after reserving slots (ms)=0
>>> 11/12/01 17:20:09 INFO mapred.JobClient:     Rack-local map tasks=54
>>> 11/12/01 17:20:09 INFO mapred.JobClient:     Launched map tasks=310
>>> 11/12/01 17:20:09 INFO mapred.JobClient:     Data-local map tasks=256
>>> 11/12/01 17:20:09 INFO mapred.JobClient:     Failed reduce tasks=1
>>> 11/12/01 17:20:09 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=3043956
>>> 11/12/01 17:20:09 INFO mapred.JobClient:   FileSystemCounters
>>> 11/12/01 17:20:09 INFO mapred.JobClient:     FILE_BYTES_READ=1120
>>> 11/12/01 17:20:09 INFO mapred.JobClient:     HDFS_BYTES_READ=36392
>>> 11/12/01 17:20:09 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=69076396
>>> 11/12/01 17:20:09 INFO mapred.JobClient:   Map-Reduce Framework
>>> 11/12/01 17:20:09 INFO mapred.JobClient:     Reduce input groups=0
>>> 11/12/01 17:20:09 INFO mapred.JobClient:     Combine output records=0
>>> 11/12/01 17:20:09 INFO mapred.JobClient:     Map input records=2187386
>>> 11/12/01 17:20:09 INFO mapred.JobClient:     Reduce shuffle bytes=289536
>>> 11/12/01 17:20:09 INFO mapred.JobClient:     Reduce output records=0
>>> 11/12/01 17:20:09 INFO mapred.JobClient:     Spilled Records=2187386
>>> 11/12/01 17:20:09 INFO mapred.JobClient:     Map output bytes=91870212
>>> 11/12/01 17:20:09 INFO mapred.JobClient:     Combine input records=0
>>> 11/12/01 17:20:09 INFO mapred.JobClient:     Map output records=2187386
>>> 11/12/01 17:20:09 INFO mapred.JobClient:     SPLIT_RAW_BYTES=36392
>>> 11/12/01 17:20:09 INFO mapred.JobClient:     Reduce input records=0
>>> 
>>> ------------------------------------------------------------------------
>>> 
>>> So I looked up the log of the node that has failed me.
>>> Below is the cause of failure.
>>> 
>>> ------------------------------------------------------------------------
>>> 2011-12-01 17:07:37,968 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Block cache LRU eviction completed; freed=79.73 MB, total=597.89 MB, single=372.17 MB, multi=298.71 MB, memory=0 KB
>>> 2011-12-01 17:07:39,721 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Block cache LRU eviction started; Attempting to free 79.7 MB of total=677.2 MB
>>> 2011-12-01 17:07:39,737 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Block cache LRU eviction completed; freed=79.71 MB, total=598.06 MB, single=372.2 MB, multi=298.71 MB, memory=0 KB
>>> 2011-12-01 17:07:41,414 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Block cache LRU eviction started; Attempting to free 79.71 MB of total=677.21 MB
>>> 2011-12-01 17:07:41,418 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Block cache LRU eviction completed; freed=79.72 MB, total=597.82 MB, single=372.16 MB, multi=298.71 MB, memory=0 KB
>>> 2011-12-01 17:07:41,849 WARN org.apache.hadoop.ipc.HBaseServer: IPC Server listener on 60020: readAndProcess threw exception java.io.IOException: Connection reset by peer. Count of bytes read: 0
>>> java.io.IOException: Connection reset by peer
>>>        at sun.nio.ch.FileDispatcher.read0(Native Method)
>>>        at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21)
>>>        at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:202)
>>>        at sun.nio.ch.IOUtil.read(IOUtil.java:175)
>>>        at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:243)
>>>        at org.apache.hadoop.hbase.ipc.HBaseServer.channelRead(HBaseServer.java:1359)
>>>        at org.apache.hadoop.hbase.ipc.HBaseServer$Connection.readAndProcess(HBaseServer.java:900)
>>>        at org.apache.hadoop.hbase.ipc.HBaseServer$Listener.doRead(HBaseServer.java:522)
>>>        at org.apache.hadoop.hbase.ipc.HBaseServer$Listener$Reader.run(HBaseServer.java:316)
>>>        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>>>        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>>>        at java.lang.Thread.run(Thread.java:662)
>>> 2011-12-01 17:10:41,509 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Scanner 8383367466793958870 lease expired
>>> 2011-12-01 17:11:06,816 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: LRU Stats: total=601.74 MB, free=194.94 MB, max=796.67 MB, blocks=9281, accesses=3415277, hits=106627, hitRatio=3.12%%, cachingAccesses=2734012, cachingHits=88768, cachingHitsRatio=3.24%%, evictions=2227, evicted=2635963, evictedPerRun=1183.6385498046875
>>> 2011-12-01 17:13:33,457 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Block cache LRU eviction started; Attempting to free 79.72 MB of total=677.22 MB
>>> 2011-12-01 17:13:33,461 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Block cache LRU eviction completed; freed=79.75 MB, total=598.11 MB, single=372.35 MB, multi=298.71 MB, memory=0 KB
>>> 2011-12-01 17:13:34,169 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Block cache LRU eviction started; Attempting to free 79.71 MB of total=677.22 MB
>>> 
>>> ------------------------------------------------------------------------
>>> 
>>> I searched Google and the most prominent answer I found was the small number
>>> of ZooKeeper connections allowed. But since I am using HBase 0.90.4, I
>>> thought that did not apply in my case.
>>> Could anyone please enlighten me with this problem?
>>> 
>>> Regards,
>>> Ed
>> 
>> 

