hbase-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From edward choi <mp2...@gmail.com>
Subject Re: Constant error when putting large data into HBase
Date Thu, 01 Dec 2011 11:21:06 GMT
Hi Lars,

Okay here goes some details.
There are 21 tasktrackers/datanodes/regionservers
There is one Jobtracker/namenode/master
Three zookeepers.

There are about 200 million tweets in Hbase.
My mapreduce code is to aggregate tweets by their generated date.
So in the map stage, I write out tweet date as the key, and document id as
the value (document id is randomly generated by hash algorithm)
In the reduce stage, I put the data into a table. The key(which is the
tweet date) is the table rowid, and values(which are document id's) as the
column values.

Now, map stage is fine. I get to 100% map. But during reduce stage, one of
my regionserver fails.
I don't know what the exact symptom is. I just get:
> org.apache.hadoop.hbase.client.RetriesExhaustedWithDetailsException:
Failed
> 1 action: servers with issues: lp171.etri.re.kr:60020,

About "some node always die" <== scratch this.

To be precise,
I narrowed down the range of data that I wanted to process.
I tried to put tweets that was generated only at 2011/11/22.
Now the reduce code will produce a row with "20111122" as the rowid, and a
bunch of document id's as the column value. (I use 32byte string as the
document id. I append 1000 document id for a single column)
So the region that my data will be inserted will have "20111122" between
the Start Key and End Key.
The regionserver that contains that specific region fails. That is the
point. If I move that region to another regionserver using hbase shell,
then that regionserver fails.
With the same log output.
After 4 failures, the job is force-cancelled and the put operation was not
done.

Now, even with the failure, the regionserver is still online. It is not
dead(sorry for my use of word 'die').

I have pasted Jobtracker log, tasktracker(one that failed) log,
regionserver(one that failed) log using PasteBin.
The job started at 2011-12-01 17:14:43 and was killed at 2011-12-01
17:20:07.

JobTracker Log
<script src="http://pastebin.com/embed_js.php?i=n6sp8Fyi"></script>

TaskTracker Log
<script src="http://pastebin.com/embed_js.php?i=RMFc41D5"></script>

RegionServer Log
<script src="http://pastebin.com/embed_js.php?i=UpKF8HwN"></script>

And finally, according to the logs I pasted, I see other lines with DEBUG
or INFO. So I thought this was okay.
Is there a way to change WARN level log to some other level log? If you'd
let me know, I will paste another set of logs.

Thanks,
Ed

2011/12/1 Lars George <lars.george@gmail.com>

> Hi Ed,
>
> You need to be more precise I am afraid. First of all what does "some node
> always dies" mean? Is the process gone? Which process is gone?
> And the "error" you pasted is a WARN level log that *might* indicate some
> trouble, but is *not* the reason the "node has died". Please elaborate.
>
> Also consider posting the last few hundred lines of the process logs to
> pastebin so that someone can look at it.
>
> Thanks,
> Lars
>
>
> On Dec 1, 2011, at 9:48 AM, edward choi wrote:
>
> > Hi,
> > I've had a problem that has been killing for some days now.
> > I am using CDH3 update2 version of Hadoop and Hbase.
> > When I do a large amount of bulk loading into Hbase, some node always
> die.
> > It's not just one particular node.
> > But one of many nodes fail to serve eventually.
> >
> > I set 4 gigs of heap space for master, and regionservers. I monitored the
> > process and when any node fails, it has not used all the heaps yet.
> > So it is not a heap space problem.
> >
> > Below is what I get when I perform bulk put using MapReduce.
> >
> >
> ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
> > 11/12/01 17:17:20 INFO mapred.JobClient:  map 100% reduce 100%
> > 11/12/01 17:18:31 INFO mapred.JobClient: Task Id :
> > attempt_201111302113_0034_r_000013_0, Status : FAILED
> > org.apache.hadoop.hbase.client.RetriesExhaustedWithDetailsException:
> Failed
> > 1 action: servers with issues: lp171.etri.re.kr:60020,
> >        at
> >
> org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.processBatch(HConnectionManager.java:1239)
> >        at
> >
> org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.processBatchOfPuts(HConnectionManager.java:1253)
> >        at
> > org.apache.hadoop.hbase.client.HTable.flushCommits(HTable.java:828)
> >        at org.apache.hadoop.hbase.client.HTable.doPut(HTable.java:684)
> >        at org.apache.hadoop.hbase.client.HTable.put(HTable.java:669)
> >        at
> >
> org.apache.hadoop.hbase.mapreduce.TableOutputFormat$TableRecordWriter.write(TableOutputFormat.java:127)
> >        at
> >
> org.apache.hadoop.hbase.mapreduce.TableOutputFormat$TableRecordWriter.write(TableOutputFormat.java:82)
> >        at
> >
> org.apache.hadoop.mapred.ReduceTask$NewTrackingRecordWriter.write(ReduceTask.java:514)
> >        at
> >
> org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80)
> >        at etri.qa.mapreduce.PostProcess$PostPro
> > attempt_201111302113_0034_r_000013_0: 20111122
> > 11/12/01 17:18:36 INFO mapred.JobClient: Task Id :
> > attempt_201111302113_0034_r_000013_1, Status : FAILED
> > org.apache.hadoop.hbase.client.RetriesExhaustedWithDetailsException:
> Failed
> > 1 action: servers with issues: lp171.etri.re.kr:60020,
> >        at
> >
> org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.processBatch(HConnectionManager.java:1239)
> >        at
> >
> org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.processBatchOfPuts(HConnectionManager.java:1253)
> >        at
> > org.apache.hadoop.hbase.client.HTable.flushCommits(HTable.java:828)
> >        at org.apache.hadoop.hbase.client.HTable.doPut(HTable.java:684)
> >        at org.apache.hadoop.hbase.client.HTable.put(HTable.java:669)
> >        at
> >
> org.apache.hadoop.hbase.mapreduce.TableOutputFormat$TableRecordWriter.write(TableOutputFormat.java:127)
> >        at
> >
> org.apache.hadoop.hbase.mapreduce.TableOutputFormat$TableRecordWriter.write(TableOutputFormat.java:82)
> >        at
> >
> org.apache.hadoop.mapred.ReduceTask$NewTrackingRecordWriter.write(ReduceTask.java:514)
> >        at
> >
> org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80)
> >        at etri.qa.mapreduce.PostProcess$PostPro
> > attempt_201111302113_0034_r_000013_1: 20111122
> > 11/12/01 17:18:37 INFO mapred.JobClient:  map 100% reduce 95%
> > 11/12/01 17:18:44 INFO mapred.JobClient:  map 100% reduce 96%
> > 11/12/01 17:18:47 INFO mapred.JobClient:  map 100% reduce 98%
> > 11/12/01 17:18:50 INFO mapred.JobClient:  map 100% reduce 99%
> > 11/12/01 17:18:53 INFO mapred.JobClient:  map 100% reduce 100%
> > 11/12/01 17:20:07 INFO mapred.JobClient: Task Id :
> > attempt_201111302113_0034_r_000013_3, Status : FAILED
> > org.apache.hadoop.hbase.client.RetriesExhaustedWithDetailsException:
> Failed
> > 1 action: servers with issues: lp171.etri.re.kr:60020,
> >        at
> >
> org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.processBatch(HConnectionManager.java:1239)
> >        at
> >
> org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.processBatchOfPuts(HConnectionManager.java:1253)
> >        at
> > org.apache.hadoop.hbase.client.HTable.flushCommits(HTable.java:828)
> >        at org.apache.hadoop.hbase.client.HTable.doPut(HTable.java:684)
> >        at org.apache.hadoop.hbase.client.HTable.put(HTable.java:669)
> >        at
> >
> org.apache.hadoop.hbase.mapreduce.TableOutputFormat$TableRecordWriter.write(TableOutputFormat.java:127)
> >        at
> >
> org.apache.hadoop.hbase.mapreduce.TableOutputFormat$TableRecordWriter.write(TableOutputFormat.java:82)
> >        at
> >
> org.apache.hadoop.mapred.ReduceTask$NewTrackingRecordWriter.write(ReduceTask.java:514)
> >        at
> >
> org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80)
> >        at etri.qa.mapreduce.PostProcess$PostPro
> > attempt_201111302113_0034_r_000013_3: 20111122
> > 11/12/01 17:20:09 INFO mapred.JobClient:  map 100% reduce 95%
> > 11/12/01 17:20:09 INFO mapred.JobClient: Job complete:
> job_201111302113_0034
> > 11/12/01 17:20:09 INFO mapred.JobClient: Counters: 23
> > 11/12/01 17:20:09 INFO mapred.JobClient:   Job Counters
> > 11/12/01 17:20:09 INFO mapred.JobClient:     Launched reduce tasks=34
> > 11/12/01 17:20:09 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=4701750
> > 11/12/01 17:20:09 INFO mapred.JobClient:     Total time spent by all
> > reduces waiting after reserving slots (ms)=0
> > 11/12/01 17:20:09 INFO mapred.JobClient:     Total time spent by all maps
> > waiting after reserving slots (ms)=0
> > 11/12/01 17:20:09 INFO mapred.JobClient:     Rack-local map tasks=54
> > 11/12/01 17:20:09 INFO mapred.JobClient:     Launched map tasks=310
> > 11/12/01 17:20:09 INFO mapred.JobClient:     Data-local map tasks=256
> > 11/12/01 17:20:09 INFO mapred.JobClient:     Failed reduce tasks=1
> > 11/12/01 17:20:09 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=3043956
> > 11/12/01 17:20:09 INFO mapred.JobClient:   FileSystemCounters
> > 11/12/01 17:20:09 INFO mapred.JobClient:     FILE_BYTES_READ=1120
> > 11/12/01 17:20:09 INFO mapred.JobClient:     HDFS_BYTES_READ=36392
> > 11/12/01 17:20:09 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=69076396
> > 11/12/01 17:20:09 INFO mapred.JobClient:   Map-Reduce Framework
> > 11/12/01 17:20:09 INFO mapred.JobClient:     Reduce input groups=0
> > 11/12/01 17:20:09 INFO mapred.JobClient:     Combine output records=0
> > 11/12/01 17:20:09 INFO mapred.JobClient:     Map input records=2187386
> > 11/12/01 17:20:09 INFO mapred.JobClient:     Reduce shuffle bytes=289536
> > 11/12/01 17:20:09 INFO mapred.JobClient:     Reduce output records=0
> > 11/12/01 17:20:09 INFO mapred.JobClient:     Spilled Records=2187386
> > 11/12/01 17:20:09 INFO mapred.JobClient:     Map output bytes=91870212
> > 11/12/01 17:20:09 INFO mapred.JobClient:     Combine input records=0
> > 11/12/01 17:20:09 INFO mapred.JobClient:     Map output records=2187386
> > 11/12/01 17:20:09 INFO mapred.JobClient:     SPLIT_RAW_BYTES=36392
> > 11/12/01 17:20:09 INFO mapred.JobClient:     Reduce input records=0
> >
> ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
> >
> > So I looked up the log of the node that has failed me.
> > Below is the cause of failure.
> >
> ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
> > 2011-12-01 17:07:37,968 DEBUG
> > org.apache.hadoop.hbase.io.hfile.LruBlockCache: Block cache LRU eviction
> > completed; freed=79.73 MB, total=597.89 MB, single=372.17 MB,
> multi=298.71
> > MB, memory=0 KB
> > 2011-12-01 17:07:39,721 DEBUG
> > org.apache.hadoop.hbase.io.hfile.LruBlockCache: Block cache LRU eviction
> > started; Attempting to free 79.7 MB of total=677.2 MB
> > 2011-12-01 17:07:39,737 DEBUG
> > org.apache.hadoop.hbase.io.hfile.LruBlockCache: Block cache LRU eviction
> > completed; freed=79.71 MB, total=598.06 MB, single=372.2 MB, multi=298.71
> > MB, memory=0 KB
> > 2011-12-01 17:07:41,414 DEBUG
> > org.apache.hadoop.hbase.io.hfile.LruBlockCache: Block cache LRU eviction
> > started; Attempting to free 79.71 MB of total=677.21 MB
> > 2011-12-01 17:07:41,418 DEBUG
> > org.apache.hadoop.hbase.io.hfile.LruBlockCache: Block cache LRU eviction
> > completed; freed=79.72 MB, total=597.82 MB, single=372.16 MB,
> multi=298.71
> > MB, memory=0 KB
> > 2011-12-01 17:07:41,849 WARN org.apache.hadoop.ipc.HBaseServer: IPC
> Server
> > listener on 60020: readAndProcess threw exception java.io.IOException:
> > Connection reset by peer. Count of bytes read: 0
> > java.io.IOException: Connection reset by peer
> >        at sun.nio.ch.FileDispatcher.read0(Native Method)
> >        at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21)
> >        at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:202)
> >        at sun.nio.ch.IOUtil.read(IOUtil.java:175)
> >        at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:243)
> >        at
> >
> org.apache.hadoop.hbase.ipc.HBaseServer.channelRead(HBaseServer.java:1359)
> >        at
> >
> org.apache.hadoop.hbase.ipc.HBaseServer$Connection.readAndProcess(HBaseServer.java:900)
> >        at
> >
> org.apache.hadoop.hbase.ipc.HBaseServer$Listener.doRead(HBaseServer.java:522)
> >        at
> >
> org.apache.hadoop.hbase.ipc.HBaseServer$Listener$Reader.run(HBaseServer.java:316)
> >        at
> >
> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
> >        at
> >
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
> >        at java.lang.Thread.run(Thread.java:662)
> > 2011-12-01 17:10:41,509 INFO
> > org.apache.hadoop.hbase.regionserver.HRegionServer: Scanner
> > 8383367466793958870 lease expired
> > 2011-12-01 17:11:06,816 DEBUG
> > org.apache.hadoop.hbase.io.hfile.LruBlockCache: LRU Stats: total=601.74
> MB,
> > free=194.94 MB, max=796.67 MB, blocks=9281, accesses=3415277,
> hits=106627,
> > hitRatio=3.12%%, cachingAccesses=2734012, cachingHits=88768,
> > cachingHitsRatio=3.24%%, evictions=2227, evicted=2635963,
> > evictedPerRun=1183.6385498046875
> > 2011-12-01 17:13:33,457 DEBUG
> > org.apache.hadoop.hbase.io.hfile.LruBlockCache: Block cache LRU eviction
> > started; Attempting to free 79.72 MB of total=677.22 MB
> > 2011-12-01 17:13:33,461 DEBUG
> > org.apache.hadoop.hbase.io.hfile.LruBlockCache: Block cache LRU eviction
> > completed; freed=79.75 MB, total=598.11 MB, single=372.35 MB,
> multi=298.71
> > MB, memory=0 KB
> > 2011-12-01 17:13:34,169 DEBUG
> > org.apache.hadoop.hbase.io.hfile.LruBlockCache: Block cache LRU eviction
> > started; Attempting to free 79.71 MB of total=677.22 MB
> >
> ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
> >
> > I searched google and the most prominent answer I got was the small
> number
> > of Zookeeper connections allowed. But since I am using Hbase 0.90.4, I
> > thought that was not my case.
> > Could anyone please enlighten me with this problem?
> >
> > Regards,
> > Ed
>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message