hbase-user mailing list archives

From edward choi <mp2...@gmail.com>
Subject Re: Constant error when putting large data into HBase
Date Thu, 01 Dec 2011 13:58:26 GMT
Thanks Lars,
I am already familiar with the sequential key problem.
That is why I am using a hash-generated random string as the document id.
But I guess I was still pushing the cluster too hard.

Maybe I am inserting tweet documents too fast?
Since a single tweet is only 140 bytes, puts complete very quickly.
So I am guessing random keys alone are not cutting it?
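(For reference, this is roughly what I mean by a hash-generated id -- just a minimal sketch assuming an MD5 hex digest of the tweet id; the helper and class names are made up, not my actual code:)

import org.apache.commons.codec.digest.DigestUtils;

public class DocIdSketch {
    // Hypothetical helper: hash the tweet id into a 32-character hex string so
    // the ids are spread out instead of monotonically increasing.
    public static String makeDocId(String tweetId) {
        return DigestUtils.md5Hex(tweetId);
    }

    public static void main(String[] args) {
        System.out.println(makeDocId("some-tweet-id"));  // prints a 32-char hex id
    }
}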

I am counting about 20,000 requests per region when I perform the MapReduce loading.
Is that too much to handle?

Is there a way to deliberately slow down the input process?
I am reading from a 21-node HDFS cluster and writing to a 21-node HBase
cluster, so the processing speed and the sheer volume of data in flight are
enormous.
Can I set a limit on the requests per region? Say, 5,000 requests maximum?
I really want to know just how far I can push HBase.
But I guess the developers would say it all depends on the use case.
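(Something like the crude throttle below is what I had in mind -- purely a hypothetical sketch, not my actual PostProcess reducer; the column family, batch size, and pause length are made-up values:)

import java.io.IOException;

import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Hypothetical reducer that pauses after every BATCH_SIZE puts so a single hot
// region (e.g. the one holding "20111122") is not hammered continuously.
public class ThrottledPutReducer
        extends Reducer<Text, Text, ImmutableBytesWritable, Put> {

    private static final int BATCH_SIZE = 1000;   // puts between pauses (made up)
    private static final long PAUSE_MS = 200;     // pause length in ms (made up)
    private long putCount = 0;

    @Override
    protected void reduce(Text date, Iterable<Text> docIds, Context context)
            throws IOException, InterruptedException {
        byte[] row = Bytes.toBytes(date.toString());           // e.g. "20111122"
        for (Text docId : docIds) {
            Put put = new Put(row);
            put.add(Bytes.toBytes("d"), Bytes.toBytes(docId.toString()),
                    Bytes.toBytes(1L));                         // placeholder family/value
            context.write(new ImmutableBytesWritable(row), put);
            if (++putCount % BATCH_SIZE == 0) {
                Thread.sleep(PAUSE_MS);                         // deliberately slow the writes
            }
        }
    }
}

Failing that, I suppose lowering hbase.client.write.buffer in the job configuration would make the client flush smaller batches, though I haven't tried it.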

I thought about using the bulk-loading feature, but I kinda got lazy and just
went with the random-string row id.
If parameter tweaking doesn't pan out, I'll have no choice but to try the
bulk-loading feature (a rough sketch of what I have in mind is below).
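(Roughly what I understand the bulk-load path to look like -- a sketch only; the table name, output directory, and job wiring are placeholders, and I haven't run this:)

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Sketch: write HFiles from the MapReduce job, then hand them to the region
// servers in one step instead of millions of individual puts.
public class BulkLoadSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = new Job(conf, "tweet-bulk-load");
        job.setJarByClass(BulkLoadSketch.class);
        // ... input format, mapper, and map output types would be set here ...

        HTable table = new HTable(conf, "tweets");         // placeholder table name
        HFileOutputFormat.configureIncrementalLoad(job, table);

        Path hfileDir = new Path("/tmp/tweet-hfiles");     // placeholder output dir
        FileOutputFormat.setOutputPath(job, hfileDir);
        if (!job.waitForCompletion(true)) {
            System.exit(1);
        }

        // Same effect as the completebulkload tool: move the HFiles into the regions.
        new LoadIncrementalHFiles(conf).doBulkLoad(hfileDir, table);
    }
}

If I understand Lars correctly, this sidesteps the per-region request question entirely, since the regionservers just pick up the finished HFiles.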

Thanks for the reply.

Regards,
Ed


2011/12/1 Lars George <lars.george@gmail.com>

> Hi Ed,
>
> Without having looked at the logs, this sounds like the common case of
> overloading a single region due to your sequential row keys. Either hash
> the keys, or salt them - but the best bet here is to use the bulk loading
> feature of HBase (http://hbase.apache.org/bulk-loads.html). That bypasses
> this problem and lets you continue to use sequential keys.
>
> Lars
>
>
> On Dec 1, 2011, at 12:21 PM, edward choi wrote:
>
> > Hi Lars,
> >
> > Okay here goes some details.
> > There are 21 tasktrackers/datanodes/regionservers
> > There is one Jobtracker/namenode/master
> > Three zookeepers.
> >
> > There are about 200 million tweets in HBase.
> > My MapReduce code aggregates tweets by the date they were generated.
> > So in the map stage, I write out the tweet date as the key and the document
> > id as the value (the document id is randomly generated by a hash algorithm).
> > In the reduce stage, I put the data into a table. The key (which is the
> > tweet date) is the table row id, and the values (which are document ids)
> > become the column values.
> >
> > Now, the map stage is fine. I get to 100% map. But during the reduce stage,
> > one of my regionservers fails.
> > I don't know what the exact symptom is. I just get:
> >> org.apache.hadoop.hbase.client.RetriesExhaustedWithDetailsException: Failed
> >> 1 action: servers with issues: lp171.etri.re.kr:60020,
> >
> > About "some node always die" <== scratch this.
> >
> > To be precise,
> > I narrowed down the range of data that I wanted to process.
> > I tried to put only the tweets that were generated on 2011/11/22.
> > Now the reduce code will produce a row with "20111122" as the row id, and a
> > bunch of document ids as the column values. (I use a 32-byte string as the
> > document id, and I append 1,000 document ids to a single column.)
> > So the region that my data will be inserted into has "20111122" between
> > its Start Key and End Key.
> > The regionserver that holds that specific region fails. That is the
> > point. If I move that region to another regionserver using the hbase shell,
> > then that regionserver fails,
> > with the same log output.
> > After 4 failures, the job is force-cancelled and the put operation is not
> > done.
> >
> > Now, even with the failure, the regionserver is still online. It is not
> > dead (sorry for my use of the word 'die').
> >
> > I have pasted the JobTracker log, the TaskTracker log (the one that failed),
> > and the RegionServer log (the one that failed) using Pastebin.
> > The job started at 2011-12-01 17:14:43 and was killed at 2011-12-01 17:20:07.
> >
> > JobTracker Log: http://pastebin.com/embed_js.php?i=n6sp8Fyi
> >
> > TaskTracker Log: http://pastebin.com/embed_js.php?i=RMFc41D5
> >
> > RegionServer Log: http://pastebin.com/embed_js.php?i=UpKF8HwN
> >
> > And finally, in the logs I pasted, the other lines are all DEBUG
> > or INFO, so I thought those were okay.
> > Is there a way to change the WARN-level logging to some other level? If you
> > let me know, I will paste another set of logs.
> >
> > Thanks,
> > Ed
> >
> > 2011/12/1 Lars George <lars.george@gmail.com>
> >
> >> Hi Ed,
> >>
> >> You need to be more precise, I am afraid. First of all, what does "some node
> >> always dies" mean? Is the process gone? Which process is gone?
> >> And the "error" you pasted is a WARN-level log that *might* indicate some
> >> trouble, but is *not* the reason the "node has died". Please elaborate.
> >>
> >> Also consider posting the last few hundred lines of the process logs to
> >> pastebin so that someone can take a look at them.
> >>
> >> Thanks,
> >> Lars
> >>
> >>
> >> On Dec 1, 2011, at 9:48 AM, edward choi wrote:
> >>
> >>> Hi,
> >>> I've had a problem that has been killing me for some days now.
> >>> I am using the CDH3 update 2 version of Hadoop and HBase.
> >>> When I do a large amount of bulk loading into HBase, some node always
> >>> dies.
> >>> It's not just one particular node,
> >>> but one of the many nodes eventually fails to serve.
> >>>
> >>> I set 4 GB of heap space for the master and the regionservers. I monitored
> >>> the process, and when a node fails, it has not used all of its heap yet.
> >>> So it is not a heap space problem.
> >>>
> >>> Below is what I get when I perform bulk put using MapReduce.
> >>>
> >>>
> >>> --------------------------------------------------------------------------
> >>> 11/12/01 17:17:20 INFO mapred.JobClient:  map 100% reduce 100%
> >>> 11/12/01 17:18:31 INFO mapred.JobClient: Task Id : attempt_201111302113_0034_r_000013_0, Status : FAILED
> >>> org.apache.hadoop.hbase.client.RetriesExhaustedWithDetailsException: Failed 1 action: servers with issues: lp171.etri.re.kr:60020,
> >>>        at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.processBatch(HConnectionManager.java:1239)
> >>>        at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.processBatchOfPuts(HConnectionManager.java:1253)
> >>>        at org.apache.hadoop.hbase.client.HTable.flushCommits(HTable.java:828)
> >>>        at org.apache.hadoop.hbase.client.HTable.doPut(HTable.java:684)
> >>>        at org.apache.hadoop.hbase.client.HTable.put(HTable.java:669)
> >>>        at org.apache.hadoop.hbase.mapreduce.TableOutputFormat$TableRecordWriter.write(TableOutputFormat.java:127)
> >>>        at org.apache.hadoop.hbase.mapreduce.TableOutputFormat$TableRecordWriter.write(TableOutputFormat.java:82)
> >>>        at org.apache.hadoop.mapred.ReduceTask$NewTrackingRecordWriter.write(ReduceTask.java:514)
> >>>        at org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80)
> >>>        at etri.qa.mapreduce.PostProcess$PostPro
> >>> attempt_201111302113_0034_r_000013_0: 20111122
> >>> 11/12/01 17:18:36 INFO mapred.JobClient: Task Id : attempt_201111302113_0034_r_000013_1, Status : FAILED
> >>> org.apache.hadoop.hbase.client.RetriesExhaustedWithDetailsException: Failed 1 action: servers with issues: lp171.etri.re.kr:60020,
> >>>        at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.processBatch(HConnectionManager.java:1239)
> >>>        at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.processBatchOfPuts(HConnectionManager.java:1253)
> >>>        at org.apache.hadoop.hbase.client.HTable.flushCommits(HTable.java:828)
> >>>        at org.apache.hadoop.hbase.client.HTable.doPut(HTable.java:684)
> >>>        at org.apache.hadoop.hbase.client.HTable.put(HTable.java:669)
> >>>        at org.apache.hadoop.hbase.mapreduce.TableOutputFormat$TableRecordWriter.write(TableOutputFormat.java:127)
> >>>        at org.apache.hadoop.hbase.mapreduce.TableOutputFormat$TableRecordWriter.write(TableOutputFormat.java:82)
> >>>        at org.apache.hadoop.mapred.ReduceTask$NewTrackingRecordWriter.write(ReduceTask.java:514)
> >>>        at org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80)
> >>>        at etri.qa.mapreduce.PostProcess$PostPro
> >>> attempt_201111302113_0034_r_000013_1: 20111122
> >>> 11/12/01 17:18:37 INFO mapred.JobClient:  map 100% reduce 95%
> >>> 11/12/01 17:18:44 INFO mapred.JobClient:  map 100% reduce 96%
> >>> 11/12/01 17:18:47 INFO mapred.JobClient:  map 100% reduce 98%
> >>> 11/12/01 17:18:50 INFO mapred.JobClient:  map 100% reduce 99%
> >>> 11/12/01 17:18:53 INFO mapred.JobClient:  map 100% reduce 100%
> >>> 11/12/01 17:20:07 INFO mapred.JobClient: Task Id : attempt_201111302113_0034_r_000013_3, Status : FAILED
> >>> org.apache.hadoop.hbase.client.RetriesExhaustedWithDetailsException: Failed 1 action: servers with issues: lp171.etri.re.kr:60020,
> >>>        at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.processBatch(HConnectionManager.java:1239)
> >>>        at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.processBatchOfPuts(HConnectionManager.java:1253)
> >>>        at org.apache.hadoop.hbase.client.HTable.flushCommits(HTable.java:828)
> >>>        at org.apache.hadoop.hbase.client.HTable.doPut(HTable.java:684)
> >>>        at org.apache.hadoop.hbase.client.HTable.put(HTable.java:669)
> >>>        at org.apache.hadoop.hbase.mapreduce.TableOutputFormat$TableRecordWriter.write(TableOutputFormat.java:127)
> >>>        at org.apache.hadoop.hbase.mapreduce.TableOutputFormat$TableRecordWriter.write(TableOutputFormat.java:82)
> >>>        at org.apache.hadoop.mapred.ReduceTask$NewTrackingRecordWriter.write(ReduceTask.java:514)
> >>>        at org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80)
> >>>        at etri.qa.mapreduce.PostProcess$PostPro
> >>> attempt_201111302113_0034_r_000013_3: 20111122
> >>> 11/12/01 17:20:09 INFO mapred.JobClient:  map 100% reduce 95%
> >>> 11/12/01 17:20:09 INFO mapred.JobClient: Job complete: job_201111302113_0034
> >>> 11/12/01 17:20:09 INFO mapred.JobClient: Counters: 23
> >>> 11/12/01 17:20:09 INFO mapred.JobClient:   Job Counters
> >>> 11/12/01 17:20:09 INFO mapred.JobClient:     Launched reduce tasks=34
> >>> 11/12/01 17:20:09 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=4701750
> >>> 11/12/01 17:20:09 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
> >>> 11/12/01 17:20:09 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
> >>> 11/12/01 17:20:09 INFO mapred.JobClient:     Rack-local map tasks=54
> >>> 11/12/01 17:20:09 INFO mapred.JobClient:     Launched map tasks=310
> >>> 11/12/01 17:20:09 INFO mapred.JobClient:     Data-local map tasks=256
> >>> 11/12/01 17:20:09 INFO mapred.JobClient:     Failed reduce tasks=1
> >>> 11/12/01 17:20:09 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=3043956
> >>> 11/12/01 17:20:09 INFO mapred.JobClient:   FileSystemCounters
> >>> 11/12/01 17:20:09 INFO mapred.JobClient:     FILE_BYTES_READ=1120
> >>> 11/12/01 17:20:09 INFO mapred.JobClient:     HDFS_BYTES_READ=36392
> >>> 11/12/01 17:20:09 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=69076396
> >>> 11/12/01 17:20:09 INFO mapred.JobClient:   Map-Reduce Framework
> >>> 11/12/01 17:20:09 INFO mapred.JobClient:     Reduce input groups=0
> >>> 11/12/01 17:20:09 INFO mapred.JobClient:     Combine output records=0
> >>> 11/12/01 17:20:09 INFO mapred.JobClient:     Map input records=2187386
> >>> 11/12/01 17:20:09 INFO mapred.JobClient:     Reduce shuffle bytes=289536
> >>> 11/12/01 17:20:09 INFO mapred.JobClient:     Reduce output records=0
> >>> 11/12/01 17:20:09 INFO mapred.JobClient:     Spilled Records=2187386
> >>> 11/12/01 17:20:09 INFO mapred.JobClient:     Map output bytes=91870212
> >>> 11/12/01 17:20:09 INFO mapred.JobClient:     Combine input records=0
> >>> 11/12/01 17:20:09 INFO mapred.JobClient:     Map output records=2187386
> >>> 11/12/01 17:20:09 INFO mapred.JobClient:     SPLIT_RAW_BYTES=36392
> >>> 11/12/01 17:20:09 INFO mapred.JobClient:     Reduce input records=0
> >>>
> >>> --------------------------------------------------------------------------
> >>>
> >>> So I looked up the log of the node that has failed me.
> >>> Below is the cause of failure.
> >>>
> >>> --------------------------------------------------------------------------
> >>> 2011-12-01 17:07:37,968 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Block cache LRU eviction completed; freed=79.73 MB, total=597.89 MB, single=372.17 MB, multi=298.71 MB, memory=0 KB
> >>> 2011-12-01 17:07:39,721 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Block cache LRU eviction started; Attempting to free 79.7 MB of total=677.2 MB
> >>> 2011-12-01 17:07:39,737 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Block cache LRU eviction completed; freed=79.71 MB, total=598.06 MB, single=372.2 MB, multi=298.71 MB, memory=0 KB
> >>> 2011-12-01 17:07:41,414 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Block cache LRU eviction started; Attempting to free 79.71 MB of total=677.21 MB
> >>> 2011-12-01 17:07:41,418 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Block cache LRU eviction completed; freed=79.72 MB, total=597.82 MB, single=372.16 MB, multi=298.71 MB, memory=0 KB
> >>> 2011-12-01 17:07:41,849 WARN org.apache.hadoop.ipc.HBaseServer: IPC Server listener on 60020: readAndProcess threw exception java.io.IOException: Connection reset by peer. Count of bytes read: 0
> >>> java.io.IOException: Connection reset by peer
> >>>        at sun.nio.ch.FileDispatcher.read0(Native Method)
> >>>        at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21)
> >>>        at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:202)
> >>>        at sun.nio.ch.IOUtil.read(IOUtil.java:175)
> >>>        at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:243)
> >>>        at org.apache.hadoop.hbase.ipc.HBaseServer.channelRead(HBaseServer.java:1359)
> >>>        at org.apache.hadoop.hbase.ipc.HBaseServer$Connection.readAndProcess(HBaseServer.java:900)
> >>>        at org.apache.hadoop.hbase.ipc.HBaseServer$Listener.doRead(HBaseServer.java:522)
> >>>        at org.apache.hadoop.hbase.ipc.HBaseServer$Listener$Reader.run(HBaseServer.java:316)
> >>>        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
> >>>        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
> >>>        at java.lang.Thread.run(Thread.java:662)
> >>> 2011-12-01 17:10:41,509 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Scanner 8383367466793958870 lease expired
> >>> 2011-12-01 17:11:06,816 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: LRU Stats: total=601.74 MB, free=194.94 MB, max=796.67 MB, blocks=9281, accesses=3415277, hits=106627, hitRatio=3.12%%, cachingAccesses=2734012, cachingHits=88768, cachingHitsRatio=3.24%%, evictions=2227, evicted=2635963, evictedPerRun=1183.6385498046875
> >>> 2011-12-01 17:13:33,457 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Block cache LRU eviction started; Attempting to free 79.72 MB of total=677.22 MB
> >>> 2011-12-01 17:13:33,461 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Block cache LRU eviction completed; freed=79.75 MB, total=598.11 MB, single=372.35 MB, multi=298.71 MB, memory=0 KB
> >>> 2011-12-01 17:13:34,169 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Block cache LRU eviction started; Attempting to free 79.71 MB of total=677.22 MB
> >>>
> >>> --------------------------------------------------------------------------
> >>>
> >>> I searched Google, and the most prominent answer I found was the small
> >>> number of ZooKeeper connections allowed. But since I am using HBase 0.90.4, I
> >>> thought that did not apply to my case.
> >>> Could anyone please enlighten me with this problem?
> >>>
> >>> Regards,
> >>> Ed
> >>
> >>
>
>
