hbase-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Stack <st...@duboce.net>
Subject Re: Reducer failed with TableOutputFormat
Date Fri, 05 Feb 2010 06:43:56 GMT
On Wed, Feb 3, 2010 at 6:37 PM, Victor Hsieh <victorhsieh@gmail.com> wrote:
> Hi,
>
> I'm running MapReduce with TableOutputFormat, and seems to have
> similar trouble as HBASE-1603.  After digging to the log, I realized
> that the time when my reducer failed might possibly related to region
> "callgraph,0958HGCAKZ,1265213010477" split.  Could you guys help to
> check it out?

Yeah, looks similiar.

This is suspicious:

> org.apache.hadoop.hbase.regionserver.HRegion: compaction completed on
> region callgraph,0958HGCAKZ,1265213010477 in 21mins, 31sec
> 2010-02-04 00:38:35,492 INFO

Can you tell what the above was about?  Lots of files to compact or
was it just that the machine was super loaded?  Can you lightening the
loading on hbase?  How many concurrent reducers running at any one
time?

Here's another indicator of heavy loading:

> org.apache.hadoop.hbase.regionserver.HRegion: Finished memstore flush
> of ~54.3m for region callgraph,0958HGCAKZ,1265213010477 in 26926ms,
> sequence id=55264870, compaction requested=true

Thats 27 seconds doing the flush closing region as part of the split
operation (see how writes are disabled just above it).  Its only
writing 54M.  Shouldn't be taking that long.

We're at 00:39:02 when we we close parent region.

Says split took 2 minutes.  Thats a long time.  It looks like machines
are heavily loaded.  Are you running tasktrackers beside the
datanodes.  Are they stealing resources from the datanodes?  Is your
mapreduce job i/o heavy?

St.Ack

>
> Help appreciated,
> Victor
>
> PS. Here is the log of reducer, hbase master and region server around
> the failure time (00:39:42).
>
> ==== Reducer log ====
> 2010-02-04 00:39:42,447 WARN org.apache.hadoop.mapred.TaskTracker:
> Error running child
> org.apache.hadoop.hbase.client.RetriesExhaustedException: Trying to
> contact region server Some server, retryOnlyOne=true, index=0,
> islastrow=false, tries=9, numtries=10, i=0, listsize=4444,
> region=callgraph,0958HGCAKZ,1265213010477 for region
> callgraph,0958HGCAKZ,1265213010477, row '0988JHE2YS', but failed after
> 10 attempts.
> ...
>
> 2010-02-04 00:39:44,769
> org.apache.hadoop.hbase.client.RetriesExhaustedException: Trying to
> contact region server Some server, retryOnlyOne=true, index=0,
> islastrow=false, tries=9, numtries=10, i=0, listsize=4444,
> region=callgraph,0958HGCAKZ,1265213010477 for region
> callgraph,0958HGCAKZ,1265213010477, row '0972DHJAM2', but failed after
> 10 attempts.
> Exceptions:
>
>        at org.apache.hadoop.hbase.client.HConnectionManager$TableServers$Batch.process(HConnectionManager.java:1166)
>        at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.processBatchOfRows(HConnectionManager.java:1247)
>        at org.apache.hadoop.hbase.client.HTable.flushCommits(HTable.java:609)
>        at org.apache.hadoop.hbase.client.HTable.put(HTable.java:474)
>        at org.apache.hadoop.hbase.mapreduce.TableOutputFormat$TableRecordWriter.write(TableOutputFormat.java:94)
>        at org.apache.hadoop.hbase.mapreduce.TableOutputFormat$TableRecordWriter.write(TableOutputFormat.java:55)
>        at org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80)
>        ....
>
> ==== HBase Master log ====
>
> 2010-02-04 00:38:35,492 INFO
> org.apache.hadoop.hbase.regionserver.HRegion: compaction completed on
> region callgraph,0958HGCAKZ,1265213010477 in 21mins, 31sec
> 2010-02-04 00:38:35,492 INFO
> org.apache.hadoop.hbase.regionserver.HRegion: Starting split of region
> callgraph,0958HGCAKZ,1265213010477
> 2010-02-04 00:38:35,616 DEBUG
> org.apache.hadoop.hbase.regionserver.HRegion: Closing
> callgraph,0958HGCAKZ,1265213010477: compactions & flushes disabled
> 2010-02-04 00:38:35,616 DEBUG
> org.apache.hadoop.hbase.regionserver.HRegion: Updates disabled for
> region, no outstanding scanners on callgraph,0958HGCAKZ,1265213010477
> 2010-02-04 00:38:35,616 DEBUG
> org.apache.hadoop.hbase.regionserver.HRegion: No more row locks
> outstanding on region callgraph,0958HGCAKZ,1265213010477
> 2010-02-04 00:38:35,616 DEBUG
> org.apache.hadoop.hbase.regionserver.HRegion: Started memstore flush
> for region callgraph,0958HGCAKZ,1265213010477. Current region memstore
> size 54.3m
> 2010-02-04 00:39:02,542 DEBUG
> org.apache.hadoop.hbase.regionserver.Store: Added
> hdfs://master.hadoop.mydomain.org:9000/hbase/callgraph/1429362021/pointee/2375866396720998113,
> entries=323359, sequenceid=55264870, memsize=54.3m, filesize=15.7m to
> callgraph,0958HGCAKZ,1265213010477
> 2010-02-04 00:39:02,542 DEBUG
> org.apache.hadoop.hbase.regionserver.HRegion: Finished memstore flush
> of ~54.3m for region callgraph,0958HGCAKZ,1265213010477 in 26926ms,
> sequence id=55264870, compaction requested=true
> 2010-02-04 00:39:02,543 DEBUG
> org.apache.hadoop.hbase.regionserver.Store: closed pointee
> 2010-02-04 00:39:02,544 INFO
> org.apache.hadoop.hbase.regionserver.HRegion: Closed
> callgraph,0958HGCAKZ,1265213010477
> ...
> 2010-02-04 00:40:35,144 INFO
> org.apache.hadoop.hbase.regionserver.CompactSplitThread: region split,
> META updated, and report to master all successful. Old region=REGION
> => {NAME => 'callgraph,0958HGCAKZ,1265213010477', STARTKEY =>
> '0958HGCAKZ', ENDKEY => '', ENCODED => 1429362021, OFFLINE => true,
> SPLIT => true, TABLE => {{NAME => 'callgraph', FAMILIES => [{NAME =>
> 'pointee', COMPRESSION => 'NONE', VERSIONS => '3', TTL =>
> '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE
> => 'true'}]}}, new regions: callgraph,0958HGCAKZ,1265215115595,
> callgraph,0972EKCRJ1,1265215115595. Split took 1mins, 59sec
> 2010-02-04 00:40:35,144 DEBUG
> org.apache.hadoop.hbase.regionserver.HRegion: Skipping compaction on
> callgraph,0958HGCAKZ,1265213010477 because closing/closed
>
> ==== HBase Region Server log ====
>
> 2010-02-04 00:38:35,492 INFO
> org.apache.hadoop.hbase.regionserver.HRegion: compaction completed on
> region callgraph,0958HGCAKZ,1265213010477 in 21mins, 31sec
> 2010-02-04 00:38:35,492 INFO
> org.apache.hadoop.hbase.regionserver.HRegion: Starting split of region
> callgraph,0958HGCAKZ,1265213010477
> 2010-02-04 00:38:35,616 DEBUG
> org.apache.hadoop.hbase.regionserver.HRegion: Closing
> callgraph,0958HGCAKZ,1265213010477: compactions & flushes disabled
> 2010-02-04 00:38:35,616 DEBUG
> org.apache.hadoop.hbase.regionserver.HRegion: Updates disabled for
> region, no outstanding scanners on callgraph,0958HGCAKZ,1265213010477
> 2010-02-04 00:38:35,616 DEBUG
> org.apache.hadoop.hbase.regionserver.HRegion: No more row locks
> outstanding on region callgraph,0958HGCAKZ,1265213010477
> 2010-02-04 00:38:35,616 DEBUG
> org.apache.hadoop.hbase.regionserver.HRegion: Started memstore flush
> for region callgraph,0958HGCAKZ,1265213010477. Current region memstore
> size 54.3m
> 2010-02-04 00:39:02,542 DEBUG
> org.apache.hadoop.hbase.regionserver.Store: Added
> hdfs://master.hadoop.mydomain.org:9000/hbase/callgraph/1429362021/pointee/2375866396720998113,
> entries=323359, sequenceid=55264870, memsize=54.3m, filesize=15.7m to
> callgraph,0958HGCAKZ,1265213010477
> 2010-02-04 00:39:02,542 DEBUG
> org.apache.hadoop.hbase.regionserver.HRegion: Finished memstore flush
> of ~54.3m for region callgraph,0958HGCAKZ,1265213010477 in 26926ms,
> sequence id=55264870, compaction requested=true
> 2010-02-04 00:39:02,543 DEBUG
> org.apache.hadoop.hbase.regionserver.Store: closed pointee
> 2010-02-04 00:39:02,544 INFO
> org.apache.hadoop.hbase.regionserver.HRegion: Closed
> callgraph,0958HGCAKZ,1265213010477
> 2010-02-04 00:40:35,144 INFO
> org.apache.hadoop.hbase.regionserver.CompactSplitThread: region split,
> META updated, and report to master all successful. Old region=REGION
> => {NAME => 'callgraph,0958HGCAKZ,1265213010477', STARTKEY =>
> '0958HGCAKZ', ENDKEY => '', ENCODED => 1429362021, OFFLINE => true,
> SPLIT => true, TABLE => {{NAME => 'callgraph', FAMILIES => [{NAME =>
> 'pointee', COMPRESSION => 'NONE', VERSIONS => '3', TTL =>
> '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE
> => 'true'}]}}, new regions: callgraph,0958HGCAKZ,1265215115595,
> callgraph,0972EKCRJ1,1265215115595. Split took 1mins, 59sec
> 2010-02-04 00:40:35,144 DEBUG
> org.apache.hadoop.hbase.regionserver.HRegion: Skipping compaction on
> callgraph,0958HGCAKZ,1265213010477 because closing/closed
>

Mime
View raw message