hbase-user mailing list archives

From Ted Yu <yuzhih...@gmail.com>
Subject Re: Task attempt failed to report status
Date Sun, 07 Mar 2010 05:38:02 GMT
You can introduce a second thread in the reducer which periodically
reports status to Hadoop.
At the same time, you can record the longest put operation to see how
much time it takes.
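
A rough sketch of that heartbeat thread (untested; the class name is just
illustrative):

    import org.apache.hadoop.mapred.Reporter;

    // Untested sketch: a daemon thread that pings the Reporter so the task
    // is not killed while a long put is in flight.
    class ProgressHeartbeat extends Thread {
        private final Reporter reporter;
        private volatile boolean running = true;

        ProgressHeartbeat(Reporter reporter) {
            this.reporter = reporter;
            setDaemon(true); // don't keep the JVM alive after the reducer exits
        }

        public void run() {
            while (running) {
                reporter.progress(); // tell the TaskTracker we are still alive
                try {
                    Thread.sleep(60 * 1000L); // well under the 600-second task timeout
                } catch (InterruptedException e) {
                    return;
                }
            }
        }

        void shutdown() {
            running = false;
            interrupt();
        }
    }

Start it at the top of reduce() and call shutdown() in a finally block. For the
timing, wrapping each put with System.currentTimeMillis() calls will show you
the longest one.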

Lowering the number of cells in a Put to some value under 1000 may
help as well.

On Saturday, March 6, 2010, steven zhuang <steven.zhuang.1984@gmail.com> wrote:
> Thanks, Stack.
>
> I think the timeout may be caused by 1. HDFS being slow, or 2. the single row
> being way too big (millions of cells, around 50-100MB), but I don't know
> exactly how it happened.
>
> I have checked the regionserver log; there were lots of WARN messages like:
> "2010-03-05 01:44:22,881 WARN org.apache.hadoop.hbase.regionserver.HLog: IPC Server handler 16 on 60020 took 1386ms appending an edit to hlog; editcount=223222"
> I have googled some but am still not very clear: does this mean it took 1386ms
> to append a single edit to the hlog on HDFS (disk)? And that is really slow, right?
>
> Another error message (an output error when doing a Put):
>
> 2010-03-05 08:19:58,162 WARN org.apache.hadoop.ipc.HBaseServer: IPC Server Responder, call put([B@112e6ef, [Lorg.apache.hadoop.hbase.client.Put;@1f77c71) from 192.168.8.228:35324: output error
> 2010-03-05 08:19:58,349 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 3 on 60020 caught: java.nio.channels.ClosedChannelException
>     at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:126)
>     at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324)
>     at org.apache.hadoop.hbase.ipc.HBaseServer.channelWrite(HBaseServer.java:1125)
>     at org.apache.hadoop.hbase.ipc.HBaseServer$Responder.processResponse(HBaseServer.java:615)
>     at org.apache.hadoop.hbase.ipc.HBaseServer$Responder.doRespond(HBaseServer.java:679)
>     at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:943)
>
> I saw Jean-Daniel
> <http://old.nabble.com/user/UserProfile.jtp?user=1325624> answer
> this in one thread; he says this happens because the client gets killed while
> the region server is doing the action. I think in my case the error is
> raised when the Reducer is killed (600-second timeout) while the regionserver
> is still trying to commit data to HDFS. Pity I didn't find any line related to
> the task tracker on which the last reduce attempt timed out.
>
> The timeout is 600 seconds; can anything block that long? I mean, the
> disk swap/write cannot take that long.
> What exactly happens when the reducer does a Put:
> 1. the client sends data to the regionserver,
> 2. the regionserver updates the data in memory,
> 3. the regionserver flushes the data to HDFS if some threshold is reached?
>
> There should be no splitting, as in my case one row is so big that one
> region can only hold one row.
>
>
> comments answered in line:
>
> On Sun, Mar 7, 2010 at 6:27 AM, Stack <stack@duboce.net> wrote:
>
>> On Fri, Mar 5, 2010 at 1:12 AM, steven zhuang
>> <steven.zhuang.1984@gmail.com> wrote:
>> >     When I import data into the HTable with a Map/Reduce job, the task runs
>> > smoothly until the last reducer fails 6 times to report its status.
>>
>> How many reducers?  All completed except this last one, and it failed
>> in spite of 6 attempts?
>>
>
> The number of reducers is set by the following code:
>         "TableMapReduceUtil.initTableReduceJob(args[1], TableUploader.class, c);"
> There are 70 reducers, but the first 60 reducers finished very fast, taking no
> more than 10 minutes. The last few reducers took longer, around 20 minutes,
> except the last one, which took 6 retries and failed in the end.
>
>
>
>> Perhaps it's a really fat Put that is holding things up?  Can you add
>> logging of put sizes or some such to see if it's an anomalous record
>> that is causing the non-reporting after ten minutes?
>
>
> I don't know how much data in a Put makes it fat, but at present I put
> 1000 cells in one Put; each column id is a URL and the cell value is less
> than 10 bytes, which makes a Put contain about 50KB of data.
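>
> A rough sketch of the size and latency logging I could add around each collect
> (untested; loggedCollect is just an illustrative helper, and the byte count
> only sums key and value lengths, ignoring row key and overhead):
>
>     // Untested sketch: log cell count, approximate payload size and the
>     // latency of a single collect call.
>     private void loggedCollect(OutputCollector<ImmutableBytesWritable, BatchUpdate> output,
>             ImmutableBytesWritable k, BatchUpdate bu, int cellCnt, long approxBytes)
>             throws IOException {
>         long start = System.currentTimeMillis();
>         output.collect(k, bu);
>         System.err.println("collect: " + cellCnt + " cells, ~" + approxBytes
>                 + " bytes, took " + (System.currentTimeMillis() - start) + "ms");
>     }
>
> (approxBytes would be accumulated in the inner loop with
> approxBytes += e.getKey().length + e.getValue().length.)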
>
>> >     In my program I use batchupdate to collect for every 1000 cells, and
>> > update the status. I don't think the normal inserting will cost 10 minutes,
>> > because the first 99% of the job was smoothly done; only the very last
>> > reducer will get the "fail to report status" error.
>> >
>>
>> Can you add logging to your reducers?  Log each put?  Try and see
>> where it's hanging for > 10 minutes?
>
>
> Yeah, that's done. I output a line to stderr before the Put and after the Put,
> but this doesn't help much: I can only see the last line saying it's
> committing, but never get the "done" message. The log is something like
> the following:
>
> Reducer is committing row:bm_2010/01/18 col:queries:xxxxx:1:1:1:1 Time:Fri Mar 05 08:24:37 GMT 2010
> Reducer has done committing row:bm_2010/01/18 col:queries:yyyyyyy:1:1:1:1 Time:Fri Mar 05 08:24:37 GMT 2010
> Reducer is committing row:cp_2010/01/18 col:queries:uuuuuuuuuu:2:2:2:2 Time:Fri Mar 05 08:24:37 GMT 2010
>
> It hangs here until the MR task is killed.
>
>
>> >     I suspect the problem is caused by the regionserver being way too busy,
>> > which makes "output.collect(k, bu);" take too much time to return, but I am
>> > not sure because I don't know which regionserver is actually committing the
>> > update.
>> >
>> >    So which log should I dig into? Any hint is appreciated.
>> >
>>
>> Well, a reducer is responsible for a portion of the rows only,
>> usually.  MR is sorting on row?  So what arrives at the reducer is
>> sorted?  When this last reducer is running, look at the UI?  It's probably
>> going to one regionserver only?  If you emit what's being inserted,
>> perhaps you can see from the row what region it's trying to go to...
>> See where it's hosted and look at that regionserver's logs?
>>
>>
> MR is sorting on row, and yes, I think for the last reducer there is only
> one row, and one row goes to one regionserver. I checked the GUI of the
> HBase master after the job was done (killed); the region holding the failed
> row doesn't appear on the page.
>
>> >     My code's submitting portion is as follows (just copied from some
>> > online source and changed a little):
>> >
>> >        public void reduce(ImmutableBytesWritable k,
>> >                Iterator<HbaseMapWritable<byte[], byte[]>> v,
>> >                OutputCollector<ImmutableBytesWritable, BatchUpdate> output,
>> >                Reporter r) throws IOException {
>> >            while (v.hasNext()) {
>> >                r.setStatus("Reducer begin committing row: "
>> >                        + new String(k.get(), HConstants.UTF8_ENCODING)
>> >                        + "  Time:" + new Date());
>> >                BatchUpdate bu = new BatchUpdate(k.get());
>> >                int cellCnt = 0;
>> >                while (v.hasNext()) {
>> >                    HbaseMapWritable<byte[], byte[]> hmw = v.next();
>> >                    Iterator<Entry<byte[], byte[]>> iter = hmw.entrySet().iterator();
>> >                    while (iter.hasNext()) {
>> >                        Entry<byte[], byte[]> e = iter.next();
>> >                        bu.put(e.getKey(), e.getValue());
>> >                        // System.err.println("now add cell: " + e
>> >                        //         + " cell count: " + cellCnt + " " + new Date());
>> >                        if (++cellCnt > 1000) {
>> >                            output.collect(k, bu); // this line causes the timeout
>> >                            r.setStatus("Reducer done committing "
>> >                                    + new String(e.getKey(), HConstants.UTF8_ENCODING)
>> >                                    + ":" + new String(e.getValue(), HConstants.UTF8_ENCODING)
>> >                                    + "  Time:" + new Date());
>> >                            bu = new BatchUpdate(k.get());
>> >                            cellCnt = 0;
>> >                        }
>> >                    }
>> >                }
>> >                if (cellCnt > 0) {
>> >                    output.collect(k, bu);
>> >                }
>> >            }
>> >        }
>> >
>>
>>
>> Try calling out.collect every ten cells?
>>
>> You are using TableOutputFormat?  It's buffering inserts to the table?
>> If so, configure it to not buffer so much?
>>
>
> I think the 1000 cells is really small now; previously I set it to 10000.
> Yes, I am using the TableOutputFormat class. I think it buffers, because I
> didn't disable the buffering. All the reducer-side configuration is done by
> the following line:
>         "TableMapReduceUtil.initTableReduceJob(args[1], TableUploader.class, c);"
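>
> If the buffering is the problem, presumably I could shrink the client-side
> write buffer before calling initTableReduceJob, something like this (untested;
> I believe hbase.client.write.buffer is the relevant setting, default 2MB):
>
>     // Untested: a smaller write buffer should make TableOutputFormat flush
>     // puts to the regionserver more often, in smaller bursts.
>     c.setLong("hbase.client.write.buffer", 512 * 1024);
>     TableMapReduceUtil.initTableReduceJob(args[1], TableUploader.class, c);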
>
>
> St.Ack
>>
>
