hadoop-common-dev mailing list archives

From "Bryan Pendleton (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-16) RPC call times out while indexing map task is computing splits
Date Tue, 21 Feb 2006 06:01:24 GMT
    [ http://issues.apache.org/jira/browse/HADOOP-16?page=comments#action_12367148 ] 

Bryan Pendleton commented on HADOOP-16:
---------------------------------------

I have another idea.

First, switch the delegation of responsibility for job assignment. Right now, it happens in
the JobTracker instance, in response to an obtainNewMapTask call. This scales very poorly.
In particular, it causes the RPC timeouts if you do any serious work in obtainNewMapTask.
There's a related bug, just reported as HADOOP-43, which occurs if you spend too long in an
RPC call; it is also described in the second comment, above.

So, instead of having the JobTracker do this reactively, either:
1) Precompute - probably most scalably done by starting a mini-job, which just computes the
list of who has precached data from a given FileSplit.
2) Compute on demand - as a TaskTracker job. This could work via a protocol in which the
TaskTracker is offered a set of possible tasks, picks the ones it considers best, and returns
the remainder for assignment elsewhere. This, of course, would only work well for instances
where tasks >> TaskTracker instances.
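To make option 2 concrete, here is a minimal sketch of the offer/return protocol described
above. This is purely illustrative, not existing Hadoop code: the class and method names
(TaskOffer, pickLocalTasks) are hypothetical, and "has local split data" is reduced to a
simple host comparison.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of option 2: the JobTracker offers a batch of
// candidate tasks; the TaskTracker keeps those whose split data lives
// on its own host and hands the remainder back for reassignment.
public class TaskOffer {
    public final String taskId;
    public final String splitHost; // host holding the split's blocks

    public TaskOffer(String taskId, String splitHost) {
        this.taskId = taskId;
        this.splitHost = splitHost;
    }

    // Partition the offered batch: "accepted" tasks have local data on
    // this host; everything else is returned to the JobTracker.
    public static List<List<TaskOffer>> pickLocalTasks(
            List<TaskOffer> offered, String localHost) {
        List<TaskOffer> accepted = new ArrayList<>();
        List<TaskOffer> returned = new ArrayList<>();
        for (TaskOffer o : offered) {
            if (o.splitHost.equals(localHost)) {
                accepted.add(o);
            } else {
                returned.add(o);
            }
        }
        List<List<TaskOffer>> result = new ArrayList<>();
        result.add(accepted);
        result.add(returned);
        return result;
    }
}
```

The point of returning the remainder, rather than silently dropping it, is that the JobTracker
can immediately re-offer those tasks to other trackers without waiting for a timeout.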

I looked at implementing something like option 1, but decided that 2 is a much better option.
Option 1 would put a lot more instantaneous demand on the NameNode, and once you've finished
precomputing the best nodes, you have no good answer when nodes come or go. Option 2 distributes
both the work and some of the demand, and it lets the cluster grow or shrink dramatically
without failing to take advantage of the local storage available at each node. Unfortunately,
without any pre-work, option 2 could pick bad subsets of work to distribute to each node
and get no local I/O improvement at all.

I'd really like to see something done, preferably soon. With dozens of nodes and hundreds
of GB of data in my current problem set, it's very nearly impossible to get the current code
to make progress without killing TaskTrackers (some with many work units already completed).
I can do some of the coding, if there's agreement on which direction to push.

> RPC call times out while indexing map task is computing splits
> --------------------------------------------------------------
>
>          Key: HADOOP-16
>          URL: http://issues.apache.org/jira/browse/HADOOP-16
>      Project: Hadoop
>         Type: Bug
>   Components: mapred
>  Environment: MapReduce multi-computer crawl environment: 11 machines (1 master with
JobTracker/NameNode, 10 slaves with TaskTrackers/DataNodes)
>     Reporter: Chris Schneider
>  Attachments: patch.16
>
> We've been using Nutch 0.8 (MapReduce) to perform some internet crawling. Things seemed
to be going well until...
> 060129 222409 Lost tracker 'tracker_56288'
> 060129 222409 Task 'task_m_10gs5f' has been lost.
> 060129 222409 Task 'task_m_10qhzr' has been lost.
>    ........
>    ........
> 060129 222409 Task 'task_r_zggbwu' has been lost.
> 060129 222409 Task 'task_r_zh8dao' has been lost.
> 060129 222455 Server handler 8 on 8010 caught: java.net.SocketException: Socket closed
> java.net.SocketException: Socket closed
>         at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:99)
>         at java.net.SocketOutputStream.write(SocketOutputStream.java:136)
>         at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
>         at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:123)
>         at java.io.DataOutputStream.flush(DataOutputStream.java:106)
>         at org.apache.nutch.ipc.Server$Handler.run(Server.java:216)
> 060129 222455 Adding task 'task_m_cia5po' to set for tracker 'tracker_56288'
> 060129 223711 Adding task 'task_m_ffv59i' to set for tracker 'tracker_25647'
> I'm hoping that someone could explain why task_m_cia5po got added to tracker_56288 after
this tracker was lost.
> The Crawl .main process died with the following output:
> 060129 221129 Indexer: adding segment: /user/crawler/crawl-20060129091444/segments/20060129200246
> Exception in thread "main" java.io.IOException: timed out waiting for response
>     at org.apache.nutch.ipc.Client.call(Client.java:296)
>     at org.apache.nutch.ipc.RPC$Invoker.invoke(RPC.java:127)
>     at $Proxy1.submitJob(Unknown Source)
>     at org.apache.nutch.mapred.JobClient.submitJob(JobClient.java:259)
>     at org.apache.nutch.mapred.JobClient.runJob(JobClient.java:288)
>     at org.apache.nutch.indexer.Indexer.index(Indexer.java:263)
>     at org.apache.nutch.crawl.Crawl.main(Crawl.java:127)
> However, it definitely seems as if the JobTracker is still waiting for the job to finish
(no failed jobs).
> Doug Cutting's response:
> The bug here is that the RPC call times out while the map task is computing splits. 
The fix is that the job tracker should not compute splits until after it has returned from
the submitJob RPC.  Please submit a bug in Jira to help remind us to fix this.
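> The shape of Doug's fix can be sketched as follows. This is not the actual Hadoop patch:
the class and method names (DeferredSplitJobTracker, computeSplits, job ID value) are
illustrative assumptions. The idea is only that submitJob replies inside the RPC timeout
and pushes split computation onto a background thread.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch: submitJob() records the job and returns
// immediately, while the expensive split computation runs on a
// background thread so the RPC handler never blocks long enough
// for the client's call to time out.
public class DeferredSplitJobTracker {
    private final AtomicBoolean splitsComputed = new AtomicBoolean(false);
    private final CountDownLatch done = new CountDownLatch(1);

    // Returns quickly; the heavy work happens after the RPC has replied.
    public String submitJob(final String jobFile) {
        Thread worker = new Thread(() -> {
            computeSplits(jobFile);   // potentially minutes of work
            splitsComputed.set(true);
            done.countDown();
        });
        worker.setDaemon(true);
        worker.start();
        return "job_0001";            // handed back well within the RPC timeout
    }

    private void computeSplits(String jobFile) {
        // placeholder for listing input files and building FileSplits
    }

    // Lets callers (or tests) wait for the deferred computation.
    public boolean awaitSplits() throws InterruptedException {
        done.await();
        return splitsComputed.get();
    }
}
```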

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira

