hadoop-common-user mailing list archives

From Jimmy Wan <ji...@indeed.com>
Subject Re: Recommendations on Job Status and Dependency Management
Date Fri, 14 Nov 2008 18:00:29 GMT
Figured I should respond to my own question and list the solution for
the archives:

Since I already had a bunch of existing MapReduce jobs, I was able to
quickly migrate my code to Cascading and let it take care of all the
dependencies between Hadoop jobs.

By making use of the MapReduceFlow and dumping those flows into a Cascade
with a CascadeConnector, I was able to throw out several hundred lines of
hand-created Thread and dependency management code in favor of an automated
solution that actually worked a wee bit better in terms of concurrency. I was
able to see an immediate increase in the utilization of my cluster.
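For the archives, the shape of the conversion looks roughly like this. This is a sketch, not my production code: createDailyJobConf/createRollupJobConf are placeholders standing in for whatever JobConf setup you already have.

```java
import cascading.cascade.Cascade;
import cascading.cascade.CascadeConnector;
import cascading.flow.Flow;
import cascading.flow.MapReduceFlow;
import org.apache.hadoop.mapred.JobConf;

public class MonthlyCascade
  {
  public static void main( String[] args )
    {
    // wrap each pre-existing Hadoop job in a MapReduceFlow; the boolean
    // asks Cascading to delete stale sink paths before (re)running
    JobConf dailyConf = createDailyJobConf();   // placeholder for your existing setup
    JobConf rollupConf = createRollupJobConf(); // placeholder for your existing setup

    Flow daily = new MapReduceFlow( "daily", dailyConf, true );
    Flow rollup = new MapReduceFlow( "rollup", rollupConf, true );

    // the connector infers the dependency graph from each flow's
    // source and sink paths, and runs independent flows concurrently
    Cascade cascade = new CascadeConnector().connect( daily, rollup );
    cascade.complete(); // blocks until every flow has finished
    }

  static JobConf createDailyJobConf()  { /* your existing job setup */ return new JobConf(); }
  static JobConf createRollupJobConf() { /* your existing job setup */ return new JobConf(); }
  }
```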

I covered how I worked out the initial HDFS-dependencies in the other reply
to this message.
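The short version, in case that reply gets separated in the archives: a plain Configuration picks up hadoop-default.xml/hadoop-site.xml from the classpath, so FileSystem.exists() needs no extra configuration. The path below is made up.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsCheck
  {
  public static void main( String[] args ) throws Exception
    {
    // new Configuration() reads hadoop-default.xml and hadoop-site.xml
    // off the classpath, so fs.default.name comes along for free
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get( conf );

    // example path -- substitute your own job's expected output
    Path output = new Path( "/output/2008/11/01/part-00000" );
    System.out.println( output + " exists: " + fs.exists( output ) );
    }
  }
```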

As for determining whether the trigger conditions are met (some jobs rely
on outside processes for which there is no easy signal to read), I'm
currently polling a database for that data, and I'm working with Chris to
add a hook into Cascade that will allow pluggable predicates to specify
the condition.
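Until that hook lands, the polling itself is trivial; the interesting part is keeping the predicate pluggable so it can be swapped for the Cascade hook later. Something like this (the interface name is my own invention, not a Cascading API; in practice isReady() would run a SQL query against whatever bookkeeping table records when upstream feeds land):

```java
public class TriggerWait
  {
  /** Hypothetical pluggable condition -- e.g. backed by a database poll. */
  public interface TriggerPredicate
    {
    boolean isReady();
    }

  /** Poll until the predicate passes or we give up; true if it passed. */
  public static boolean waitFor( TriggerPredicate predicate, long pollMillis, int maxPolls )
    throws InterruptedException
    {
    for( int i = 0; i < maxPolls; i++ )
      {
      if( predicate.isReady() )
        return true;
      Thread.sleep( pollMillis );
      }
    return false;
    }

  public static void main( String[] args ) throws InterruptedException
    {
    // trivially-true predicate just to show the shape
    TriggerPredicate always = new TriggerPredicate()
      {
      public boolean isReady() { return true; }
      };
    System.out.println( "ready: " + waitFor( always, 10, 3 ) );
    }
  }
```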

So yeah, I'm sold on Cascading. =)

Relevant links:
http://www.cascading.org

Relevant API docs:
http://www.cascading.org/javadoc/cascading/flow/MapReduceFlow.html
http://www.cascading.org/javadoc/cascading/cascade/CascadeConnector.html
http://www.cascading.org/javadoc/cascading/cascade/Cascade.html

On Tue, 11 Nov 2008, Jimmy Wan wrote:

>I'd like to take my prototype batch processing of Hadoop jobs and implement
>some type of "real" dependency management and scheduling, in order to better
>utilize my cluster and spread the work out over time. I was thinking of
>adopting one of the existing packages (Cascading, Zookeeper, the existing
>JobControl?) and was hoping for some advice from the mailing list. I looked
>for a direct comparison of Cascading and Zookeeper but couldn't find one.
>
>This is a grossly simplified description of my current, completely naive
>approach:
>
>1) for each day in a month, spawn N threads that each contain a dependent
>series of map/reduce jobs.
>
>2) for each day in a month, spawn N threads that each contain a dependent
>series of map/reduce jobs that are dependent on the output of step #1. These
>are currently separated from the tasks in step #1 mainly because it's easier
>to group them up this way in the event of a failure, but I expect this
>separation to go away.
>
>3) At the end of the month, serially run a series of jobs outside of
>Map/Reduce that basically consist of a single SQL query (I could easily
>convert these to be very simple map/reduce jobs, and probably will, if it
>makes my job processing easier).
>
>The main problems I have are the following:
>1) right now I have a hard time determining which processes need to be run
>in the event of a failure.
>
>Every job has an expected input/output in HDFS so if I have to rerun
>something I usually just use something like "hadoop dfs -rmr <path>" in a
>shell script then hand edit the jobs that need to be rerun.
>
>Is there an example somewhere of code that can read HDFS in order to
>determine if files exist? I poked around a bit and couldn't find one.
>Ideally, my code would be able to read the HDFS config info right out of the
>standard config files so I wouldn't need to create additional configuration
>information.
>
>The job dependencies, while enumerated well, are not isolated all that well.
>Example: I find a bug in 1 of 10 processes in step #1. I'd like to rerun just
>that one process and any dependent processes, but not have to rerun
>everything again.
>
>2) I typically run everything 1 month at a time, but I want to keep the
>option of doing rollups by day. On the 2nd of the month, I'd like to be able
>to run anything that requires data from the 1st of the month. On the 1st of
>the month, I'd like to run anything that requires a full month of data from
>the previous month.
>
>I'd also like my process to be able to account for system failures on
>previous days. i.e. On any given day I'd like to be able to run everything
>for which data is available.
>
>3) Certain types of jobs have external dependencies (ex. MySQL) and I don't
>want to run too many of those types of jobs at the same time since it affects
>my MySQL performance. I'd like some way of describing some type of
>lock on external resources that can be shared across jobs.
>
>Any recommendations on how to best model these things?
>
>I'm thinking that something like Cascading or Zookeeper could help me here.
>My initial take was that Zookeeper was more heavyweight than Cascading,
>requiring additional processes to be running at all times. However, it seems
>like Zookeeper would be better suited to describing mutual exclusions on
>usage of external resources. Can Cascading even do this?
>
>I'd also appreciate any recommendations on how best to tune the hadoop
>processes. My hadoop 0.16.4 cluster is currently relatively small (<10 nodes)
>so I'm thinking the 1GB defaults for my NameNode, DataNodes, and JobTracker
>might be overkill. I also plan to upgrade to 0.17.* or 0.18.* at some point
>in the near future.
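Following up on #3 above: until Cascading or Zookeeper gives you a real notion of resource locks, a counting semaphore shared by the job-launching threads is probably enough to cap concurrent MySQL-touching jobs. A sketch, with names of my own invention:

```java
import java.util.concurrent.Semaphore;

public class MySqlThrottle
  {
  // allow at most 2 MySQL-touching jobs to run at once (tune to taste)
  private static final Semaphore MYSQL_SLOTS = new Semaphore( 2 );

  /** Run a job while holding one of the shared MySQL slots. */
  public static void runWithMySqlSlot( Runnable job )
    {
    MYSQL_SLOTS.acquireUninterruptibly(); // blocks when all slots are taken
    try
      {
      job.run();
      }
    finally
      {
      MYSQL_SLOTS.release();
      }
    }

  /** How many slots are currently free. */
  public static int freeSlots()
    {
    return MYSQL_SLOTS.availablePermits();
    }

  public static void main( String[] args )
    {
    runWithMySqlSlot( new Runnable()
      {
      public void run() { System.out.println( "touching MySQL..." ); }
      } );
    System.out.println( "free slots: " + freeSlots() );
    }
  }
```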
