hadoop-common-user mailing list archives

From Jimmy Wan <ji...@indeed.com>
Subject re: Recommendations on Job Status and Dependency Management
Date Wed, 12 Nov 2008 18:47:10 GMT
I was able to answer one of my own questions:

"Is there an example somewhere of code that can read HDFS in order to
determine if files exist? I poked around a bit and couldn't find one.
Ideally, my code would be able to read the HDFS config info right out of the
standard config files so I wouldn't need to create additional configuration
information."

The following code (plus the corresponding imports) was all that I needed:
        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;

        Configuration configuration = new Configuration();
        FileSystem fileSystem = FileSystem.get(configuration);
        Path path = new Path(filename);
        boolean fileExists = fileSystem.exists(path);

At first, the code didn't work as I expected because my working shell scripts
that made use of "hadoop/bin/hadoop jar my.jar" did not explicitly include
HADOOP_CONF_DIR in my classpath. Once I did that, everything worked just
fine.
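For anyone who can't (or doesn't want to) put HADOOP_CONF_DIR on the
classpath, the config files can also be loaded explicitly via
Configuration.addResource. A minimal sketch of gating a rerun on an output
path's existence (the conf file path and class name below are assumptions;
adjust them to your install):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CheckOutputs {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Explicitly load the site config instead of relying on the
        // classpath. This path is an assumption; point it at your
        // HADOOP_CONF_DIR (hadoop-site.xml in the 0.16/0.17 era).
        conf.addResource(new Path("/opt/hadoop/conf/hadoop-site.xml"));

        FileSystem fs = FileSystem.get(conf);
        // Decide whether a job needs to be (re)run by checking its output.
        Path output = new Path(args[0]);
        if (fs.exists(output)) {
            System.out.println(output + " exists; skipping job");
        } else {
            System.out.println(output + " missing; job needs to run");
        }
    }
}
```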

On Tue, 11 Nov 2008, Jimmy Wan wrote:

>I'd like to take my prototype batch processing of hadoop jobs and implement
>some type of "real" dependency management and scheduling in order to better
>utilize my cluster as well as spread out more work over time. I was thinking
>of adopting one of the existing packages (Cascading, Zookeeper, existing
>JobControl?) and I was hoping to find some better advice from the mailing
>list. I tried to find a more direct comparison of Cascading and Zookeeper but
>I couldn't find one.
>
>This is a grossly simplified description of my current, completely naive
>approach:
>
>1) for each day in a month, spawn N threads that each contain a dependent
>series of map/reduce jobs.
>
>2) for each day in a month, spawn N threads that each contain a dependent
>series of map/reduce jobs that are dependent on the output of step #1. These
>are currently separated from the tasks in step #1 mainly because it's easier
>to group them up this way in the event of a failure, but I expect this
>separation to go away.
>
>3) At the end of the month, serially run a series of jobs outside of
>Map/Reduce that basically consist of a single SQL query (I could easily
>convert these to be very simple map/reduce jobs, and probably will, if it
>makes my job processing easier).
>
>The main problems I have are the following:
>1) Right now I have a hard time determining which processes need to be run
>in the event of a failure.
>
>Every job has an expected input/output in HDFS so if I have to rerun
>something I usually just use something like "hadoop dfs -rmr <path>" in a
>shell script then hand edit the jobs that need to be rerun.
>
>Is there an example somewhere of code that can read HDFS in order to
>determine if files exist? I poked around a bit and couldn't find one.
>Ideally, my code would be able to read the HDFS config info right out of the
>standard config files so I wouldn't need to create additional configuration
>information.
>
>The job dependencies, while enumerated well, are not isolated all that well.
>Example: I find a bug in 1 of 10 processes in step #1. I'd like to rerun just
>that one process and any dependent processes, but not have to rerun
>everything again.
>
>2) I typically run everything 1 month at a time, but I want to keep the
>option of doing rollups by day. On the 2nd of the month, I'd like to be able
>to run anything that requires data from the 1st of the month. On the 1st of
>the month, I'd like to run anything that requires a full month of data from
>the previous month.
>
>I'd also like my process to account for system failures on previous days,
>i.e. on any given day I'd like to be able to run everything for which data
>is available.
>
>3) Certain types of jobs have external dependencies (ex. MySQL) and I don't
>want to run too many of those types of jobs at the same time since it affects
>my MySQL performance. I'd like some way of describing a lock on external
>resources that can be shared across jobs.
>
>Any recommendations on how to best model these things?
>
>I'm thinking that something like Cascading or Zookeeper could help me here.
>My initial take was that Zookeeper was more heavyweight than Cascading,
>requiring additional processes to be running at all times. However, it seems
>like Zookeeper would be better suited to describing mutual exclusions on
>usage of external resources. Can Cascading even do this?
>
>I'd also appreciate any recommendations on how best to tune the hadoop
>processes. My hadoop 0.16.4 cluster is currently relatively small (<10 nodes)
>so I'm thinking the 1GB defaults for my NameNode, DataNodes, and JobTracker
>might be overkill. I also plan to upgrade to 0.17.* or 0.18.* at some point
>in the near future.
>
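On the JobControl option mentioned above: a rough sketch of chaining
dependent jobs with the org.apache.hadoop.mapred.jobcontrol API (the class
and job names here are illustrative, and the exact constructors may differ
between 0.16 and later releases):

```java
import java.util.ArrayList;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.jobcontrol.Job;
import org.apache.hadoop.mapred.jobcontrol.JobControl;

public class MonthlyPipeline {
    public static void main(String[] args) throws Exception {
        JobConf dailyConf = new JobConf();   // step #1: per-day job, configured as usual
        JobConf rollupConf = new JobConf();  // step #2: consumes step #1's output

        Job daily = new Job(dailyConf, new ArrayList());
        Job rollup = new Job(rollupConf, new ArrayList());
        rollup.addDependingJob(daily);  // rollup won't start until daily succeeds

        JobControl control = new JobControl("monthly-rollup");
        control.addJob(daily);
        control.addJob(rollup);

        // JobControl is a Runnable; drive it from its own thread and poll.
        Thread runner = new Thread(control);
        runner.start();
        while (!control.allFinished()) {
            Thread.sleep(5000);
        }
        control.stop();
    }
}
```

This handles the success-ordering side of dependency management; it doesn't
by itself solve the "rerun only the failed subgraph" problem, which still
needs the output-path existence checks described earlier.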

-- 
