hadoop-common-user mailing list archives

From "Owen O'Malley" <...@yahoo-inc.com>
Subject Re: A couple of usability problems
Date Wed, 26 Sep 2007 06:00:57 GMT
On Sep 25, 2007, at 10:30 AM, Nathan Wang wrote:

> 1) Adjusting input set dynamically
> At the start, I had 9090 gzipped input data files for the job:
>     07/09/24 10:26:06 INFO mapred.FileInputFormat: Total input paths to process : 9090
> Then I realized there were 3 files that were bad (couldn't be gunzipped).
> So, I removed them by doing:
>     bin/hadoop dfs -rm srcdir/FILExxx.gz
> 20 hours later, the job failed, and I found a few errors in the log:
>     org.apache.hadoop.ipc.RemoteException: java.io.IOException: Cannot open filename ...FILExxx.gz
> Is it possible that the runtime could adjust the input data set accordingly?

As Devaraj pointed out, this is possible, but in general I think it is
correct to treat this as an error. Planning for the job must happen up
front, before the job is launched. Once a map has been assigned a file,
a mapper that can't read its assigned input is a fatal problem. If
failures are tolerable for your application, you can set the percentage
of mappers and reducers that are allowed to fail before the job is
killed.
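
To make the tolerance idea concrete, here is a minimal sketch of the
arithmetic behind such a threshold. It is a standalone model, not
Hadoop's code: the class name FailureTolerance, the method
shouldKillJob, and the "strictly exceeds" comparison are assumptions
made for illustration. In the org.apache.hadoop.mapred API the
corresponding knobs are, to my knowledge,
JobConf.setMaxMapTaskFailuresPercent(int) and
JobConf.setMaxReduceTaskFailuresPercent(int); check the Javadoc for
your release before relying on them.

```java
// Hypothetical model of a failure-tolerance threshold; not Hadoop's actual code.
class FailureTolerance {

    /**
     * Returns true when the share of failed tasks strictly exceeds the
     * tolerated percentage, i.e. when the job should be killed.
     * (The "strictly exceeds" rule is an assumption of this sketch.)
     */
    static boolean shouldKillJob(int totalTasks, int failedTasks, int maxFailuresPercent) {
        if (totalTasks <= 0) {
            throw new IllegalArgumentException("totalTasks must be positive");
        }
        // failed / total > percent / 100, rewritten without division
        // so that integer rounding cannot hide a failure.
        return (long) failedTasks * 100 > (long) maxFailuresPercent * totalTasks;
    }

    public static void main(String[] args) {
        int totalMaps = 9090; // one map per input file, as in the question
        int failedMaps = 3;   // the three removed files

        // With no tolerance, any failed map kills the job.
        System.out.println(shouldKillJob(totalMaps, failedMaps, 0)); // true
        // Tolerating 1% allows up to 90 failed maps out of 9090.
        System.out.println(shouldKillJob(totalMaps, failedMaps, 1)); // false
    }
}
```

Under these assumptions, a 1% tolerance would have let the job in the
question finish despite the three unreadable inputs.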

> Can we check the existence of the output directory at the very  
> beginning, to save us a day?

It does already. That was done back before 0.1 in HADOOP-3. Was your  
program launching two jobs or something? Very strange.
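
For reference, the idea of such an up-front check can be sketched as
follows. This is only an illustration against the local file system
using java.nio.file; the class OutputDirCheck and the method
checkOutputDir are hypothetical names, and the real check in Hadoop
runs against the job's configured file system (as far as I know, via
the output format's checkOutputSpecs) rather than through this code.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical stand-in for a pre-submission output check; not Hadoop's code.
class OutputDirCheck {

    /**
     * Fails fast if the output directory already exists, so the problem
     * surfaces before any task is scheduled rather than after the job runs.
     */
    static void checkOutputDir(Path outputDir) throws IOException {
        if (Files.exists(outputDir)) {
            throw new IOException("Output directory " + outputDir + " already exists");
        }
    }

    public static void main(String[] args) throws IOException {
        Path existing = Files.createTempDirectory("job-output");
        try {
            checkOutputDir(existing);
            System.out.println("no error (unexpected)");
        } catch (IOException e) {
            System.out.println("rejected before launch: " + e.getMessage());
        } finally {
            Files.delete(existing);
        }
    }
}
```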

-- Owen
