hadoop-common-user mailing list archives

From Nathan Wang <nathanw...@yahoo.com>
Subject A couple of usability problems
Date Tue, 25 Sep 2007 17:30:23 GMT
I have run into a couple of usability problems that I think the development team could address.
I'm currently running a job that takes a whole day to finish.
1) Adjusting input set dynamically
At the start, the job had 9090 gzipped input data files:
    07/09/24 10:26:06 INFO mapred.FileInputFormat: Total input paths to process : 9090

Then I realized that 3 of the files were bad (they couldn't be gunzipped),
so I removed them with:
    bin/hadoop  dfs  -rm  srcdir/FILExxx.gz

20 hours later, the job failed, and I found a few errors like this in the log:
    org.apache.hadoop.ipc.RemoteException: java.io.IOException: Cannot open filename ...FILExxx.gz

Is it possible that the runtime could adjust the input data set accordingly?
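
Until the runtime can do this, a user-side check before submitting the job can catch unreadable archives early. The following is a minimal Python sketch, not part of Hadoop. It assumes the gzipped inputs are available on a local file system (for example, before they are copied into DFS), and the function name find_bad_gzip_files is hypothetical:

```python
import gzip
import zlib
from pathlib import Path


def find_bad_gzip_files(directory, pattern="*.gz", chunk_size=1 << 20):
    """Return the sorted paths under `directory` that cannot be fully gunzipped.

    Assumption: `directory` is a local path; this does not read from DFS.
    """
    bad = []
    for path in sorted(Path(directory).glob(pattern)):
        try:
            with gzip.open(path, "rb") as stream:
                # Read to the end so truncated or corrupt streams are detected,
                # not only files with a bad header.
                while stream.read(chunk_size):
                    pass
        except (OSError, EOFError, zlib.error):
            # OSError covers "not a gzip file", EOFError covers truncation,
            # zlib.error covers corrupt compressed data.
            bad.append(path)
    return bad
```

Running find_bad_gzip_files("srcdir") on a local copy of the inputs and removing the reported files before starting the job would avoid changing the input set while the job is running.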

2) Checking the output directory first
I started my job with the standard command line,
    bin/hadoop  jar  myjob.jar  srcdir  resultdir

Then, after many long hours, the job was about to finish with
    ...INFO mapred.JobClient:  map 100% reduce 100%
But it ended with
    Exception in thread "main" org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory ...resultdir already exists

Can we check the existence of the output directory at the very beginning, to save us a day?
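
As a stopgap, a wrapper script can run the same check before the job is launched. The sketch below is an assumption-laden illustration, not Hadoop code: the function name check_output_dir is hypothetical, and the default existence test only looks at the local file system. For a directory in DFS, a different `exists` callable that queries DFS would have to be passed in.

```python
import os


def check_output_dir(output_dir, exists=os.path.exists):
    """Fail fast if the job's output directory is already present.

    `exists` is a pluggable predicate; the default checks the local file
    system only, so a DFS-aware predicate must be supplied for DFS paths.
    """
    if exists(output_dir):
        raise FileExistsError(
            "Output directory %s already exists" % output_dir
        )
    return output_dir
```

Calling check_output_dir("resultdir") before submitting the job raises immediately when the directory exists, so the error appears in seconds rather than after a full day of processing.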
