hadoop-mapreduce-user mailing list archives

From Kris Nuttycombe <kris.nuttyco...@gmail.com>
Subject Re: Trying to figure out possible causes of this exception
Date Fri, 09 Apr 2010 22:34:23 GMT
Okay, there's a further wrinkle now. I'm getting the same error... but
on my FileOutputPath!

Exception in thread "main" java.io.FileNotFoundException: File does
not exist: hdfs://hadoop-eventlog01.socialmedia.com/test-batchEventLog/out/47c21d67-f7e1-442b-a142-a9a6f9a10a68/data
        at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:457)
        at org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:63)
        at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:241)
        at org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:885)
        at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:779)

I can understand how Hadoop needs to have some way to cope with
nonexistent input paths, but why is it trying to determine the status
of my output path in the getSplits method?
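As an aside on the path spellings that appear in this thread: a URI like hdfs:/test-batchEventLog carries no authority (host) at all, so the actual filesystem has to come from the client's default configuration; only the fully-qualified hdfs://host/... form names a namenode explicitly. A quick check with plain java.net.URI (no Hadoop classes involved) shows the difference:

```java
import java.net.URI;

// Sketch: how the path spellings seen in this thread parse. Only the
// fully-qualified form carries a host; the others leave Hadoop to fall
// back on the configured default filesystem.
public class PathForms {
    public static void main(String[] args) {
        URI bare  = URI.create("hdfs:/test-batchEventLog");
        URI slash = URI.create("hdfs:///test-batchEventLog");
        URI full  = URI.create("hdfs://hadoop-eventlog01.socialmedia.com/test-batchEventLog");

        System.out.println(bare.getHost());   // null -- no authority at all
        System.out.println(slash.getHost());  // null -- empty authority
        System.out.println(full.getHost());   // hadoop-eventlog01.socialmedia.com
        System.out.println(bare.getPath());   // /test-batchEventLog, same in all three
    }
}
```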



On Wed, Apr 7, 2010 at 9:31 AM, Kris Nuttycombe
<kris.nuttycombe@gmail.com> wrote:
> So, the issue is that the input path I specified was a directory, not a file.
> As a result, Hadoop helpfully assumed that I wanted a file called
> "data" in that directory to be the input, and proceeded down the path
> with that assumption, instead of failing fast. I had to go to the
> source code to figure out why it was doing this.
> I'm finding that Hadoop has this sort of behavior (assuming a useless
> default instead of failing fast) in a number of places, some of them
> highly problematic, such as the dreaded DrWho default user. It was
> only after reading http://blog.rapleaf.com/dev/?p=382 that I figured
> out why some of my services were losing data: the hadoop libs fall
> back to DrWho under strange conditions, then throw a permissions
> exception when attempting to write a file, which subsequently kills a
> buffer-flush thread of a long-lived process...
> It would be very helpful if Hadoop failed fast when encountering
> incorrect configuration rather than assuming a default that will
> essentially never be used in a production environment. Both of these
> issues have cost me far more time and money in lost business ($50k
> just this week thanks to DrWho) than failing fast ever would have.
> Thanks,
> Kris
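[The fail-fast behavior argued for above can be sketched as a pre-submission guard. This is a generic illustration only: java.nio.file stands in for Hadoop's FileSystem API, and requireExistingFile is a hypothetical helper, not anything Hadoop provides.]

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical pre-submission guard: validate the input path explicitly
// and fail fast, instead of letting the framework substitute a default.
public class FailFast {
    static Path requireExistingFile(Path p) {
        if (!Files.exists(p))
            throw new IllegalArgumentException("Input path does not exist: " + p);
        if (Files.isDirectory(p))
            throw new IllegalArgumentException(
                "Input path is a directory, not a file: " + p);
        return p;
    }

    public static void main(String[] args) throws Exception {
        Path f = Files.createTempFile("event-log", ".seq");
        System.out.println(requireExistingFile(f)); // prints the temp file path
        try {
            requireExistingFile(Paths.get("/no/such/file"));
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage()); // fails fast, before job submission
        }
    }
}
```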
> On Wed, Apr 7, 2010 at 6:23 AM, Sonal Goyal <sonalgoyal4@gmail.com> wrote:
>> hi Kris,
>> It seems your program cannot find the input file. Have you done a hadoop fs
>> -ls to verify that the file exists? Also, the path URL should be
>> hdfs://......
>> Thanks and Regards,
>> Sonal
>> www.meghsoft.com
>> On Wed, Apr 7, 2010 at 1:16 AM, Kris Nuttycombe <kris.nuttycombe@gmail.com>
>> wrote:
>>> Exception in thread "main" java.io.FileNotFoundException: File does
>>> not exist: hdfs:///test-batchEventLog/metrics/data
>>>        at
>>> org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:457)
>>>        at
>>> org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:63)
>>>        at
>>> org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:241)
>>>        at
>>> org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:885)
>>>        at
>>> org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:779)
>>>        at org.apache.hadoop.mapreduce.Job.submit(Job.java:432)
>>>        at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:447)
>>>        at reporting.HDFSMapReduceQuery.execute(HDFSMetricsQuery.scala:60)
>>> My job config contains the following:
>>>    println("using input path: " + inPath)
>>>    println("using output path: " + outPath)
>>>    FileInputFormat.setInputPaths(job, inPath);
>>>    FileOutputFormat.setOutputPath(job, outPath)
>>> with input & output paths printed out as:
>>> using input path: hdfs:/test-batchEventLog
>>> using output path:
>>> hdfs:/test-batchEventLog/out/03d24392-9bd9-4b23-8240-aceb54b3473c
>>> Any ideas why this would be occurring?
>>> Thanks,
>>> Kris
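[For readers hitting the same trap: the directory-to-"data" substitution described upthread can be sketched roughly as below. This is a paraphrase of what 0.20-era SequenceFileInputFormat.listStatus appears to do (a directory is assumed to be a MapFile, so dir/data is silently substituted), using plain java.nio paths instead of Hadoop's FileStatus; the names here are illustrative, not Hadoop's.]

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Rough paraphrase of the fallback described in this thread: when a
// listed input is a directory, it is assumed to be a MapFile and
// <dir>/data is silently substituted. If that file is missing, the
// later status lookup is what throws FileNotFoundException.
public class MapFileFallback {
    static final String DATA_FILE_NAME = "data"; // MapFile convention

    static List<Path> listStatusSketch(List<Path> inputs, Predicate<Path> isDir) {
        List<Path> out = new ArrayList<>();
        for (Path p : inputs) {
            // The silent substitution -- no error if p was a misconfigured
            // directory rather than an intended MapFile.
            out.add(isDir.test(p) ? p.resolve(DATA_FILE_NAME) : p);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Path> resolved =
            listStatusSketch(List.of(Paths.get("/test-batchEventLog")), p -> true);
        System.out.println(resolved); // [/test-batchEventLog/data]
    }
}
```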
