crunch-dev mailing list archives

From "Sean Owen (JIRA)" <>
Subject [jira] [Created] (CRUNCH-575) DistributedPipeline temp dir choice can collide with itself
Date Mon, 19 Oct 2015 07:46:05 GMT
Sean Owen created CRUNCH-575:

             Summary: DistributedPipeline temp dir choice can collide with itself
                 Key: CRUNCH-575
             Project: Crunch
          Issue Type: Bug
          Components: Core
    Affects Versions: 0.12.0
            Reporter: Sean Owen
            Assignee: Josh Wills
            Priority: Minor

We've observed that Crunch jobs can fail because the output temp dir already exists:

2015-04-02 04:45:49,208 INFO org.apache.crunch.hadoop.mapreduce.lib.jobcontrol.CrunchControlledJob:
org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory /tmp/crunch-686245394/p2/output already exists
    at org.apache.hadoop.mapreduce.lib.output.FileOutputFormat.checkOutputSpecs(

One possible cause is the choice of random directory name, which is based on a random nonnegative
32-bit int (2^31 possible values). The chance of collision is more than 50% at about 55,000 temp dirs.
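
For reference, the 55,000 figure is consistent with the usual birthday approximation. The sketch below is only an illustration of that arithmetic, not code from Crunch; the class and method names are made up for this message, and it assumes directory names are drawn uniformly and independently from the available values.

```java
public class BirthdayBound {
    // Approximate number of uniform random draws from `size` possible values
    // at which the chance of at least one collision reaches 50%:
    //   n ~= sqrt(2 * ln(2) * size)
    public static double drawsForHalfCollision(double size) {
        return Math.sqrt(2.0 * Math.log(2.0) * size);
    }

    // Approximate chance of at least one collision among n draws:
    //   p ~= 1 - exp(-n * (n - 1) / (2 * size))
    public static double collisionProbability(double n, double size) {
        return 1.0 - Math.exp(-n * (n - 1.0) / (2.0 * size));
    }

    public static void main(String[] args) {
        double nonNegativeInts = Math.pow(2, 31);   // nonnegative 32-bit ints
        double nonNegativeLongs = Math.pow(2, 63);  // nonnegative 64-bit longs

        System.out.printf("50%% point, 32-bit: %.0f dirs%n",
                drawsForHalfCollision(nonNegativeInts));
        System.out.printf("50%% point, 64-bit: %.0f dirs%n",
                drawsForHalfCollision(nonNegativeLongs));
        System.out.printf("P(collision) at 55,000 dirs, 32-bit: %.3f%n",
                collisionProbability(55000, nonNegativeInts));
    }
}
```

This counts all temp dirs ever created under the same parent that have not been cleaned up, so a long-lived cluster can accumulate them over time.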

A suggested fix, at least for that theoretical cause, is to generate a much larger random
value. 64 bits should put this firmly in the realm of the extremely improbable (billions, not
tens of thousands).
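
A minimal sketch of what the larger random value could look like follows. This is not the actual DistributedPipeline code; the class name, the helper randomTempDirName, and the "/crunch-" prefix handling are assumptions modeled on the path in the log above.

```java
import java.security.SecureRandom;

public class TempDirNames {
    // SecureRandom is used here because java.util.Random has a 48-bit seed,
    // so its nextLong() cannot produce every possible 64-bit value.
    private static final SecureRandom RANDOM = new SecureRandom();

    // Hypothetical helper: builds a temp dir path with a random nonnegative
    // 64-bit suffix (63 random bits after masking off the sign bit).
    public static String randomTempDirName(String baseDir) {
        long value = RANDOM.nextLong() & Long.MAX_VALUE;
        return baseDir + "/crunch-" + value;
    }

    public static void main(String[] args) {
        System.out.println(randomTempDirName("/tmp"));
    }
}
```

Even with a larger value, checking whether the chosen directory already exists and retrying would guard against the remaining chance of a collision.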

(HT [~wilfreds] / CC [~tomwhite])
