crunch-dev mailing list archives

From "Ben Roling (JIRA)" <>
Subject [jira] [Created] (CRUNCH-515) Decrease probability of collision on Crunch temp directories
Date Wed, 29 Apr 2015 17:30:06 GMT
Ben Roling created CRUNCH-515:

             Summary: Decrease probability of collision on Crunch temp directories
                 Key: CRUNCH-515
             Project: Crunch
          Issue Type: Improvement
          Components: Core
    Affects Versions: 0.11.0, 0.8.4
            Reporter: Ben Roling
            Assignee: Josh Wills

I've heard reports of failures of Crunch pipelines at our organization due to collision on
temp directories.

Take the following stack trace from an old internal email thread I dug up as an example:

2015-04-02 04:45:49,208 INFO org.apache.crunch.hadoop.mapreduce.lib.jobcontrol.CrunchControlledJob:
org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory /tmp/crunch-686245394/p2/output
already exists
    at org.apache.hadoop.mapreduce.lib.output.FileOutputFormat.checkOutputSpecs(
    at org.apache.hadoop.mapred.JobClient$
    at org.apache.hadoop.mapred.JobClient$
    at Method)
    at org.apache.hadoop.mapred.JobClient.submitJobInternal(
    at org.apache.hadoop.mapreduce.Job.submit(
    at org.apache.crunch.hadoop.mapreduce.lib.jobcontrol.CrunchControlledJob.submit(
    at org.apache.crunch.hadoop.mapreduce.lib.jobcontrol.CrunchJobControl.startReadyJobs(
    at org.apache.crunch.hadoop.mapreduce.lib.jobcontrol.CrunchJobControl.pollJobStatusAndStartNewOnes(

What we found in this case is that the pre-existing directory was rather old.  It hung around because
we're doing a poor job of cleaning old garbage out of our HDFS /tmp directory.  We intend
to set up a job to delete anything older than a couple of weeks or so from /tmp, but I think
the chance of a collision will still be high enough that failures like this could happen
on occasion.

The temp directory Crunch chooses is a random 31-bit value:

I say 31-bit value because it comes from a 32-bit random integer but only includes non-negative
values, thereby excluding 1 bit (the sign bit).
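
To illustrate where the 31 bits come from, here is a minimal sketch of one way to get a non-negative
value out of a 32-bit random integer: clear the sign bit with a mask.  The class and method names are
made up for illustration, and whether Crunch uses this exact expression is an assumption; only the
"crunch-<number>" shape of the directory name is taken from the stack trace above.

```java
import java.util.Random;

// Illustrative sketch only; names are hypothetical, not Crunch's actual code.
final class TempSuffixSketch {
    private static final Random RANDOM = new Random();

    // Random.nextInt() returns any of the 2^32 int values.
    // Masking with Integer.MAX_VALUE (0x7FFFFFFF) clears the sign bit,
    // leaving 31 random bits: a value in [0, 2^31 - 1].
    static int randomSuffix() {
        return RANDOM.nextInt() & Integer.MAX_VALUE;
    }

    // Builds a name shaped like the directory seen in the stack trace.
    static String tempDirName(String base) {
        return base + "/crunch-" + randomSuffix();
    }

    public static void main(String[] args) {
        System.out.println(tempDirName("/tmp"));
    }
}
```

Only 2^31 distinct directory names are possible under this scheme, which is what drives the
collision odds discussed below.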

The following blog post shows some probabilities for 32-bit hash collisions, which are essentially
the same problem:

Since we're dealing with 31 bits instead of 32, the probabilities will be higher than expressed
there for 32 bits.  Even with 32 bits the probability of collision is 1 in 100 with just 9292 hashes.
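
For anyone who wants to check the numbers, here is a rough sketch using the standard birthday-problem
approximation, p = 1 - exp(-n(n-1) / (2 * 2^bits)).  It assumes the values are uniformly random and
independent.  The class name is made up, and the 250-per-day, 14-day scenario in main is only an
illustrative reading of the figures in this issue, not a measurement.

```java
// Illustrative sketch only; assumes uniformly distributed, independent values.
final class CollisionOdds {

    // Approximate probability that at least two of n random values collide
    // when each is drawn from a space of 2^bits possibilities.
    static double collisionProbability(long n, int bits) {
        double space = Math.pow(2.0, bits);
        double exponent = ((double) n * (n - 1)) / (2.0 * space);
        // -expm1(-x) computes 1 - e^(-x) accurately for small x.
        return -Math.expm1(-exponent);
    }

    public static void main(String[] args) {
        // 9292 values in a 32-bit space: roughly 0.01 (1 in 100).
        System.out.println(collisionProbability(9292, 32));
        // The same count in a 31-bit space: roughly 0.02 (1 in 50).
        System.out.println(collisionProbability(9292, 31));
        // Hypothetical scenario: 250 stray directories per day kept for 14 days.
        System.out.println(collisionProbability(250L * 14, 31));
    }
}
```

Losing one bit halves the space, so for small probabilities the odds of a collision roughly double.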

I have not done any thorough investigation to understand why, but in our production environment
we have a lot of Crunch jobs and we are leaving 200-300 stray Crunch temp directories per
day.  Depending on how aggressive we get with a scheduled job to clean old stuff out of temp
we could still have a realistic chance of hitting a collision.

My proposal is to change the random integer component of the temp path to a UUID or something
similar to make it drastically more unlikely that a collision will ever occur regardless of
whether or not "/tmp" is ever cleaned up.
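
A minimal sketch of what I have in mind, using java.util.UUID from the JDK.  The class and method
names are hypothetical and this is not a patch against Crunch; it only shows the shape of the
resulting directory name.

```java
import java.util.UUID;

// Illustrative sketch only; names are hypothetical, not Crunch's actual code.
final class UuidTempDirSketch {

    // UUID.randomUUID() returns a version 4 UUID with 122 random bits,
    // compared to the 31 random bits of the current integer suffix.
    static String tempDirName(String base) {
        return base + "/crunch-" + UUID.randomUUID();
    }

    public static void main(String[] args) {
        System.out.println(tempDirName("/tmp"));
    }
}
```

With 122 random bits the name space is so large that a collision is not a practical concern, even if
stray directories are never cleaned out of /tmp.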
