crunch-dev mailing list archives

From "Ben Roling (JIRA)" <>
Subject [jira] [Commented] (CRUNCH-515) Decrease probability of collision on Crunch temp directories
Date Thu, 30 Apr 2015 13:46:06 GMT


Ben Roling commented on CRUNCH-515:

{quote}if the cleanup jobs that folks were using expected to see the string "crunch-" followed
by some number of digits, wouldn't the UUID string (which is hex IIRC) cause their cleanup
scripts to miss some directories b/c the pattern wouldn't match?{quote}

The /tmp cleanup I intend for us to do in our clusters will be even more general than
just "crunch-" cleanup.  I'm expecting we will delete _anything_ in /tmp older than X days.
That said, you're right that some other Crunch consumers may have set up /tmp/crunch-
cleanup patterns specific enough to be broken by this change.  My guess is that the risk
is relatively small, but that is just my opinion.

{quote}any idea if those stray Crunch dirs are being left around by successful jobs, or jobs that
have crashed?{quote}

We don't have nearly as many failed or killed jobs as we have stray /tmp/crunch-* directories
so most of them must be from successful jobs.  I don't see any obvious way for me to trace
back from the stray directories to the jobs that created them to be able to do analytics to
identify which jobs are leaving behind the most stray directories.  I suppose it _might_ be
possible with searches of the job logs but I don't have an easy mechanism available to me
to search across all of those at the moment.  I can look into it a bit more though.

I talked with [~mkwhitacre] about this previously and he had a theory about some activity
occurring after pipeline.done() that was resulting in the temp dirs being left behind.  I
don't remember the specifics of the theory and I haven't had a chance to try to validate it.

> Decrease probability of collision on Crunch temp directories
> ------------------------------------------------------------
>                 Key: CRUNCH-515
>                 URL:
>             Project: Crunch
>          Issue Type: Improvement
>          Components: Core
>    Affects Versions: 0.8.4, 0.11.0
>            Reporter: Ben Roling
>            Assignee: Josh Wills
>         Attachments: CRUNCH-515-1.patch
> I've heard reports of failures of Crunch pipelines at our organization due to collision on temp directories.
> Take the following stack trace from an old internal email thread I dug up as an example:
> {noformat}
> 2015-04-02 04:45:49,208 INFO org.apache.crunch.hadoop.mapreduce.lib.jobcontrol.CrunchControlledJob: org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory /tmp/crunch-686245394/p2/output already exists
>     at org.apache.hadoop.mapreduce.lib.output.FileOutputFormat.checkOutputSpecs(
>     at org.apache.hadoop.mapred.JobClient$
>     at org.apache.hadoop.mapred.JobClient$
>     at Method)
>     at
>     at
>     at org.apache.hadoop.mapred.JobClient.submitJobInternal(
>     at org.apache.hadoop.mapreduce.Job.submit(
>     at org.apache.crunch.hadoop.mapreduce.lib.jobcontrol.CrunchControlledJob.submit(
>     at org.apache.crunch.hadoop.mapreduce.lib.jobcontrol.CrunchJobControl.startReadyJobs(
>     at org.apache.crunch.hadoop.mapreduce.lib.jobcontrol.CrunchJobControl.pollJobStatusAndStartNewOnes(
>     at
>     at$000(
>     at$
>     at
> {noformat}
> What we found in this case is the pre-existing directory was rather old.  It hung around because we're doing a poor job of cleaning old garbage out of our HDFS /tmp directory.  We intend to set up a job to delete stuff older than a couple of weeks or so out of /tmp but I think the chances of a collision will still be high enough that failures like this might still happen on occasion.
> The temp directory Crunch chooses is a random 31-bit value:
> I say 31 bit value because it comes from a 32-bit random integer but only includes positive values, thereby excluding 1 bit.
> The following blog post shows some probabilities for 32-bit hash collisions, which are essentially the same problem:
> Since we're dealing with 31 bits instead of 32 the probabilities will be higher than expressed there for 32 bits.  Even with 32 bits the probability of collision is 1 in 100 with just 9292 values.
> I have not done any thorough investigation to understand why, but in our production environment we have a lot of Crunch jobs and we are leaving 200-300 stray Crunch temp directories per day.  Depending on how aggressive we get with a scheduled job to clean old stuff out of temp we could still have a realistic chance of hitting a collision.
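
For anyone who wants to check the collision figures above, here is a rough sketch using the standard birthday-problem approximation, p ≈ 1 - exp(-n(n-1)/(2d)), for n random values drawn from a space of size d. The class and method names are made up for illustration and are not part of Crunch, and the two-week retention window in the last row is only an assumed example.

```java
public final class TempDirCollision {

    /**
     * Approximate probability that at least two of n uniformly random
     * values drawn from a space of size d are equal (birthday problem).
     */
    public static double collisionProbability(long n, double d) {
        // 1 - exp(-x), written with expm1 to keep precision when x is tiny.
        double x = (double) n * (n - 1) / (2.0 * d);
        return -Math.expm1(-x);
    }

    public static void main(String[] args) {
        double space31 = Math.pow(2, 31); // positive 32-bit ints only
        double space32 = Math.pow(2, 32);

        // 9292 is the count quoted for a 1-in-100 chance at 32 bits;
        // 300 * 14 assumes 300 stray dirs/day kept for two weeks.
        long[] counts = {300L, 9292L, 300L * 14};
        for (long n : counts) {
            System.out.printf("n=%d  31-bit: %.5f  32-bit: %.5f%n",
                    n,
                    collisionProbability(n, space31),
                    collisionProbability(n, space32));
        }
    }
}
```

The 9292 row should come out near 0.01 for the 32-bit space, in line with the figure quoted above, and roughly twice that for the 31-bit space.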
> My proposal is to change the random integer component of the temp path to a UUID or something similar to make it drastically more unlikely that a collision will ever occur regardless of whether or not "/tmp" is ever cleaned up.
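
To make the proposal concrete, here is a minimal sketch of what a UUID-based temp path could look like. It is not the attached patch and not Crunch's actual code; the class name, the method name, and the "crunch-" prefix placement are assumptions for illustration only.

```java
import java.util.UUID;

public final class TempDirNames {

    /**
     * Hypothetical helper: builds a temp path under the given root
     * using a random UUID instead of a random positive int.
     */
    public static String uniqueTempPath(String root) {
        // UUID.randomUUID() returns a version-4 UUID with 122 random bits,
        // rendered as lowercase hex groups separated by hyphens.
        return root + "/crunch-" + UUID.randomUUID();
    }

    public static void main(String[] args) {
        System.out.println(uniqueTempPath("/tmp"));
    }
}
```

Note that a name built this way contains hex digits and hyphens, so a cleanup pattern that expects "crunch-" followed only by decimal digits would no longer match it.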
