crunch-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Nathan Barry (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CRUNCH-515) Decrease probability of collision on Crunch temp directories
Date Mon, 19 Oct 2015 15:20:05 GMT

    [ https://issues.apache.org/jira/browse/CRUNCH-515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14963456#comment-14963456
] 

Nathan Barry commented on CRUNCH-515:
-------------------------------------

Side bar: While trying to investigate stray /tmp directories left behind in our pipeline we
noticed that if someone sets the config via {{setConfiguration(Configuration conf)}} in the
pipeline\[1\]  after initial pipeline creation the system now has 2 different temporary directories
created yet {{cleanup()}} is only cleaning up up the temp directory for the current config
\[2\] leaving the original temp directory behind.  Sadly this wasn't our issue and we have
yet to fully determine why some stray temp directories persist after pipeline completion.

So for some, cleaning up these alternate directories _could_ help reduce collision risk. 

\[1\] https://github.com/apache/crunch/blob/apache-crunch-0.11.0/crunch-core/src/main/java/org/apache/crunch/impl/dist/DistributedPipeline.java#L110
\[2\] https://github.com/apache/crunch/blob/apache-crunch-0.11.0/crunch-core/src/main/java/org/apache/crunch/impl/dist/DistributedPipeline.java#L346

> Decrease probability of collision on Crunch temp directories
> ------------------------------------------------------------
>
>                 Key: CRUNCH-515
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-515
>             Project: Crunch
>          Issue Type: Improvement
>          Components: Core
>    Affects Versions: 0.8.4, 0.11.0
>            Reporter: Ben Roling
>            Assignee: Josh Wills
>         Attachments: CRUNCH-515-1.patch, CRUNCH_515.patch
>
>
> I've heard reports of failures of Crunch pipelines at our organization due to collision
on temp directories.
> Take the following stack trace from an old internal email thread I dug up as an example:
> {noformat}
> 2015-04-02 04:45:49,208 INFO org.apache.crunch.hadoop.mapreduce.lib.jobcontrol.CrunchControlledJob:
org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory /tmp/crunch-686245394/p2/output
already exists
>     at org.apache.hadoop.mapreduce.lib.output.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:132)
>     at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:1013)
>     at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:974)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:394)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1438)
>     at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:974)
>     at org.apache.hadoop.mapreduce.Job.submit(Job.java:582)
>     at org.apache.crunch.hadoop.mapreduce.lib.jobcontrol.CrunchControlledJob.submit(CrunchControlledJob.java:340)
>     at org.apache.crunch.hadoop.mapreduce.lib.jobcontrol.CrunchJobControl.startReadyJobs(CrunchJobControl.java:277)
>     at org.apache.crunch.hadoop.mapreduce.lib.jobcontrol.CrunchJobControl.pollJobStatusAndStartNewOnes(CrunchJobControl.java:316)
>     at org.apache.crunch.impl.mr.exec.MRExecutor.monitorLoop(MRExecutor.java:113)
>     at org.apache.crunch.impl.mr.exec.MRExecutor.access$000(MRExecutor.java:55)
>     at org.apache.crunch.impl.mr.exec.MRExecutor$1.run(MRExecutor.java:84)
>     at java.lang.Thread.run(Thread.java:682)
> {noformat}
> What we found in this case is the pre-existing directory was rather old.  It hung around
because we're doing a poor job of cleaning old garbage out of our HDFS /tmp directory.  We
intend to set up a job to delete stuff older than a couple of weeks or so out of /tmp but
I think the chances of a collision will still be high enough that failures like this might
still happen on occasion.
> The temp directory Crunch chooses is a random 31-bit value:
> https://github.com/apache/crunch/blob/apache-crunch-0.11.0/crunch-core/src/main/java/org/apache/crunch/impl/dist/DistributedPipeline.java#L326
> I say 31 bit value because it comes from a 32-bit random integer but only includes positive
values, thereby excluding 1 bit.
> The following blog post shows some probabilities for 32-bit hash collisions, which are
essentially the same problem:
> http://preshing.com/20110504/hash-collision-probabilities/
> Since we're dealing with 31 bits instead of 32 the probabilities will be higher than
expressed there for 32 bits.  Even with 32 bits the probability of collision is 1 in 100 with
just 9292 values.
> I have not done any thorough investigation to understand why, but in our production environment
we have a lot of Crunch jobs and we are leaving 200-300 stray Crunch temp directories per
day.  Depending on how aggressive we get with a scheduled job to clean old stuff out of temp
we could still have a realistic chance of hitting a collision.
> My proposal is to change the random integer component of the temp path to a UUID or something
similar to make it drastically more unlikely that a collision will ever occur regardless of
whether or not "/tmp" is ever cleaned up.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Mime
View raw message