Return-Path:
X-Original-To: apmail-crunch-dev-archive@www.apache.org
Delivered-To: apmail-crunch-dev-archive@www.apache.org
Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id E4C691802F for ; Wed, 29 Apr 2015 17:30:06 +0000 (UTC)
Received: (qmail 62678 invoked by uid 500); 29 Apr 2015 17:30:06 -0000
Delivered-To: apmail-crunch-dev-archive@crunch.apache.org
Received: (qmail 62634 invoked by uid 500); 29 Apr 2015 17:30:06 -0000
Mailing-List: contact dev-help@crunch.apache.org; run by ezmlm
Precedence: bulk
List-Help:
List-Unsubscribe:
List-Post:
List-Id:
Reply-To: dev@crunch.apache.org
Delivered-To: mailing list dev@crunch.apache.org
Received: (qmail 62619 invoked by uid 500); 29 Apr 2015 17:30:06 -0000
Delivered-To: apmail-incubator-crunch-dev@incubator.apache.org
Received: (qmail 62616 invoked by uid 99); 29 Apr 2015 17:30:06 -0000
Received: from arcas.apache.org (HELO arcas.apache.org) (140.211.11.28) by apache.org (qpsmtpd/0.29) with ESMTP; Wed, 29 Apr 2015 17:30:06 +0000
Date: Wed, 29 Apr 2015 17:30:06 +0000 (UTC)
From: "Ben Roling (JIRA)"
To: crunch-dev@incubator.apache.org
Message-ID:
In-Reply-To:
References:
Subject: [jira] [Created] (CRUNCH-515) Decrease probability of collision on Crunch temp directories
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
X-JIRA-FingerPrint: 30527f35849b9dde25b450d4833f0394

Ben Roling created CRUNCH-515:
---------------------------------

             Summary: Decrease probability of collision on Crunch temp directories
                 Key: CRUNCH-515
                 URL: https://issues.apache.org/jira/browse/CRUNCH-515
             Project: Crunch
          Issue Type: Improvement
          Components: Core
    Affects Versions: 0.11.0, 0.8.4
            Reporter: Ben Roling
            Assignee: Josh Wills

I've heard reports of failures of Crunch pipelines at our organization due to collision on temp directories.
Take the following stack trace from an old internal email thread I dug up as an example:

{noformat}
2015-04-02 04:45:49,208 INFO org.apache.crunch.hadoop.mapreduce.lib.jobcontrol.CrunchControlledJob: org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory /tmp/crunch-686245394/p2/output already exists
	at org.apache.hadoop.mapreduce.lib.output.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:132)
	at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:1013)
	at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:974)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:394)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1438)
	at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:974)
	at org.apache.hadoop.mapreduce.Job.submit(Job.java:582)
	at org.apache.crunch.hadoop.mapreduce.lib.jobcontrol.CrunchControlledJob.submit(CrunchControlledJob.java:340)
	at org.apache.crunch.hadoop.mapreduce.lib.jobcontrol.CrunchJobControl.startReadyJobs(CrunchJobControl.java:277)
	at org.apache.crunch.hadoop.mapreduce.lib.jobcontrol.CrunchJobControl.pollJobStatusAndStartNewOnes(CrunchJobControl.java:316)
	at org.apache.crunch.impl.mr.exec.MRExecutor.monitorLoop(MRExecutor.java:113)
	at org.apache.crunch.impl.mr.exec.MRExecutor.access$000(MRExecutor.java:55)
	at org.apache.crunch.impl.mr.exec.MRExecutor$1.run(MRExecutor.java:84)
	at java.lang.Thread.run(Thread.java:682)
{noformat}

What we found in this case is the pre-existing directory was rather old. It hung around because we're doing a poor job of cleaning old garbage out of our HDFS /tmp directory. We intend to set up a job to delete stuff older than a couple of weeks or so out of /tmp but I think the chances of a collision will still be high enough that failures like this might still happen on occasion.

The temp directory name Crunch chooses is built from a random 31-bit value:
https://github.com/apache/crunch/blob/apache-crunch-0.11.0/crunch-core/src/main/java/org/apache/crunch/impl/dist/DistributedPipeline.java#L326

I say 31-bit value because it comes from a 32-bit random integer but only includes positive values, thereby excluding 1 bit.

The following blog post shows some probabilities for 32-bit hash collisions, which are essentially the same problem:
http://preshing.com/20110504/hash-collision-probabilities/

Since we're dealing with 31 bits instead of 32, the value space is half as large, so the probabilities will be higher than the ones listed there for 32 bits (roughly double for the same number of values). Even with 32 bits the probability of collision is 1 in 100 with just 9292 values.

I have not done any thorough investigation to understand why, but in our production environment we have a lot of Crunch jobs and we are leaving 200-300 stray Crunch temp directories per day. Depending on how aggressive we get with a scheduled job to clean old stuff out of /tmp, we could still have a realistic chance of hitting a collision.

My proposal is to change the random integer component of the temp path to a UUID or something similar, to make it drastically more unlikely that a collision will ever occur, regardless of whether or not "/tmp" is ever cleaned up.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
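
For reference, the collision estimate discussed in this issue can be sketched with the usual birthday-problem approximation, P ≈ 1 - exp(-n(n-1) / (2N)), where N is the number of possible values. This is only an illustration, not Crunch code: the class name `TempDirCollision`, both method names, and the 4200 figure (300 stray directories per day kept for 14 days) are assumptions made up for the example. The only library facts relied on are that `java.util.UUID.randomUUID()` returns a version 4 UUID with 122 random bits, and that the existing directories use a `crunch-` prefix as seen in the stack trace.

```java
import java.util.UUID;

final class TempDirCollision {

    // Birthday-problem approximation for n random picks from a space of
    // spaceSize equally likely values: P ~= 1 - exp(-n(n-1) / (2 * spaceSize)).
    static double collisionProbability(long n, double spaceSize) {
        double exponent = ((double) n * (double) (n - 1)) / (2.0 * spaceSize);
        // -expm1(-x) equals 1 - e^(-x) and stays accurate when x is tiny.
        return -Math.expm1(-exponent);
    }

    // Hypothetical replacement name: keeps the "crunch-" prefix seen in the
    // stack trace but uses a random UUID (122 random bits) as the suffix.
    static String uuidTempDirName() {
        return "crunch-" + UUID.randomUUID();
    }

    public static void main(String[] args) {
        double bits31 = Math.pow(2, 31);   // current scheme: positive ints only
        double bits32 = Math.pow(2, 32);   // the case covered by the blog post
        double bits122 = Math.pow(2, 122); // random bits in a version 4 UUID

        // 300 = one day of stray directories, 4200 = assumed 14 days of them,
        // 9292 = the "1 in 100" figure quoted for 32 bits.
        long[] counts = {300, 4200, 9292};
        for (long n : counts) {
            System.out.printf("n=%d  31-bit: %.3e  32-bit: %.3e  UUID: %.3e%n",
                    n,
                    collisionProbability(n, bits31),
                    collisionProbability(n, bits32),
                    collisionProbability(n, bits122));
        }
        System.out.println("example name: " + uuidTempDirName());
    }
}
```

Running `main` prints the estimated probability that any two of n leftover directories share a name under each scheme. The estimate treats names as uniformly random and independent, so it should be read as a rough order of magnitude rather than a measurement of our cluster.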