Date: Mon, 19 Oct 2015 16:53:05 +0000 (UTC)
From: "Sean Owen (JIRA)"
To: crunch-dev@incubator.apache.org
Reply-To: dev@crunch.apache.org
Mailing-List: contact dev-help@crunch.apache.org; run by ezmlm
Subject: [jira] [Commented] (CRUNCH-515) Decrease probability of collision on Crunch temp directories

    [ https://issues.apache.org/jira/browse/CRUNCH-515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14963594#comment-14963594 ]

Sean Owen commented on CRUNCH-515:
----------------------------------

Better patch. This re-creates the temp dir if it doesn't exist, which appears to be how it's expected to work, and is coherent. At least, we probably don't want to change the behavior.
Tests pass. Tests generate a lot of warnings though about done() not being called. Too noisy? Maybe just nix this message and silently clean up?

> Decrease probability of collision on Crunch temp directories
> ------------------------------------------------------------
>
>                 Key: CRUNCH-515
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-515
>             Project: Crunch
>          Issue Type: Improvement
>          Components: Core
>    Affects Versions: 0.8.4, 0.11.0
>            Reporter: Ben Roling
>            Assignee: Josh Wills
>         Attachments: CRUNCH-515-1.patch
>
>
> I've heard reports of failures of Crunch pipelines at our organization due to collision on temp directories.
> Take the following stack trace from an old internal email thread I dug up as an example:
> {noformat}
> 2015-04-02 04:45:49,208 INFO org.apache.crunch.hadoop.mapreduce.lib.jobcontrol.CrunchControlledJob: org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory /tmp/crunch-686245394/p2/output already exists
>     at org.apache.hadoop.mapreduce.lib.output.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:132)
>     at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:1013)
>     at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:974)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:394)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1438)
>     at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:974)
>     at org.apache.hadoop.mapreduce.Job.submit(Job.java:582)
>     at org.apache.crunch.hadoop.mapreduce.lib.jobcontrol.CrunchControlledJob.submit(CrunchControlledJob.java:340)
>     at org.apache.crunch.hadoop.mapreduce.lib.jobcontrol.CrunchJobControl.startReadyJobs(CrunchJobControl.java:277)
>     at org.apache.crunch.hadoop.mapreduce.lib.jobcontrol.CrunchJobControl.pollJobStatusAndStartNewOnes(CrunchJobControl.java:316)
>     at org.apache.crunch.impl.mr.exec.MRExecutor.monitorLoop(MRExecutor.java:113)
>     at org.apache.crunch.impl.mr.exec.MRExecutor.access$000(MRExecutor.java:55)
>     at org.apache.crunch.impl.mr.exec.MRExecutor$1.run(MRExecutor.java:84)
>     at java.lang.Thread.run(Thread.java:682)
> {noformat}
> What we found in this case is the pre-existing directory was rather old. It hung around because we're doing a poor job of cleaning old garbage out of our HDFS /tmp directory. We intend to set up a job to delete stuff older than a couple of weeks or so out of /tmp but I think the chances of a collision will still be high enough that failures like this might still happen on occasion.
> The temp directory Crunch chooses is a random 31-bit value:
> https://github.com/apache/crunch/blob/apache-crunch-0.11.0/crunch-core/src/main/java/org/apache/crunch/impl/dist/DistributedPipeline.java#L326
> I say 31-bit value because it comes from a 32-bit random integer but only includes positive values, thereby excluding 1 bit.
> The following blog post shows some probabilities for 32-bit hash collisions, which are essentially the same problem:
> http://preshing.com/20110504/hash-collision-probabilities/
> Since we're dealing with 31 bits instead of 32 the probabilities will be higher than expressed there for 32 bits. Even with 32 bits the probability of collision is 1 in 100 with just 9292 values.
> I have not done any thorough investigation to understand why, but in our production environment we have a lot of Crunch jobs and we are leaving 200-300 stray Crunch temp directories per day. Depending on how aggressive we get with a scheduled job to clean old stuff out of temp we could still have a realistic chance of hitting a collision.
> My proposal is to change the random integer component of the temp path to a UUID or something similar to make it drastically more unlikely that a collision will ever occur regardless of whether or not "/tmp" is ever cleaned up.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
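
For anyone who wants to check the collision figures quoted in the issue, here is a rough sketch using the standard birthday-bound approximation, p = 1 - exp(-n(n-1)/(2N)), for n random values drawn from a space of N = 2^bits. The class and method names are made up for illustration. The 9292 figure is from the issue. The 300 and 4200 counts are only assumed examples (about one day and about two weeks of strays at 300 per day). The 122-bit column assumes a random (version 4) UUID.

```java
final class CollisionOdds {

    /**
     * Birthday-bound approximation of the probability that at least two of
     * `count` uniformly random values from a space of 2^bits values coincide.
     */
    static double collisionProbability(long count, int bits) {
        double space = Math.pow(2.0, bits);
        double pairs = (double) count * (count - 1) / 2.0;
        // 1 - exp(-pairs / space); expm1 keeps precision when the result is tiny.
        return -Math.expm1(-pairs / space);
    }

    public static void main(String[] args) {
        // 9292 comes from the issue text; 300 and 4200 are illustrative assumptions.
        long[] counts = {300L, 4200L, 9292L};
        for (long n : counts) {
            System.out.printf("%d dirs: 31-bit=%.3g  32-bit=%.3g  122-bit=%.3g%n",
                    n,
                    collisionProbability(n, 31),
                    collisionProbability(n, 32),
                    collisionProbability(n, 122));
        }
    }
}
```

Note that this estimates the chance that any two names in a set collide, which is the framing the issue borrows from the hash-collision post. The chance that one new job hits one of n existing stray directories is the simpler n / 2^bits.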
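
The proposal in the issue (swap the random integer for a UUID) could look roughly like the following. This is a minimal sketch, not the attached patch: the class and method names are hypothetical, the "crunch-" prefix is inferred from the path in the stack trace, and the integer-based variant is only an approximation of the scheme the issue describes (a 32-bit random integer restricted to non-negative values).

```java
import java.util.Random;
import java.util.UUID;

final class TempDirNames {

    private static final Random RANDOM = new Random();

    /**
     * Approximation of the scheme described in the issue: a random int with
     * the sign bit masked off, leaving 31 random bits.
     */
    static String intBasedName() {
        return "crunch-" + (RANDOM.nextInt() & Integer.MAX_VALUE);
    }

    /**
     * Proposed alternative: a random (version 4) UUID, which carries
     * 122 random bits.
     */
    static String uuidBasedName() {
        return "crunch-" + UUID.randomUUID();
    }

    public static void main(String[] args) {
        System.out.println(intBasedName());
        System.out.println(uuidBasedName());
    }
}
```

The resulting name would still be resolved against the configured temp root as before; only the random component changes.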