crunch-dev mailing list archives

From "Tom White (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CRUNCH-575) DistributedPipeline temp dir choice can collide with itself
Date Mon, 19 Oct 2015 11:02:05 GMT

    [ https://issues.apache.org/jira/browse/CRUNCH-575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14963161#comment-14963161 ]

Tom White commented on CRUNCH-575:
----------------------------------

Certainly an improvement. +1

> DistributedPipeline temp dir choice can collide with itself
> -----------------------------------------------------------
>
>                 Key: CRUNCH-575
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-575
>             Project: Crunch
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 0.12.0
>            Reporter: Sean Owen
>            Assignee: Josh Wills
>            Priority: Minor
>         Attachments: CRUNCH_575.patch
>
>
> We've observed that Crunch jobs can fail because the output temp dir already exists:
> {code}
> 2015-04-02 04:45:49,208 INFO org.apache.crunch.hadoop.mapreduce.lib.jobcontrol.CrunchControlledJob: org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory /tmp/crunch-686245394/p2/output already exists
> at org.apache.hadoop.mapreduce.lib.output.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:132)
> {code}
> One possible cause is the choice of random directory name, which is based on a random
> nonnegative 32-bit int. The chance of collision is more than 50% at about 55,000 temp dirs,
> which is not unimaginable.
> A suggested fix, at least for that theoretical cause, is to generate a much larger random
> value. 64 bits should put this firmly in the realm of the extremely improbable (billions of
> temp dirs, not tens of thousands, before a collision becomes likely).
> (HT [~wilfreds] / CC [~tomwhite])
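
The figures quoted above can be checked with the usual birthday-bound approximation, n ≈ sqrt(2 · N · ln(1 / (1 − p))), where N is the number of possible names and p is the collision probability. The sketch below is only an illustration: the class `TempDirCollision` and both of its methods are hypothetical names, and the code is not taken from CRUNCH_575.patch. The only detail borrowed from the issue is the `crunch-` prefix seen in the log path. It assumes names are drawn uniformly at random.

```java
import java.security.SecureRandom;
import java.util.Random;

public class TempDirCollision {

  // Birthday-bound approximation: the number of uniform random draws from a
  // space of spaceSize values at which the probability of at least one
  // collision reaches p.
  static double drawsForCollisionProbability(double spaceSize, double p) {
    return Math.sqrt(2.0 * spaceSize * Math.log(1.0 / (1.0 - p)));
  }

  // Hypothetical helper, not the actual patch: builds a directory name from a
  // nonnegative 64-bit value (63 random bits once the sign bit is cleared).
  static String randomTempDirName(Random random) {
    long id = random.nextLong() & Long.MAX_VALUE; // clear the sign bit
    return "crunch-" + id;
  }

  public static void main(String[] args) {
    // A nonnegative 32-bit int has 2^31 possible values.
    double intSpace = Math.pow(2, 31);
    // A nonnegative 64-bit long has 2^63 possible values.
    double longSpace = Math.pow(2, 63);

    System.out.printf("31 bits, p=0.5: %.0f dirs%n",
        drawsForCollisionProbability(intSpace, 0.5));
    System.out.printf("63 bits, p=0.5: %.0f dirs%n",
        drawsForCollisionProbability(longSpace, 0.5));
    System.out.println(randomTempDirName(new SecureRandom()));
  }
}
```

Under these assumptions the 31-bit case comes out at roughly 55,000 directories and the 63-bit case at roughly 3.6 billion, which matches the "billions, not tens of thousands" estimate in the description. A wider random value only lowers the odds of a collision; it does not check whether the chosen directory already exists.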



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
