crunch-dev mailing list archives

From "Gabriel Reid (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CRUNCH-575) DistributedPipeline temp dir choice can collide with itself
Date Mon, 19 Oct 2015 12:24:05 GMT

    [ https://issues.apache.org/jira/browse/CRUNCH-575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14963217#comment-14963217
] 

Gabriel Reid commented on CRUNCH-575:
-------------------------------------

This issue (or one very similar to it) is discussed in CRUNCH-515. 

[~srowen] could you take a quick look at that one first, and see whether the underlying problem
you're encountering is the same as the one mentioned on that ticket (crashing
pipelines, or pipelines that weren't calling pipeline.done())? It would be good to have an
additional sample point to help determine whether making this change would just hide a different
issue (one that leads to a huge number of temp directories), or whether we are simply running into
the limits of 32 bits.

On the other hand, if we want to really avoid collisions (and if this isn't due to pipelines
which aren't correctly being cleaned up), maybe a UUID is (even) better than a long as a randomizer
in the temp dir name.
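
To make the three options concrete, here is a minimal sketch of how each randomizer could be turned into a temp dir name. The class name, method names, and the "/tmp/crunch-" prefix (modeled on the path in the log quoted below) are assumptions for illustration only; this is not the code in DistributedPipeline or in the attached patch.

```java
import java.util.Random;
import java.util.UUID;

// Hypothetical helper, not part of Crunch: compares three ways of
// building a randomized temp dir name.
final class TempDirNames {
    // Assumed prefix, modeled on the path seen in the reported log line.
    static final String PREFIX = "/tmp/crunch-";

    // Nonnegative 32-bit int: 31 random bits.
    static String fromInt(Random random) {
        return PREFIX + (random.nextInt() & Integer.MAX_VALUE);
    }

    // Nonnegative 64-bit long: at most 63 random bits.
    static String fromLong(Random random) {
        return PREFIX + (random.nextLong() & Long.MAX_VALUE);
    }

    // Random (version 4) UUID: 122 random bits.
    static String fromUuid() {
        return PREFIX + UUID.randomUUID();
    }

    public static void main(String[] args) {
        Random random = new Random();
        System.out.println(fromInt(random));
        System.out.println(fromLong(random));
        System.out.println(fromUuid());
    }
}
```

Whichever randomizer is used, a larger name space only lowers the odds of a collision; it does nothing about directories left behind by pipelines that crash or never call pipeline.done().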

> DistributedPipeline temp dir choice can collide with itself
> -----------------------------------------------------------
>
>                 Key: CRUNCH-575
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-575
>             Project: Crunch
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 0.12.0
>            Reporter: Sean Owen
>            Assignee: Josh Wills
>            Priority: Minor
>         Attachments: CRUNCH_575.patch
>
>
> We've observed that Crunch jobs can fail because the output temp dir already exists:
> {code}
> 2015-04-02 04:45:49,208 INFO org.apache.crunch.hadoop.mapreduce.lib.jobcontrol.CrunchControlledJob: org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory /tmp/crunch-686245394/p2/output already exists
> at org.apache.hadoop.mapreduce.lib.output.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:132)
> {code}
> One possible cause is the choice of random directory name, which is based on a random
> nonnegative 32-bit int. The chance of collision is more than 50% at about 55,000 temp dirs,
> which is not unimaginable.
> A suggested fix, at least for that theoretical cause, is to generate a much larger random
> value. 64 bits should put this firmly in the realm of the extremely improbable (billions, not
> tens of thousands).
> (HT [~wilfreds] / CC [~tomwhite])
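
The figures in the description can be checked with the usual birthday-problem approximation, p(n) ≈ 1 - exp(-n(n-1)/(2N)), where N is the number of possible names and n the number of temp dirs that exist at the same time. The sketch below assumes names are drawn uniformly and independently; the class and method names are made up for illustration.

```java
// Hypothetical helper: birthday-bound estimates for random temp dir names.
final class BirthdayBound {
    // Approximate probability that at least two of n names collide
    // when each is drawn uniformly from `space` possible values.
    static double collisionProbability(double n, double space) {
        return -Math.expm1(-n * (n - 1) / (2 * space));
    }

    // Approximate number of names at which the collision probability
    // reaches 50%: sqrt(2 * ln(2) * space).
    static double namesForEvenOdds(double space) {
        return Math.sqrt(2 * Math.log(2) * space);
    }

    public static void main(String[] args) {
        double space31 = Math.pow(2, 31); // nonnegative 32-bit int
        double space63 = Math.pow(2, 63); // nonnegative 64-bit long
        System.out.printf("p(55000 dirs, 31 bits) = %.4f%n",
                collisionProbability(55_000, space31));
        System.out.printf("50%% point, 31 bits: %.0f dirs%n",
                namesForEvenOdds(space31));
        System.out.printf("50%% point, 63 bits: %.3e dirs%n",
                namesForEvenOdds(space63));
    }
}
```

Under these assumptions the 50% point falls just below 55,000 names for 31 bits and at a few billion names for 63 bits, which is consistent with the numbers quoted in the description.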



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
