pig-dev mailing list archives

From "Erik Krogen (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (PIG-5290) User Cache upload contention can cause job failures
Date Fri, 11 Aug 2017 15:57:00 GMT

     [ https://issues.apache.org/jira/browse/PIG-5290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Erik Krogen updated PIG-5290:
-----------------------------
    Description: 
We recently enabled the User Cache (PIG-2672) feature and found that jobs occasionally
fail because of contention when uploading JARs into the cache. The cache is designed
to be fail-safe, i.e. to fall back to normal behavior if anything goes wrong, by catching
every {{IOException}}. However, the code that closes the output stream _is not_ wrapped in
a {{try}} statement, so an exception thrown while closing that stream causes the entire
job to fail. If multiple jobs attempt to upload the same JAR file simultaneously,
the contention can cause this close call to fail.
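
To illustrate the close problem, here is a minimal sketch (not the actual Pig code) of an
upload that stays fail-safe even when {{close()}} throws. The class {{FailSafeUploadSketch}},
the {{StreamOpener}} interface and the {{tryUpload}} method are hypothetical names invented for
this example. The idea is only that the stream is closed inside the same {{try}} that catches
{{IOException}}, so the caller can fall back to the normal (non-cached) path.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

final class FailSafeUploadSketch {
    /** Hypothetical stand-in for whatever opens the cache file's output stream. */
    interface StreamOpener {
        OutputStream open() throws IOException;
    }

    /**
     * Copies src into the stream supplied by opener.
     * Returns true on success and false on any IOException,
     * including one thrown by close().
     */
    static boolean tryUpload(InputStream src, StreamOpener opener) {
        // try-with-resources closes the stream before the catch clause runs,
        // so a failing close() is handled like any other upload failure.
        try (OutputStream out = opener.open()) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = src.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            return true;
        } catch (IOException e) {
            // Caller is expected to fall back to the non-cached behavior.
            return false;
        }
    }

    public static void main(String[] args) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        boolean uploaded = tryUpload(new ByteArrayInputStream(new byte[] {1, 2, 3}), () -> sink);
        System.out.println("uploaded=" + uploaded + ", bytes=" + sink.size());
    }
}
```

A {{false}} return here corresponds to "skip the cache for this JAR", rather than failing the job.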

The current strategy also has two other flaws. First, consider the scenario where job A begins
uploading jar X. Job B also needs jar X, sees that the file exists, and launches its tasks.
Yet, job A has not yet finished uploading jar X (perhaps it is large). So, the tasks are localizing
a half-completed version of jar X. Second, the original design allowed for the same JAR (identical
contents) to be shared between jobs even if a different name was used. In PIG-3815, however,
this ability was removed, and now JARs are only shared if they have the same name.

I propose we solve both of these issues simultaneously by returning to the {{listStatus}}-based
behavior (used prior to PIG-3815), but filtering out entries ending in {{.tmp}}. When uploading,
write to {{randomNumber.tmp}}, then once the upload is complete, rename the file to the original
name of the JAR file.
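
A rough sketch of this proposal follows. It is written against {{java.nio.file}} on a local
directory purely so that it is self-contained. That is an assumption of the example, not how
the cache works: a real patch would presumably use Hadoop's {{FileSystem.listStatus}} with a
{{PathFilter}} and {{FileSystem.rename}}. The class and method names ({{TmpRenameCacheSketch}},
{{findCompleted}}, {{upload}}) are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Optional;
import java.util.concurrent.ThreadLocalRandom;

final class TmpRenameCacheSketch {
    static final String TMP_SUFFIX = ".tmp";

    /** Returns any completed JAR in the checksum directory, skipping in-progress .tmp uploads. */
    static Optional<Path> findCompleted(Path checksumDir) throws IOException {
        if (!Files.isDirectory(checksumDir)) {
            return Optional.empty();
        }
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(checksumDir)) {
            for (Path p : entries) {
                if (!p.getFileName().toString().endsWith(TMP_SUFFIX)) {
                    return Optional.of(p);
                }
            }
        }
        return Optional.empty();
    }

    /** Writes to randomNumber.tmp first, then renames to the JAR's original name. */
    static Path upload(InputStream jar, Path checksumDir, String jarName) throws IOException {
        Files.createDirectories(checksumDir);
        long random = ThreadLocalRandom.current().nextLong(Long.MAX_VALUE);
        Path tmp = checksumDir.resolve(random + TMP_SUFFIX);
        try {
            Files.copy(jar, tmp);
            // Readers only ever see the final name once the contents are complete.
            return Files.move(tmp, checksumDir.resolve(jarName), StandardCopyOption.ATOMIC_MOVE);
        } finally {
            // No-op after a successful rename; cleans up after a failed upload.
            Files.deleteIfExists(tmp);
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("usercache");
        upload(new ByteArrayInputStream(new byte[] {1, 2, 3}), dir, "x.jar");
        System.out.println(findCompleted(dir));
    }
}
```

Because lookups ignore {{.tmp}} entries and return whatever completed file is present, a job
never localizes a half-written JAR, and a JAR uploaded under one name can be reused by a job
that refers to it by another.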

An alternative design is to use a single canonicalized name for all JAR files (they will still
be unique since they are inside of directories based on their SHA1). Upload to a tmp file
as previously described, then rename to the canonical name. This removes the need to do a
{{listStatus}} call; however, it will result in classpaths that are not human-readable, since
the name of the JAR file has been lost. From a debugging standpoint, I think it is worth going
with the first design.
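
For comparison, the cache path under each design might look as follows. This is only a sketch:
the canonical file name {{cached.jar}}, the cache root used in {{main}} and all class and method
names are made up for the example, and the real directory layout may differ.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class CanonicalNameSketch {
    /** Hypothetical canonical name shared by every cached JAR in the alternative design. */
    static final String CANONICAL_NAME = "cached.jar";

    /** Lower-case hex SHA-1 of the JAR contents, used as the directory name. */
    static String sha1Hex(byte[] content) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-1").digest(content);
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-1 is required to be available on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    /** First design: keep the original JAR name, so a listing is needed to find it. */
    static String originalNamePath(String cacheRoot, byte[] content, String jarName) {
        return cacheRoot + "/" + sha1Hex(content) + "/" + jarName;
    }

    /** Alternative design: the path is fully determined by the contents, no listing needed. */
    static String canonicalPath(String cacheRoot, byte[] content) {
        return cacheRoot + "/" + sha1Hex(content) + "/" + CANONICAL_NAME;
    }

    public static void main(String[] args) {
        byte[] content = "abc".getBytes(StandardCharsets.UTF_8);
        System.out.println(originalNamePath("/tmp/usercache", content, "my-udfs.jar"));
        System.out.println(canonicalPath("/tmp/usercache", content));
    }
}
```

The second path can be checked with a single existence test, but every entry on the classpath
then ends in the same file name, which is the readability cost mentioned above.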


> User Cache upload contention can cause job failures
> ---------------------------------------------------
>
>                 Key: PIG-5290
>                 URL: https://issues.apache.org/jira/browse/PIG-5290
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.13.0
>            Reporter: Erik Krogen



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
