tez-issues mailing list archives

From László Bodor (Jira) <j...@apache.org>
Subject [jira] [Updated] (TEZ-4097) Report localHostname in Fetcher failure log messages
Date Tue, 05 Nov 2019 14:00:08 GMT

     [ https://issues.apache.org/jira/browse/TEZ-4097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

László Bodor updated TEZ-4097:
------------------------------
    Description: 
Currently, a fetch failure is reported like this:
{code}
2019-11-05 02:50:35,972 [WARN] [Fetcher_B {Map_4} #1] |shuffle.Fetcher|: Fetch Failure from
host while connecting: other_host, attempt: InputAttemptIdentifier [inputIdentifier=1, attemptNumber=0,
pathComponent=attempt_1572936153637_0005_1_00_000000_0_10003, spillType=0, spillId=-1] Informing
ShuffleManager:
java.net.SocketTimeoutException: Read timed out
...
{code}

When debugging network, SSL, and similar issues on a cluster, it would be convenient to see the
local host's name in these messages (it is available in the fetcher as the localHostname property),
because in the logs collected by the YARN CLI it is not obvious at first sight which host logged the failure.
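
A minimal sketch of the idea, not the actual Tez change: the class FetchFailureMessage and its methods below are hypothetical names used only for illustration, and the message wording is an assumption rather than the format that was eventually committed. It only shows how a failure message might carry both the local and the remote host name.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

// Hypothetical helper for illustration only; this is not a Tez class.
class FetchFailureMessage {

    // Builds a failure message that names the local host as well as the remote one.
    // The wording is an assumption, loosely modelled on the log line quoted above.
    static String describeFailure(String localHostname, String remoteHost, String attempt) {
        return "Fetch Failure while connecting from " + localHostname
                + " to: " + remoteHost
                + ", attempt: " + attempt
                + " Informing ShuffleManager: ";
    }

    // Resolves the local host name with the JDK; falls back to a placeholder
    // when the name cannot be resolved.
    static String resolveLocalHostname() {
        try {
            return InetAddress.getLocalHost().getHostName();
        } catch (UnknownHostException e) {
            return "unknown-local-host";
        }
    }

    public static void main(String[] args) {
        String message = describeFailure(
                resolveLocalHostname(),
                "other_host",
                "InputAttemptIdentifier [inputIdentifier=1, attemptNumber=0]");
        System.out.println(message);
    }
}
```

In the fetcher itself the value would presumably come from the existing localHostname property rather than from a fresh lookup; the lookup here only keeps the sketch self-contained.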

The same applies to FetcherOrderedGrouped, which reports something like:
{code}
2019-11-05 03:13:11,046 [WARN] [Fetcher_O {Map_1} #0] |orderedgrouped.FetcherOrderedGrouped|:
Failed to verify reply after connecting to other_host:13562 with 1 inputs pending
javax.net.ssl.SSLHandshakeException: sun.security.validator.ValidatorException: PKIX path
building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find
valid certification path to requested target
{code}

  was:
Currently, a fetch failure is reported like this:
{code}
2019-11-05 02:50:35,972 [WARN] [Fetcher_B {Map_4} #1] |shuffle.Fetcher|: Fetch Failure from
host while connecting: *other_host*, attempt: InputAttemptIdentifier [inputIdentifier=1, attemptNumber=0,
pathComponent=attempt_1572936153637_0005_1_00_000000_0_10003, spillType=0, spillId=-1] Informing
ShuffleManager:
java.net.SocketTimeoutException: Read timed out
...
{code}

For debugging network/ssl/etc. issues on cluster, it would be convenient to see the local
host's name in these messages (which is present in the fetcher as localHostname property),
as in the logs collected by yarn cli, it's not obvious for the first sight.

The same applies to FetcherOrderedGrouped, which reports something like:
{code}
2019-11-05 03:13:11,046 [WARN] [Fetcher_O {Map_1} #0] |orderedgrouped.FetcherOrderedGrouped|:
Failed to verify reply after connecting to rizhangdebug10-2.gce.cloudera.com:13562 with 1
inputs pending
javax.net.ssl.SSLHandshakeException: sun.security.validator.ValidatorException: PKIX path
building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find
valid certification path to requested target
{code}


> Report localHostname in Fetcher failure log messages
> ----------------------------------------------------
>
>                 Key: TEZ-4097
>                 URL: https://issues.apache.org/jira/browse/TEZ-4097
>             Project: Apache Tez
>          Issue Type: Improvement
>            Reporter: László Bodor
>            Assignee: László Bodor
>            Priority: Minor
>
> Currently, a fetch failure is reported like this:
> {code}
> 2019-11-05 02:50:35,972 [WARN] [Fetcher_B {Map_4} #1] |shuffle.Fetcher|: Fetch Failure
> from host while connecting: other_host, attempt: InputAttemptIdentifier [inputIdentifier=1,
> attemptNumber=0, pathComponent=attempt_1572936153637_0005_1_00_000000_0_10003, spillType=0,
> spillId=-1] Informing ShuffleManager:
> java.net.SocketTimeoutException: Read timed out
> ...
> {code}
> When debugging network, SSL, and similar issues on a cluster, it would be convenient to see the
> local host's name in these messages (it is available in the fetcher as the localHostname property),
> because in the logs collected by the YARN CLI it is not obvious at first sight which host logged the failure.
> The same applies to FetcherOrderedGrouped, which reports something like:
> {code}
> 2019-11-05 03:13:11,046 [WARN] [Fetcher_O {Map_1} #0] |orderedgrouped.FetcherOrderedGrouped|:
> Failed to verify reply after connecting to other_host:13562 with 1 inputs pending
> javax.net.ssl.SSLHandshakeException: sun.security.validator.ValidatorException: PKIX
> path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to
> find valid certification path to requested target
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
