hadoop-yarn-issues mailing list archives

From "Steve Loughran (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-4705) ATS 1.5 parse pipeline to consider handling open() events recoverably
Date Tue, 23 Feb 2016 15:09:18 GMT

    [ https://issues.apache.org/jira/browse/YARN-4705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15159012#comment-15159012
] 

Steve Loughran commented on YARN-4705:
--------------------------------------

OK. So HDFS guarantees flush but makes no guarantees about modtime or size propagation. In contrast,
the local file:// FS keeps FileStatus.length consistent with the actual length, but doesn't flush
when told to, so it can delay its writes until a CRC-worth of data has been written, and there
is no obvious way to turn this off for testing via config files.

On HDFS, then: an empty file's length can't be interpreted as a reason to skip it, so failures to
read are an error. An attempt must be made to read it, but an EOFException or similar is
not a failure. That is: you can't skip on empty, you can only swallow the failure. Maybe list
the exception and the current file status value at DEBUG. Or just attempt to read() byte 0 after
opening the file; an EOFException means "still empty".
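
A minimal sketch of that probe, written against plain java.io so it runs without a cluster. The class name EmptyFileProbe, the Probe enum and the probe() method are hypothetical names for illustration, not anything in the ATS code. It assumes the stream returned by FileSystem.open() can be treated as a java.io.DataInputStream (FSDataInputStream extends it), and it has not been tested against HDFS.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;

public class EmptyFileProbe {

    /** Outcome of probing a freshly opened file (hypothetical type). */
    public enum Probe { HAS_DATA, STILL_EMPTY }

    /**
     * Try to read byte 0 rather than trusting the reported file length.
     * An EOFException is treated as "still empty", not as a failure;
     * any other IOException propagates to the caller as a real error.
     */
    public static Probe probe(DataInputStream in) throws IOException {
        try {
            in.readByte(); // throws EOFException if the stream has no bytes
            return Probe.HAS_DATA;
        } catch (EOFException e) {
            // Assumed policy: this is where a DEBUG log of the exception
            // and the current file status value would go.
            return Probe.STILL_EMPTY;
        }
    }

    public static void main(String[] args) throws IOException {
        // In-memory streams stand in for opened files in this sketch.
        DataInputStream empty =
            new DataInputStream(new ByteArrayInputStream(new byte[0]));
        DataInputStream written =
            new DataInputStream(new ByteArrayInputStream(new byte[] {42}));
        System.out.println(probe(empty));
        System.out.println(probe(written));
    }
}
```

The probe consumes the first byte, so a real parser would have to seek back to 0 or reopen the file before parsing.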

That essentially means that until a switch to turn off the delayed writes is provided, you cannot
use the local FS as a back end for ATS 1.5, even for testing. Or at least, you can write with it,
but the data won't be guaranteed to be visible until close() is called. You may not get any view
of incomplete apps, which is precisely what I'm seeing.

If this is the case, then that's something ATS1.5 can't fix: it will have to be in the documentation.



> ATS 1.5 parse pipeline to consider handling open() events recoverably
> ---------------------------------------------------------------------
>
>                 Key: YARN-4705
>                 URL: https://issues.apache.org/jira/browse/YARN-4705
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: timelineserver
>    Affects Versions: 2.8.0
>            Reporter: Steve Loughran
>            Priority: Minor
>
> During one of my own timeline test runs, I've been seeing a stack trace warning that
> the CRC check failed in FileSystem.open() on a file; something the FS was ignoring.
> Even though it's swallowed (and probably not the cause of my test failure), looking at
> the code in {{LogInfo.parsePath()}} shows that it considers a failure to open a file as unrecoverable.
>
> On some filesystems, this may not be the case, i.e. if it's open for writing it may not
> be available for reading; checksums may be a similar issue.
> Perhaps a failure at open() should be viewed as recoverable while the app is still running?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
