hadoop-yarn-issues mailing list archives

From "Jason Lowe (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-4705) ATS 1.5 parse pipeline to consider handling open() events recoverably
Date Tue, 23 Feb 2016 14:10:18 GMT

    [ https://issues.apache.org/jira/browse/YARN-4705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15158913#comment-15158913 ]

Jason Lowe commented on YARN-4705:

bq. One RPC call to check the file size shouldn't be a big problem in general.

As I mentioned above, we _cannot_ rely on the file size to be accurate.  The file is being
actively written, and there's no guarantee the file size will be updated in a timely manner
after data is written.  There can be data in the file for hours while the reported file size
is still zero.  In HDFS the size is only updated when the next block is allocated, so it could
sit at a file size of 0 for a very long time (depending upon how fast the writer is going)
until it suddenly jumps to the block size when the writer passes the first block boundary.
The only real way to know how much data is in the file is to read it -- we cannot rely on
what the namenode reports.
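
To make the namenode-vs-datanode distinction concrete, here is a minimal sketch (not the ATS
code; the path and class name are made up) of how the reported length and the readable data
can disagree for a file that is still under construction:

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LengthVsReadable {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path log = new Path("/ats/active/app_1234_0001/summarylog");  // hypothetical path

    // The namenode only learns the length when a block is allocated/completed,
    // so this can report 0 even though the writer has already flushed data.
    long reportedLen = fs.getFileStatus(log).getLen();

    // Reading from the datanodes is the only reliable way to see what is there.
    long readable = 0;
    try (FSDataInputStream in = fs.open(log)) {
      byte[] buf = new byte[64 * 1024];
      for (int n = in.read(buf); n > 0; n = in.read(buf)) {
        readable += n;
      }
    }
    System.out.println("namenode says " + reportedLen + ", actually readable " + readable);
  }
}
{code}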

bq. After a scan of an empty file/failed parse, it gets loaded again, next scan round? Or
is it removed from the scan list?

The file is always revisited, errors or not, on the next scan round as long as the application
is active.  It opens the file, seeks to the last successfully read byte offset, and tries
to read more.  If data is successfully read then it updates the byte offset for the next round;
rinse, repeat.
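
A rough sketch of that resume-from-offset loop, with illustrative names only (the real logic
lives in the ATS 1.5 plugin-storage code):

{code:java}
// Illustrative only: resume reading from where the last successful parse ended.
long offset = lastGoodOffset;              // persisted from the previous scan round
try (FSDataInputStream in = fs.open(logPath)) {
  in.seek(offset);                         // skip what was already parsed
  // ... parse as many complete entities as are available from 'in' ...
  // after each entity parses cleanly, remember where it ended:
  offset = in.getPos();
}
lastGoodOffset = offset;                   // the next round picks up from here
{code}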

bq. Really a failure to parse the JSON or an empty file should be treated the same: try later
if the file size increases

Again, we cannot rely on the file size to be updated.  To reduce load on the namenode, the
writer is simply pushing the data out to the datanode -- it's not also making an RPC call
to the namenode to update the filesize.  The only actors involved are the writer, the datanode,
and the reader.  The namenode is oblivious to what's happening until the next block is allocated,
which could take a really long time if the writer is writing slowly.  Note that for these
files a slow writer is not a rare case, as it only writes when tasks change state.
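
On the writer side the pattern looks roughly like the sketch below (again illustrative, not
the actual timeline client code): hflush() pushes the bytes through the datanode pipeline so
readers can see them, but it avoids a namenode round trip, so the reported length stays stale:

{code:java}
// Illustrative writer: data becomes visible to readers without the namenode
// ever being told the new length.
try (FSDataOutputStream out = fs.create(logPath)) {
  out.write(entityJsonBytes);   // bytes go to the datanode pipeline
  out.hflush();                 // readers can now see them...
  // ...but fs.getFileStatus(logPath).getLen() may still report 0
  // until the first block boundary is crossed.
}
{code}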

I agree we need to handle this better, probably by making the error a bit less scary in the
logs.
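
For example (purely a sketch of one possible shape, not a proposed patch; parseEntities and
LOG are hypothetical), the reader could catch the parse failure and log it at a lower level
while the application is known to be running:

{code:java}
try {
  parseEntities(in);                       // hypothetical parse helper
} catch (JsonParseException e) {
  // Expected while the writer is mid-record; retry on the next scan round
  // instead of emitting a full stack trace at WARN/ERROR.
  LOG.debug("Incomplete entity data in " + logPath + ", will retry next scan", e);
}
{code}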

> ATS 1.5 parse pipeline to consider handling open() events recoverably
> ---------------------------------------------------------------------
>                 Key: YARN-4705
>                 URL: https://issues.apache.org/jira/browse/YARN-4705
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: timelineserver
>    Affects Versions: 2.8.0
>            Reporter: Steve Loughran
>            Priority: Minor
> During one of my own timeline test runs, I've been seeing a stack trace warning that
> the CRC check failed in the Filesystem.open() call; something the FS was ignoring.
> Even though it's swallowed (and probably not the cause of my test failure), looking at
> the code in {{LogInfo.parsePath()}}, it considers a failure to open a file as unrecoverable.
> On some filesystems this may not be the case, i.e. if it's open for writing it may not
> be available for reading; checksums may be a similar issue.
> Perhaps a failure at open() should be viewed as recoverable while the app is still running?
