hadoop-yarn-issues mailing list archives

From "Wangda Tan (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-6875) New aggregated log file format for YARN log aggregation.
Date Mon, 31 Jul 2017 19:23:00 GMT

    [ https://issues.apache.org/jira/browse/YARN-6875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16107821#comment-16107821 ]

Wangda Tan commented on YARN-6875:

Thanks [~jlowe], 

bq. Quite a few important points to note here:
#1 and #2 are true; however, the original goal of this JIRA is not to be just slightly
better than the old format.

#3 does not hold when an append fails.

For example, suppose we have a file that has been appended to 3 times (that is, partial log
aggregation ran 3 times). The file looks like:

On the 4th append, the write fails midway (for example, because of an NM failure).

When we need to read the logs, we have to scan all the way back to index-3. Depending on how
much was written for Data-4, this could be costly.
Worse, if Data-4 is never repaired for some reason, every future read of the app log has
to reverse-search for index-3 again.
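
To illustrate the cost described above, here is a minimal Python sketch of the reverse
search. It assumes, purely for illustration, that every index ends with a fixed marker
(the name INDEX_MAGIC and its value are hypothetical, not part of any YARN format) and
that the reader walks backward from the end of the file in chunks. The number of bytes
scanned grows with the size of the partial Data-4.

```python
import io

INDEX_MAGIC = b"\x00IDXEND\x00"  # hypothetical marker that closes each index
CHUNK = 4096                     # bytes read per backward step


def find_last_index_end(f, chunk=CHUNK):
    """Scan backward for the last INDEX_MAGIC.

    Returns (offset just past the marker or -1, number of bytes scanned).
    """
    f.seek(0, io.SEEK_END)
    pos = f.tell()
    scanned = 0
    tail = b""  # first bytes of the previous chunk, to catch straddling markers
    while pos > 0:
        start = max(0, pos - chunk)
        f.seek(start)
        buf = f.read(pos - start) + tail
        scanned += pos - start
        hit = buf.rfind(INDEX_MAGIC)
        if hit != -1:
            return start + hit + len(INDEX_MAGIC), scanned
        tail = buf[:len(INDEX_MAGIC) - 1]
        pos = start
    return -1, scanned
```

With a clean file the marker is found in the first chunk; with a large partial Data-4 the
reader has to walk through all of it first, and it has to do so on every read until the
file is repaired.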

I have another solution in mind, in addition to Jason's earlier suggestion:

Each time we append logs during partial log aggregation, we write a UUID + block_id marker
every N bytes (for example, N = 64MB). The data looks like:

If an append fails for some reason, the reader searches backward for the last UUID+block_id.
For example:

Suppose the last UUID+block_id is UUID_x_y. We then know that the corrupted data has y more
blocks in front of that position, so the reader can skip back y * (BLOCK_SIZE + UUID_SIZE)
bytes in one step, which is better than scanning the blocks one by one.
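
The skip computation can be sketched in Python as follows. This is only an illustration
under several assumptions that are not in the proposal itself: the marker is a 16-byte UUID
followed by a 4-byte big-endian block_id, UUID_SIZE names the size of the whole marker, a
marker is written in front of every block, the reader already knows the UUID of the append,
and BLOCK_SIZE is tiny so the example is easy to follow. The helper names are hypothetical.

```python
import struct
import uuid

BLOCK_SIZE = 64      # demo value; the proposal mentions something like 64MB
UUID_SIZE = 16 + 4   # assumed marker size: 16-byte UUID + 4-byte block_id


def write_append(payload: bytes, run_id: uuid.UUID) -> bytes:
    """Lay out one append as [marker][block][marker][block]..."""
    out = bytearray()
    for block_id, off in enumerate(range(0, len(payload), BLOCK_SIZE)):
        out += run_id.bytes + struct.pack(">I", block_id)
        out += payload[off:off + BLOCK_SIZE]
    return bytes(out)


def start_of_failed_append(data: bytes, run_id: uuid.UUID) -> int:
    """Return the offset where the failed append begins, or -1 if no marker."""
    end = len(data)
    while True:
        pos = data.rfind(run_id.bytes, 0, end)
        if pos == -1:
            return -1
        if pos + UUID_SIZE <= len(data):
            (block_id,) = struct.unpack_from(">I", data, pos + 16)
            # block_id full blocks (each with its marker) precede this marker
            return pos - block_id * (BLOCK_SIZE + UUID_SIZE)
        end = pos  # marker itself was cut off; fall back to the previous one
```

For instance, if 500 bytes of good data are followed by an append that was cut off inside
its third block, the last complete marker carries block_id 2, and subtracting
2 * (BLOCK_SIZE + UUID_SIZE) from its position lands on offset 500 without visiting the
earlier blocks.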

Thoughts? [~xgong].

> New aggregated log file format for YARN log aggregation.
> --------------------------------------------------------
>                 Key: YARN-6875
>                 URL: https://issues.apache.org/jira/browse/YARN-6875
>             Project: Hadoop YARN
>          Issue Type: New Feature
>            Reporter: Xuan Gong
>            Assignee: Xuan Gong
>         Attachments: YARN-6875-NewLogAggregationFormat-design-doc.pdf
> T-file is the underlying log format for the aggregated logs in YARN. We have seen several
> performance issues, especially for very large log files.
> We will introduce a new log format which has better performance for large log files.

