hadoop-yarn-issues mailing list archives

From "Jason Lowe (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-1440) Yarn aggregated logs should be stored in a simpler format
Date Fri, 22 Nov 2013 21:42:35 GMT

    [ https://issues.apache.org/jira/browse/YARN-1440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13830335#comment-13830335 ]

Jason Lowe commented on YARN-1440:
----------------------------------

bq. My suggestion would be to simplify the log collection by collecting and writing the raw
log files into a directory structure as follows

I agree that approach would be simple, but it has a lot of issues at scale.  One of the biggest
issues with log aggregation on a large, busy cluster is the number of files it generates and
the write load it places on the namenode.   Storing the logs in HDFS 1-to-1 as they appear
in the container log directories on the nodes would be a *lot* of files.  Zillions of tiny
files are not something HDFS handles particularly well.  We already have to set the log retention
period lower than we'd like on some of our large, busy clusters due to the namespace pressure
from aggregated logs, and that is with aggregation already coalescing all of the logs for all
of an app's containers that ran on a particular node into a single file.
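
To make the file-count argument concrete, here is a rough back-of-the-envelope sketch in Python. The function names and the example numbers (1000 containers, 100 nodes, 3 log files per container) are hypothetical and only illustrate the ratio; directories and blocks, which also consume namenode memory, are ignored.

```python
def raw_layout_files(containers: int, logs_per_container: int) -> int:
    """Files created if every container log is stored 1-to-1 in HDFS."""
    return containers * logs_per_container


def per_node_layout_files(nodes_used: int) -> int:
    """Files created if each node writes one aggregated file per app."""
    return nodes_used


# Hypothetical application: 1000 containers spread over 100 nodes,
# each container producing three log files (e.g. stdout, stderr, syslog).
raw = raw_layout_files(containers=1000, logs_per_container=3)
aggregated = per_node_layout_files(nodes_used=100)
print(f"raw layout: {raw} files, per-node aggregation: {aggregated} files")
```

Under these assumed numbers the raw layout creates 30 times as many files for a single application, and each of those files is a separate create operation against the namenode.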

That being said, I totally agree the TFile format for aggregated logs is not very fun to wield
as a user.  I don't know the thought process that went into choosing it, but I suspect it
was a straightforward way to aggregate all of an app's logfiles on a node into a single file
in HDFS.

Maybe one way to get the benefit of both easy-to-access logs and less namespace pressure is
to go ahead and aggregate them as separate files but have a periodic process to archive logs
in a har to reduce the namespace.  That wouldn't address the significant additional write
load this approach would place on the namenode, however.
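
As a sketch of what the periodic archiving step might look like, the Python snippet below only builds the command line for the existing `hadoop archive` tool (`hadoop archive -archiveName <name>.har -p <parent> <src>* <dest>`); it does not run it. The helper name, the paths, and the application id are hypothetical, and nothing like this exists in YARN today as far as this thread indicates.

```python
import shlex


def har_command(archive_name, parent, sources, dest):
    """Build the argv for `hadoop archive`; the caller decides when to run it."""
    return ["hadoop", "archive", "-archiveName", archive_name,
            "-p", parent, *sources, dest]


# Hypothetical layout: one directory of raw logs per application under a
# per-user parent directory, archived into a separate destination directory.
cmd = har_command(
    archive_name="application_0001.har",
    parent="/app-logs/alice",
    sources=["application_0001"],
    dest="/app-logs-archive/alice",
)
print(shlex.join(cmd))
```

Note that creating a har runs a MapReduce job, and the original files still have to be deleted afterwards before the namespace is actually reduced, so the archiving pass adds its own load on top of the initial writes.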

> Yarn aggregated logs should be stored in a simpler format
> ---------------------------------------------------------
>
>                 Key: YARN-1440
>                 URL: https://issues.apache.org/jira/browse/YARN-1440
>             Project: Hadoop YARN
>          Issue Type: Improvement
>            Reporter: ledion bitincka
>              Labels: log-aggregation, logs, tfile, yarn
>
> The log aggregation feature in Yarn is awesome! However, the file type and format in
> which the log files are aggregated (TFile) should either be much simpler or be made pluggable.
> The current TFile format forces anyone who wants to see the files to either 
> a) use the web UI
> b) use the CLI tools (yarn logs)  or 
> c) write custom code to read the files 
> My suggestion would be to simplify the log collection by collecting and writing the raw
> log files into a directory structure as follows: 
> {noformat}
> /{log-collection-dir}/{app-id}/{container-id}/{log-file-name} 
> {noformat}
> This way the application developers can (re)use a much wider array of tools to process
> the logs. 
> For readers who are not familiar with the logs and their format, you can find more info
> in the following two blog posts:
> http://hortonworks.com/blog/simplifying-user-logs-management-and-access-in-yarn/
> http://blogs.splunk.com/2013/11/18/hadoop-2-0-rant/



--
This message was sent by Atlassian JIRA
(v6.1#6144)
