hadoop-yarn-issues mailing list archives

From "Zhijie Shen (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-2082) Support for alternative log aggregation mechanism
Date Thu, 22 May 2014 00:06:39 GMT

    [ https://issues.apache.org/jira/browse/YARN-2082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14005427#comment-14005427 ]

Zhijie Shen commented on YARN-2082:
-----------------------------------

Just thinking out loud: instead of building another HBase-based store to host the aggregated
logs, would it be possible to reuse the timeline store? I think the event stream data model
should be suitable in this case, and there is pending work to scale out the timeline store
with HBase as well (YARN-2032). An additional benefit is that the interfaces for publishing
and querying the data are ready, so we would just need to change the hook or wrap them in a log
aggregation plugin.
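
To make the idea above concrete, here is a minimal sketch of container logs modeled as an event stream behind a publish/query interface. It is an illustration only: `LogEvent`, `InMemoryEventStore`, and the entity type `"YARN_CONTAINER_LOG"` are hypothetical names, and the in-memory dict merely stands in for whatever backing store (for example HBase) would actually be used. It does not reproduce the real timeline store API.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class LogEvent:
    """One chunk of log output, treated as an event (hypothetical model)."""
    timestamp_ms: int
    event_type: str          # e.g. "stdout", "stderr", "syslog"
    info: Dict[str, str]     # free-form payload, e.g. {"text": "..."}


class InMemoryEventStore:
    """Stand-in for an event store; not the real timeline store."""

    def __init__(self) -> None:
        # (entity_type, entity_id) -> list of events
        self._events: Dict[Tuple[str, str], List[LogEvent]] = defaultdict(list)

    def publish(self, entity_type: str, entity_id: str, event: LogEvent) -> None:
        """Write path: append one log event for an entity (e.g. a container)."""
        self._events[(entity_type, entity_id)].append(event)

    def query(self, entity_type: str, entity_id: str, since_ms: int = 0) -> List[LogEvent]:
        """Read path: events for one entity at or after since_ms, oldest first."""
        events = self._events.get((entity_type, entity_id), [])
        selected = [e for e in events if e.timestamp_ms >= since_ms]
        return sorted(selected, key=lambda e: e.timestamp_ms)


if __name__ == "__main__":
    store = InMemoryEventStore()
    container = "container_1400716800000_0001_01_000001"  # made-up id
    store.publish("YARN_CONTAINER_LOG", container,
                  LogEvent(2000, "stderr", {"text": "second line"}))
    store.publish("YARN_CONTAINER_LOG", container,
                  LogEvent(1000, "stdout", {"text": "first line"}))
    for event in store.query("YARN_CONTAINER_LOG", container):
        print(event.timestamp_ms, event.event_type, event.info["text"])
```

In this shape, a log aggregation plugin would only need to call `publish` from the NM-side hook and `query` from the log viewer; whether the real interfaces map this directly is an open question for the design.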

> Support for alternative log aggregation mechanism
> -------------------------------------------------
>
>                 Key: YARN-2082
>                 URL: https://issues.apache.org/jira/browse/YARN-2082
>             Project: Hadoop YARN
>          Issue Type: New Feature
>            Reporter: Ming Ma
>
> I will post a more detailed design later. Here is a brief summary; I would like to
> get early feedback.
> Problem Statement:
> The current implementation of log aggregation creates one HDFS file for each {application,
> nodemanager} pair. These files are relatively small, in the range of 1-2 MB. In a large cluster
> with many applications and many nodemanagers, this ends up creating lots of small files in
> HDFS, which puts pressure on the HDFS NN in the following ways.
> 1. It increases NN memory usage. This is mitigated by having the history server delete old
> log files in HDFS.
> 2. Runtime RPC load on HDFS. Each log aggregation file introduces several NN RPCs, such
> as create, getAdditionalBlock, complete, and rename. When the cluster is busy, this RPC load
> affects NN performance.
> In addition, to support non-MR applications on YARN, we might need to support aggregation
> for long-running applications.
> Design choices:
> 1. Don't aggregate all the logs, as in YARN-221.
> 2. Create a dedicated HDFS namespace used only for log aggregation.
> 3. Write logs to a key-value store such as HBase. HBase's RPC load on the NN would be much
> lower.
> 4. Decentralize the application-level log aggregation to NMs. All logs for a given application
> are aggregated first by a dedicated NM before they are pushed to HDFS.
> 5. Have each NM aggregate logs on a regular basis; each of these log files will have data
> from different applications, and there needs to be some index for quick lookup.
> Proposal:
> 1. Make YARN log aggregation pluggable for both the read and write paths. Note that Hadoop
> FileSystem provides an abstraction, and we could ask alternative log aggregators to implement
> a compatible FileSystem, but that seems to be overkill.
> 2. Provide a log aggregation plugin that writes to HBase. The schema design needs to support
> efficient reads on a per-application as well as per-application+container basis; in addition,
> it shouldn't create hotspots in a cluster where certain users might create more jobs than others.
> For example, we can use hash($user+$applicationId) + containerid as the row key.
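
The row key suggested in the proposal can be sketched as follows. This is only one possible reading: the choice of MD5, the hex encoding, the `:` separator, and the function names `app_prefix` and `row_key` are assumptions made for illustration and are not specified in the issue.

```python
import hashlib


def app_prefix(user: str, application_id: str) -> str:
    """Hash of user + application id, used as the leading part of the row key.

    Hashing spreads applications across the key space, so a user who
    submits many jobs does not concentrate writes in one key range.
    MD5 is an arbitrary choice here; any well-distributed hash would do.
    """
    return hashlib.md5(f"{user}+{application_id}".encode("utf-8")).hexdigest()


def row_key(user: str, application_id: str, container_id: str) -> str:
    """Row key of the form hash($user+$applicationId) + containerid."""
    return f"{app_prefix(user, application_id)}:{container_id}"


if __name__ == "__main__":
    # Made-up identifiers, for illustration only.
    user = "alice"
    app = "application_1400716800000_0001"
    c1 = "container_1400716800000_0001_01_000001"
    c2 = "container_1400716800000_0001_01_000002"
    print(row_key(user, app, c1))
    print(row_key(user, app, c2))
```

Because every container of one application shares the same hashed prefix, a per-application read becomes a prefix scan over `app_prefix(user, app)`, while a per-container read is a single lookup on the full key. The trade-off is that keys are no longer ordered by user or submission time, so any query along those dimensions would need a separate index.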



--
This message was sent by Atlassian JIRA
(v6.2#6252)