hadoop-mapreduce-user mailing list archives

From Alexander Lorenz <wget.n...@gmail.com>
Subject Re: Large-scale collection of logs from multiple Hadoop nodes
Date Wed, 07 Aug 2013 11:52:35 GMT

The approach with Flume is the most reliable workflow here, since Flume has a built-in Syslog
source as well as load balancing. On top, you can define multiple channels for different
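As a concrete illustration of this setup, a per-node Flume 1.x agent could look roughly like the following properties file. This is a minimal sketch, not a config from the thread: the agent/sink names, collector hostnames, and ports are all illustrative. Note that in Flume 1.x the load balancing described here is implemented by a sink group's processor rather than by the channel itself.

```properties
# Flume agent on each node: syslog in, load-balanced forwarding out.
# Names, hosts, and ports below are illustrative assumptions.
agent.sources    = sys
agent.channels   = mem
agent.sinks      = col1 col2
agent.sinkgroups = lb

# Built-in syslog TCP source
agent.sources.sys.type     = syslogtcp
agent.sources.sys.host     = 0.0.0.0
agent.sources.sys.port     = 5140
agent.sources.sys.channels = mem

agent.channels.mem.type     = memory
agent.channels.mem.capacity = 10000

# Two Avro sinks pointing at separate collector nodes
agent.sinks.col1.type     = avro
agent.sinks.col1.hostname = collector1.example.com
agent.sinks.col1.port     = 4545
agent.sinks.col1.channel  = mem

agent.sinks.col2.type     = avro
agent.sinks.col2.hostname = collector2.example.com
agent.sinks.col2.port     = 4545
agent.sinks.col2.channel  = mem

# Load-balance events across the collectors
agent.sinkgroups.lb.sinks              = col1 col2
agent.sinkgroups.lb.processor.type     = load_balance
agent.sinkgroups.lb.processor.selector = round_robin
```

The collector tier would run matching Avro sources and write to HDFS via the `hdfs` sink; for production the memory channel is often swapped for a file channel to survive agent restarts.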


sent via my mobile device


> On Aug 7, 2013, at 1:44 PM, 武泽胜 <wuzesheng@xiaomi.com> wrote:
> We have the same scenario as you described. The following is our solution, just FYI:
> We installed a local scribe agent on every node of our cluster, and we have several central
scribe servers. We extended log4j to support writing logs to the local scribe agent, and
the local scribe agents forward the logs to the central scribe servers; finally, the central
scribe servers write these logs to a specified HDFS cluster used for offline processing.
> Then we use Hive/Impala to analyse the collected logs.
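Once the central servers have landed the logs on HDFS, the Hive/Impala step above amounts to mapping a table over those files. A sketch, assuming tab-delimited log records partitioned by day (the table name, columns, and path are illustrative, not from the thread):

```sql
-- Map a Hive table over the log directory written by the scribe servers
CREATE EXTERNAL TABLE syslog_events (
  event_time STRING,
  host       STRING,
  severity   STRING,
  message    STRING
)
PARTITIONED BY (dt STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION 'hdfs://namenode/logs/syslog';

-- Example analysis: event counts per host for one day
SELECT host, COUNT(*) AS events
FROM syslog_events
WHERE dt = '2013-08-06'
GROUP BY host;
```

Each day's directory would be registered with `ALTER TABLE syslog_events ADD PARTITION (dt='...')` before querying; Impala can then query the same metastore table for lower-latency analysis.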
> From: Public Network Services <publicnetworkservices@gmail.com>
> Reply-To: "user@hadoop.apache.org" <user@hadoop.apache.org>
> Date: Tuesday, August 6, 2013 1:58 AM
> To: "user@hadoop.apache.org" <user@hadoop.apache.org>
> Subject: Large-scale collection of logs from multiple Hadoop nodes
> Hi...
> I am facing a large-scale usage scenario of log collection from a Hadoop cluster and
examining ways as to how it should be implemented.
> More specifically, imagine a cluster that has hundreds of nodes, each of which constantly
produces Syslog events that need to be gathered and analyzed at another point. The total amount
of logs could be tens of gigabytes per day, if not more, and the reception rate in the order
of thousands of events per second, if not more.
> One solution is to send those events over the network (e.g., using Flume) and collect
them in one or more (fewer than 5) nodes in the cluster, or in another location, where the
logs will be processed either by a continuously running MapReduce job, or by non-Hadoop servers
running some log-processing application.
> Another approach could be to deposit all these events into a queuing system like ActiveMQ
or RabbitMQ, or whatever.
> In all cases, the main objective is to be able to do real-time log analysis.
> What would be the best way of implementing the above scenario?
> Thanks!
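Whichever transport is chosen (Flume, scribe, or a message queue), the real-time analysis stage ends up consuming raw Syslog lines. A minimal parsing sketch in Python, assuming RFC 3164-formatted events; the function and field names are illustrative, not part of any of the tools discussed above:

```python
import re

# Rough RFC 3164 shape: "<PRI>Mmm dd hh:mm:ss host message..."
SYSLOG_RE = re.compile(
    r'^<(?P<pri>\d{1,3})>'                              # priority value
    r'(?P<timestamp>\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2}) '  # e.g. "Aug  7 11:52:35"
    r'(?P<host>\S+) '                                    # originating host
    r'(?P<msg>.*)$'                                      # free-form message
)

def parse_syslog(line):
    """Parse one RFC 3164-style syslog line into a dict, or None on mismatch."""
    m = SYSLOG_RE.match(line)
    if m is None:
        return None
    pri = int(m.group('pri'))
    return {
        'facility': pri >> 3,   # upper bits of PRI encode the facility
        'severity': pri & 0x7,  # lowest 3 bits encode the severity
        'timestamp': m.group('timestamp'),
        'host': m.group('host'),
        'message': m.group('msg'),
    }

event = parse_syslog('<34>Aug  7 11:52:35 node42 sshd[1042]: Failed password for root')
```

For PRI 34 this yields facility 4 (auth) and severity 2 (critical); a consumer attached to the collector tier or queue would run this per event and feed the result to whatever aggregation the analysis needs.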
