Subject: Large-scale collection of logs from multiple Hadoop nodes
From: Public Network Services <publicnetworkservices@gmail.com>
To: user@hadoop.apache.org
Date: Mon, 5 Aug 2013 10:58:57 -0700

Hi...

I am facing a large-scale log collection scenario on a Hadoop cluster and am examining how best to implement it.

More specifically, imagine a cluster of hundreds of nodes, each of which constantly produces Syslog events that need to be gathered and analyzed at another point. The total volume of logs could be tens of gigabytes per day, if not more, and the reception rate on the order of thousands of events per second, if not more.

One solution is to send those events over the network (e.g., using Flume) and collect them on a small number of nodes (fewer than 5), either in the cluster or in another location, where the logs would be processed either by a continuously running MapReduce job or by non-Hadoop servers running some log processing application. (A sketch of a per-node Flume agent follows below.)

Another approach could be to deposit all these events into a queuing system such as ActiveMQ or RabbitMQ (a rough producer sketch also follows below).

In all cases, the main objective is to be able to do real-time log analysis.
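For concreteness, here is a minimal sketch of the kind of per-node Flume agent I have in mind: a syslog UDP source feeding an Avro sink that forwards to a collector node. The agent name, collector hostname, ports, and capacities are placeholders, not a working setup:

# Per-node Flume agent: receive local syslog events, forward to a collector
agent.sources = syslog
agent.channels = mem
agent.sinks = collector

# Listen for syslog events over UDP (port is a placeholder)
agent.sources.syslog.type = syslogudp
agent.sources.syslog.host = 0.0.0.0
agent.sources.syslog.port = 5140
agent.sources.syslog.channels = mem

# In-memory channel; capacity would need tuning for thousands of events/sec
agent.channels.mem.type = memory
agent.channels.mem.capacity = 10000

# Forward to one of the collector nodes over Avro (hostname is a placeholder)
agent.sinks.collector.type = avro
agent.sinks.collector.hostname = collector1.example.com
agent.sinks.collector.port = 4545
agent.sinks.collector.channel = mem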
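For the queuing variant, each node would publish its events to a durable queue and the analysis side would consume at its own pace. A rough Java sketch using the RabbitMQ client, just to illustrate the idea (broker host, queue name, and the sample event are placeholders; error handling and batching are omitted):

import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;
import com.rabbitmq.client.MessageProperties;

public class SyslogPublisher {
    public static void main(String[] args) throws Exception {
        // Broker host is a placeholder
        ConnectionFactory factory = new ConnectionFactory();
        factory.setHost("mq.example.com");

        Connection conn = factory.newConnection();
        Channel channel = conn.createChannel();

        // Durable queue so events survive a broker restart
        channel.queueDeclare("syslog-events", true, false, false, null);

        // In reality each message would come from the local syslog feed
        String event = "<134>Aug  5 10:58:57 node42 app: example syslog event";
        channel.basicPublish("", "syslog-events",
                MessageProperties.PERSISTENT_TEXT_PLAIN,
                event.getBytes("UTF-8"));

        channel.close();
        conn.close();
    }
}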
What would be the best way of implementing the above scenario?

Thanks!

PNS