Subject: Large-scale collection of logs from multiple Hadoop nodes
From: Public Network Services <publicnetworkservices@gmail.com>
To: user@hadoop.apache.org
Date: Mon, 5 Aug 2013 10:58:57 -0700

Hi...

I am facing a large-scale log collection scenario on a Hadoop cluster and am examining how best to implement it.

More specifically, imagine a cluster of hundreds of nodes, each of which constantly produces Syslog events that need to be gathered and analyzed at another point. The total volume of logs could be tens of gigabytes per day, if not more, and the reception rate on the order of thousands of events per second, if not more.

One solution is to send those events over the network (e.g., using Flume) and collect them on a small number of nodes (fewer than 5), either in the cluster or in another location, where the logs would be processed either by a continuously running MapReduce job or by non-Hadoop servers running some log processing application. (A sketch of a per-node Flume agent follows below.)

Another approach could be to deposit all these events into a queuing system such as ActiveMQ or RabbitMQ (a rough producer sketch also follows below).

In all cases, the main objective is to be able to do real-time log analysis.
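For concreteness, here is a minimal sketch of the kind of per-node Flume agent I have in mind: a syslog UDP source feeding an Avro sink that forwards to a collector node. The agent name, collector hostname, ports, and capacities are placeholders, not a working setup:

# Per-node Flume agent: receive local syslog events, forward to a collector
agent.sources = syslog
agent.channels = mem
agent.sinks = collector

# Listen for syslog events over UDP (port is a placeholder)
agent.sources.syslog.type = syslogudp
agent.sources.syslog.host = 0.0.0.0
agent.sources.syslog.port = 5140
agent.sources.syslog.channels = mem

# In-memory channel; capacity would need tuning for thousands of events/sec
agent.channels.mem.type = memory
agent.channels.mem.capacity = 10000

# Forward to one of the collector nodes over Avro (hostname is a placeholder)
agent.sinks.collector.type = avro
agent.sinks.collector.hostname = collector1.example.com
agent.sinks.collector.port = 4545
agent.sinks.collector.channel = mem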
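For the queuing variant, each node would publish its events to a durable queue and the analysis side would consume at its own pace. A rough Java sketch using the RabbitMQ client, just to illustrate the idea (broker host, queue name, and the sample event are placeholders; error handling and batching are omitted):

import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;
import com.rabbitmq.client.MessageProperties;

public class SyslogPublisher {
    public static void main(String[] args) throws Exception {
        // Broker host is a placeholder
        ConnectionFactory factory = new ConnectionFactory();
        factory.setHost("mq.example.com");

        Connection conn = factory.newConnection();
        Channel channel = conn.createChannel();

        // Durable queue so events survive a broker restart
        channel.queueDeclare("syslog-events", true, false, false, null);

        // In reality each message would come from the local syslog feed
        String event = "<134>Aug  5 10:58:57 node42 app: example syslog event";
        channel.basicPublish("", "syslog-events",
                MessageProperties.PERSISTENT_TEXT_PLAIN,
                event.getBytes("UTF-8"));

        channel.close();
        conn.close();
    }
}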
What would be the best way of implementing the above scenario?

Thanks!

PNS