From: Andrei <faithlessfriend@gmail.com>
To: user@hadoop.apache.org
Date: Tue, 6 Aug 2013 09:11:01 +0300
Subject: Re: Large-scale collection of logs from multiple Hadoop nodes

We have similar requirements and built our log collection system around RSyslog and Flume. It is not in production yet, but tests so far look pretty good. We rejected the idea of using AMQP, since it introduces a large per-event overhead for log traffic.

You can probably use Flume interceptors to do real-time processing of your events, though I haven't tried anything like that myself. Alternatively, you can use Twitter Storm to handle your logs.
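To give an idea of what the interceptor route looks like, here is a minimal sketch against the Flume NG Interceptor API. To be clear, this is illustrative only: the class name and the severity-tagging logic are invented for the example, not something we actually run.

import java.util.ArrayList;
import java.util.List;

import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;

// Tags each event with a "severity" header that a downstream channel
// selector or sink could route on. A real syslog interceptor would
// parse the PRI field instead of grepping the body.
public class SeverityTagInterceptor implements Interceptor {

    public void initialize() {
        // no state to set up
    }

    public Event intercept(Event event) {
        String body = new String(event.getBody());
        event.getHeaders().put("severity",
                body.contains("ERROR") ? "error" : "info");
        return event; // returning null here would drop the event
    }

    public List<Event> intercept(List<Event> events) {
        List<Event> out = new ArrayList<Event>(events.size());
        for (Event e : events) {
            Event intercepted = intercept(e);
            if (intercepted != null) {
                out.add(intercepted);
            }
        }
        return out;
    }

    public void close() {
        // nothing to release
    }

    // Flume instantiates interceptors via a nested Builder.
    public static class Builder implements Interceptor.Builder {
        public Interceptor build() {
            return new SeverityTagInterceptor();
        }

        public void configure(Context context) {
            // no parameters in this sketch
        }
    }
}

You would attach it to a source in the agent config with something like a1.sources.r1.interceptors = i1 and a1.sources.r1.interceptors.i1.type = SeverityTagInterceptor$Builder (fully qualified).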
Anyway, I wouldn't recommend using Hadoop MapReduce for real-time processing of logs, and there's at least one important reason for this.

As you probably know, Flume sources obtain new events and put them into a channel, from which a sink then pulls them. If we are talking about the HDFS sink, it rolls to a new file on an interval (normally time-based, though you can also roll on the total size or count of events written). If this interval is large, you won't get real-time processing. And if it is small, Flume will produce a large number of small files in HDFS, say of size 10-100KB. HDFS handles that badly: it cannot store multiple files in a single block, so with the default 64MB block size each of those 10-100KB files claims its own block and its own NameNode metadata entry (tracked once per replica). The bytes on disk aren't padded out to 64MB, but millions of such files will exhaust NameNode memory and drown any MapReduce job in per-file overhead.
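To make that trade-off concrete, the relevant knobs sit on the HDFS sink. Below is a rough sketch of an agent config; the component names, port, path, and numbers are all invented, and you would tune the roll settings to your own latency-versus-file-size balance.

a1.sources = r1
a1.channels = c1
a1.sinks = k1

# rsyslog on each node would forward here, e.g. "*.* @@agent-host:5140" (TCP)
a1.sources.r1.type = syslogtcp
a1.sources.r1.host = 0.0.0.0
a1.sources.r1.port = 5140
a1.sources.r1.channels = c1

a1.channels.c1.type = memory
a1.channels.c1.capacity = 100000

a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = /flume/events/%Y-%m-%d
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.useLocalTimeStamp = true
# The trade-off discussed above: roll every 5 minutes or at ~128MB,
# whichever comes first; rollCount = 0 disables count-based rolling.
a1.sinks.k1.hdfs.rollInterval = 300
a1.sinks.k1.hdfs.rollSize = 134217728
a1.sinks.k1.hdfs.rollCount = 0

Shrinking rollInterval moves you toward real time at the cost of many more, much smaller files.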
Of course, you can use some ad-hoc solution, like deleting small files from time to time or combining them into a larger file (a rough sketch of such a compaction job is below), but monitoring such a system becomes much harder and may lead to unexpected results. So processing log events before they get to HDFS seems to be the better idea.
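For completeness, the "combine into a larger file" option is usually a small periodic job along these lines. Again, a sketch only: the paths and the size threshold are invented, and a production version would write to a temporary file and rename it into place only after the copy succeeds.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

// Merges small files under a log directory into one larger file.
public class LogCompactor {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path logDir = new Path("/flume/events");  // invented input dir
        Path merged = new Path("/flume/merged/" + System.currentTimeMillis() + ".log");

        FSDataOutputStream out = fs.create(merged);
        try {
            for (FileStatus status : fs.listStatus(logDir)) {
                // Only fold in files well below one HDFS block.
                if (!status.isDir() && status.getLen() < 8 * 1024 * 1024) {
                    FSDataInputStream in = fs.open(status.getPath());
                    try {
                        IOUtils.copyBytes(in, out, conf, false);
                    } finally {
                        in.close();
                    }
                    // Unsafe if the job dies before 'out' is closed;
                    // hence the temp-file-and-rename caveat above.
                    fs.delete(status.getPath(), false);
                }
            }
        } finally {
            out.close();
        }
    }
}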



On Tue, Aug 6, 2013 at 7:54 AM, Inder Pall <inder.pall@gmail.com> wrote:

> We have been using a Flume-like system for such use cases at significantly
> large scale and it has been working quite well.
>
> Would like to hear thoughts/challenges around using ZeroMQ-like systems
> at good enough scale.
>
> inder
> "you are the average of 5 people you spend the most time with"
>
> On Aug 5, 2013 11:29 PM, "Public Network Services"
> <publicnetworkservices@gmail.com> wrote:
>
>> Hi...
>>
>> I am facing a large-scale usage scenario of log collection from a Hadoop
>> cluster and examining ways as to how it should be implemented.
>>
>> More specifically, imagine a cluster that has hundreds of nodes, each of
>> which constantly produces Syslog events that need to be gathered and
>> analyzed at another point. The total amount of logs could be tens of
>> gigabytes per day, if not more, and the reception rate in the order of
>> thousands of events per second, if not more.
>>
>> One solution is to send those events over the network (e.g., using Flume)
>> and collect them in one or more (fewer than 5) nodes in the cluster, or in
>> another location, where the logs will be processed either by a continuously
>> running MapReduce job or by non-Hadoop servers running some log processing
>> application.
>>
>> Another approach could be to deposit all these events into a queuing
>> system like ActiveMQ or RabbitMQ, or whatever.
>>
>> In all cases, the main objective is to be able to do real-time log
>> analysis.
>>
>> What would be the best way of implementing the above scenario?
>>
>> Thanks!
>>
>> PNS