Message-ID: <510BA23E.7020305@pubmatic.com>
Date: Fri, 1 Feb 2013 16:38:46 +0530
From: Jagadish Bihani
Reply-To: user@flume.apache.org
Subject: Re: In flume-ng is there any advantages of 2-tier topology in a cluster of 30-40 nodes?
References: <5108B82F.3050201@pubmatic.com> <510931A5.6050801@pubmatic.com>
In-Reply-To: <510931A5.6050801@pubmatic.com>

Hi,

Does anyone have any input on this?
To summarize the questions again:

-- In a cluster with a small number of nodes (say 30-50), is it sufficient to use only a 1-tier architecture in Flume?
-- How does a 2-tier architecture help in getting better HA in the above environment?

Regards,
Jagadish

On 01/30/2013 08:13 PM, Jagadish Bihani wrote:
> Hi
>
> Thanks Alexander for the reply.
> I have added my thoughts inline.
>
> On 01/30/2013 11:56 AM, Alexander Alten-Lorenz wrote:
>> Hi,
>>
>> If the agents (Tier 1) have access to HDFS, each single client can
>> put data into HDFS. But this doesn't really make sense; instead you
>> want different files from different hosts in a structured view (maybe
>> a directory per host, with the contents inside split into buckets).
> -- But if the number of clients is small (say 30-40), why doesn't it make
> sense to write directly?
> Because ultimately the purpose is to deliver the source data to HDFS
> directly (say, in a single HDFS directory).
>> When you implement a Tier 2 (maybe 2 or more servers that have access
>> to HDFS), you can have more features like load balancing, HA and
>> mirrored sinks, for example (one sink puts the data into HDFS, the
>> other sink into another system, maybe for backup). For stability and
>> reliability a Tier 2 architecture is recommended. And it makes some things
>> easier ;)
> -- I didn't get the point of how we get HA and load balancing using 2
> tiers. E.g.:
> 1. If HDFS goes down, then in both the 1-tier case and the 2-tier
> case the channel will grow until it reaches its maximum size.
> 2. If in the 1-tier scenario one node goes down, then its data won't reach
> HDFS.
> Similarly, in the 2-tier scenario, if a node from the 1st tier goes down, then
> its data won't reach HDFS.
>
> Could you please elaborate if I am missing something?
>>
>> Cheers,
>> Alex
>>
>> On Jan 30, 2013, at 7:05 AM, Jagadish Bihani wrote:
>>
>>> Hi
>>>
>>> In our scenario there are around 30 machines from which we want to
>>> put data into HDFS.
>>>
>>> The approach we thought of initially was:
>>>
>>> 1. First tier: agents which collect data from the source and then pass it
>>> to an Avro sink.
>>> 2. Second tier: let's call those agents 'collectors', which collect
>>> data from the first-tier agents and then dump it to HDFS.
>>> (Second-tier agents are fewer in number, say 4:1.)
>>>
>>> Instead of the above topology, I could simply use an HDFS sink in the
>>> first-tier agents. It can serve the purpose.
>>> Also, the number of nodes is small (say 30), which won't hurt the HDFS
>>> namenode too much compared
>>> to if the number of nodes were, say, 1000.
>>>
>>> But apart from that I don't see any advantage of adding the 2nd tier.
>>> Is there any advantage I am missing in terms of failover, HDFS
>>> performance or any other parameter?
>>>
>>> Regards,
>>> Jagadish
>> --
>> Alexander Alten-Lorenz
>> http://mapredit.blogspot.com
>> German Hadoop LinkedIn Group: http://goo.gl/N8pCF
>>
>
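
For reference, the "load balancing, HA" point raised in the thread can be sketched as a first-tier Flume NG agent whose channel is drained by a failover sink group of two Avro sinks, each pointing at a different second-tier collector. This is only an illustrative fragment, not a configuration taken from the thread: the agent name (agent1), component names (src1, ch1, k1, k2, g1), collector hostnames, port 4545, and the tailed log path are all assumptions. It covers the case where one collector becomes unreachable; as noted in the thread, it does not help when the first-tier node itself goes down.

```properties
# Hypothetical tier-1 agent "agent1": one source, one channel, two Avro sinks.
agent1.sources = src1
agent1.channels = ch1
agent1.sinks = k1 k2
agent1.sinkgroups = g1

# Assumed source: tail an application log (the path is a placeholder).
agent1.sources.src1.type = exec
agent1.sources.src1.command = tail -F /var/log/app.log
agent1.sources.src1.channels = ch1

# File channel so buffered events survive an agent restart.
agent1.channels.ch1.type = file

# Each Avro sink forwards to a different tier-2 collector (placeholder hosts).
agent1.sinks.k1.type = avro
agent1.sinks.k1.channel = ch1
agent1.sinks.k1.hostname = collector1.example.com
agent1.sinks.k1.port = 4545

agent1.sinks.k2.type = avro
agent1.sinks.k2.channel = ch1
agent1.sinks.k2.hostname = collector2.example.com
agent1.sinks.k2.port = 4545

# Failover sink processor: k1 is preferred; k2 takes over if k1 fails.
agent1.sinkgroups.g1.sinks = k1 k2
agent1.sinkgroups.g1.processor.type = failover
agent1.sinkgroups.g1.processor.priority.k1 = 10
agent1.sinkgroups.g1.processor.priority.k2 = 5
agent1.sinkgroups.g1.processor.maxpenalty = 10000
```

Setting processor.type to load_balance instead of failover (and dropping the priority lines) would spread events across both collectors rather than keeping one as a standby. Each collector would then pair an Avro source listening on the assumed port with an HDFS sink.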