Message-ID: <510BA23E.7020305@pubmatic.com>
Date: Fri, 1 Feb 2013 16:38:46 +0530
From: Jagadish Bihani <jagadish.bihani@pubmatic.com>
To: <user@flume.apache.org>
Subject: Re: In flume-ng are there any advantages of a 2-tier topology in a
 cluster of 30-40 nodes?
References: <5108B82F.3050201@pubmatic.com>
 <D73043CC-C99D-4CA9-9C03-96398BC625E6@gmail.com>
 <510931A5.6050801@pubmatic.com>
In-Reply-To: <510931A5.6050801@pubmatic.com>

Hi

Does anyone have any input on this?

To summarize the questions again:

-- In a cluster with a small number of nodes (say 30-50), is a 1-tier
architecture in Flume sufficient?
-- How does a 2-tier architecture help achieve better HA in such an
environment?
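
For concreteness, here is roughly what I understand the 2-tier suggestion
to mean on each first-tier agent. This is only a sketch under my own
assumptions: the agent name, sink and channel names, collector hostnames,
port and channel capacity are all made up for illustration, and the source
definition is omitted. It uses a sink group with the failover sink
processor, so the agent would switch to a second collector if the first
one becomes unreachable.

```properties
# Hypothetical first-tier agent "agent1"; every name, host and port below
# is a placeholder, not taken from a real deployment.
agent1.channels = c1
agent1.sinks = k1 k2
agent1.sinkgroups = g1

# Channel that buffers events while no collector is reachable.
agent1.channels.c1.type = memory
agent1.channels.c1.capacity = 10000

# Two Avro sinks, each pointing at a different second-tier collector.
agent1.sinks.k1.type = avro
agent1.sinks.k1.channel = c1
agent1.sinks.k1.hostname = collector1.example.com
agent1.sinks.k1.port = 4545

agent1.sinks.k2.type = avro
agent1.sinks.k2.channel = c1
agent1.sinks.k2.hostname = collector2.example.com
agent1.sinks.k2.port = 4545

# Failover sink processor: the sink with the higher priority (k1) is
# preferred; k2 takes over when k1 fails.
agent1.sinkgroups.g1.sinks = k1 k2
agent1.sinkgroups.g1.processor.type = failover
agent1.sinkgroups.g1.processor.priority.k1 = 10
agent1.sinkgroups.g1.processor.priority.k2 = 5
```

With processor.type = load_balance instead of failover, the same sink
group would spread events across both collectors. Even so, this only seems
to cover the failure of a collector, not the failure of HDFS or of a
first-tier node.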

Regards,
Jagadish

On 01/30/2013 08:13 PM, Jagadish Bihani wrote:
> Hi
>
> Thanks for the reply, Alexander.
> I have added my thoughts inline.
>
> On 01/30/2013 11:56 AM, Alexander Alten-Lorenz wrote:
>> Hi,
>>
>> If the agents (Tier 1) have access to HDFS, every client can write
>> to HDFS directly. But that doesn't really make sense; instead, you
>> want the files from different hosts laid out in a structured view
>> (perhaps one directory per host, with the contents split into buckets).
> -- But if the number of clients is small (say 30-40), why doesn't it
> make sense to write directly?
> After all, the ultimate purpose is to deliver the source data to HDFS
> (say, into a single HDFS directory).
>> When you implement a Tier 2 (maybe 2 or more servers that have access
>> to HDFS), you get more features, such as load balancing, HA and
>> mirrored sinks (for example, one sink puts the data into HDFS while
>> the other writes it to another system, perhaps for backup). For
>> stability and reliability, a Tier 2 architecture is recommended. And
>> it makes some things easier ;)
> -- I didn't get how we get HA and load balancing from 2 tiers. E.g.:
> 1. If HDFS goes down, then in both the 1-tier case and the 2-tier
> case the channel will grow until it reaches its maximum size.
> 2. In the 1-tier scenario, if a node goes down, its data won't reach
> HDFS.
> Similarly, in the 2-tier scenario, if a first-tier node goes down,
> its data won't reach HDFS.
>
> Could you please elaborate if I am missing something?
>>
>> Cheers,
>>   Alex
>>
>> On Jan 30, 2013, at 7:05 AM, Jagadish Bihani 
>> <jagadish.bihani@pubmatic.com> wrote:
>>
>>> Hi
>>>
>>> In our scenario there are around 30 machines from which we want to 
>>> put data into HDFS.
>>>
>>> Now the approach we thought of initially was:
>>>
>>> 1. First tier: agents that collect data from the source and pass it
>>> to an Avro sink.
>>> 2. Second tier: let's call these agents 'collectors'; they collect
>>> data from the first-tier agents and then write it to HDFS.
>>> (The second-tier agents are fewer in number, say 4:1.)
>>>
>>> Instead of the above topology, I could simply use an HDFS sink in
>>> the first-tier agents; that would serve the purpose.
>>> Also, since the number of nodes is small (say 30), it won't hurt the
>>> HDFS namenode much, compared to a case where the number of nodes is,
>>> say, 1000.
>>>
>>> But apart from that, I don't see any advantage in adding the 2nd tier.
>>> Is there any advantage I am missing in terms of failover, HDFS
>>> performance, or any other parameter?
>>>
>>> Regards,
>>> Jagadish
>> -- 
>> Alexander Alten-Lorenz
>> http://mapredit.blogspot.com
>> German Hadoop LinkedIn Group: http://goo.gl/N8pCF
>>
>