flume-user mailing list archives

From Jeff Lord <jl...@cloudera.com>
Subject Re: seeking help on flume cluster deployment
Date Fri, 10 Jan 2014 03:49:38 GMT
Chen,

Have you taken a look at this presentation on Planning and Deploying Flume
from ApacheCon?

http://archive.apachecon.com/na2013/presentations/27-Wednesday/Big_Data/11:45-Mastering_Sqoop_for_Data_Transfer_for_Big_Data-Arvind_Prabhakar/Arvind%20Prabhakar%20-%20Planning%20and%20Deploying%20Apache%20Flume.pdf

It may have the answers you need.

Best,

Jeff


On Thu, Jan 9, 2014 at 7:24 PM, Chen Wang <chen.apache.solr@gmail.com> wrote:

> Thanks Saurabh.
> If that is the case, I am actually thinking about using a Storm spout to
> talk to our socket servers, so that the Storm cluster takes care of the
> socket-reading part. Then on each Storm node I would start a Flume agent
> listening on an RPC port and writing to HDFS (with failover), and in the
> Storm bolt simply send the data to that RPC port so that Flume can pick
> it up.
> What do you think of this setup? It takes care of failover on both the
> source (via Storm) and the sink (via Flume), but it looks a little
> complicated to me.
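> Roughly, the Flume agent on each Storm node would then just be an Avro
> (RPC) source feeding an HDFS sink, something like the sketch below (the
> port, channel type and HDFS path are only placeholders):
>
> agent.sources = r1
> agent.channels = c1
> agent.sinks = k1
>
> # Avro source: the RPC port the Storm bolt sends events to
> agent.sources.r1.type = avro
> agent.sources.r1.bind = 0.0.0.0
> agent.sources.r1.port = 41414
> agent.sources.r1.channels = c1
>
> # file channel so buffered events survive an agent restart
> agent.channels.c1.type = file
>
> # HDFS sink writing the received events out
> agent.sinks.k1.type = hdfs
> agent.sinks.k1.channel = c1
> agent.sinks.k1.hdfs.path = hdfs://namenode/flume/events
> agent.sinks.k1.hdfs.fileType = DataStream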
> Chen
>
>
> On Thu, Jan 9, 2014 at 7:18 PM, Saurabh B <qna.list.141211@gmail.com> wrote:
>
>> Hi Chen,
>>
>> I don't think Flume has a way to coordinate multiple sources reading
>> from the same data source. You can certainly configure that, but you
>> will end up with duplicate data. Flume offers failover at the sink
>> level.
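>>
>> For example (just a sketch; the sink names and priorities are
>> placeholders), a failover sink group looks roughly like this:
>>
>> # two sinks in one group; events go to the higher-priority sink and
>> # fail over to the other one when it becomes unavailable
>> agent.sinkgroups = g1
>> agent.sinkgroups.g1.sinks = k1 k2
>> agent.sinkgroups.g1.processor.type = failover
>> agent.sinkgroups.g1.processor.priority.k1 = 10
>> agent.sinkgroups.g1.processor.priority.k2 = 5
>> agent.sinkgroups.g1.processor.maxpenalty = 10000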
>>
>> On Thu, Jan 9, 2014 at 6:56 PM, Chen Wang <chen.apache.solr@gmail.com> wrote:
>>
>>> Ok, so after more research :) it seems that what I need is failover
>>> for the agent source, not failover for the sink:
>>> if one agent dies, another agent of the same kind should start running.
>>> Does Flume support this scenario?
>>> Thanks,
>>> Chen
>>>
>>>
>>> On Thu, Jan 9, 2014 at 3:12 PM, Chen Wang <chen.apache.solr@gmail.com> wrote:
>>>
>>>> After reading more docs, it seems that if I want to achieve my goal, I
>>>> have to do the following:
>>>> 1. Run one agent with the custom source on one node. This agent reads
>>>> from those 5 socket servers and writes to some kind of sink (maybe
>>>> another socket?).
>>>> 2. On one or more other machines, set up collectors that read from the
>>>> agent's sink in step 1 and sink to HDFS (sketched below).
>>>> 3. Have a master node managing the nodes in steps 1 and 2.
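>>>>
>>>> Something along these lines, maybe (the collector host name, port and
>>>> the custom source class name are just placeholders):
>>>>
>>>> # tier 1: agent with the custom source, forwarding over Avro RPC
>>>> # (com.example.SocketPollingSource is a placeholder class name)
>>>> agent.sources = custom
>>>> agent.channels = c1
>>>> agent.sinks = toCollector
>>>> agent.sources.custom.type = com.example.SocketPollingSource
>>>> agent.sources.custom.channels = c1
>>>> agent.channels.c1.type = file
>>>> agent.sinks.toCollector.type = avro
>>>> agent.sinks.toCollector.channel = c1
>>>> agent.sinks.toCollector.hostname = collector-host
>>>> agent.sinks.toCollector.port = 41414
>>>>
>>>> # tier 2: collector receiving Avro RPC and writing to HDFS
>>>> collector.sources = in
>>>> collector.channels = c1
>>>> collector.sinks = toHdfs
>>>> collector.sources.in.type = avro
>>>> collector.sources.in.bind = 0.0.0.0
>>>> collector.sources.in.port = 41414
>>>> collector.sources.in.channels = c1
>>>> collector.channels.c1.type = file
>>>> collector.sinks.toHdfs.type = hdfs
>>>> collector.sinks.toHdfs.channel = c1
>>>> collector.sinks.toHdfs.hdfs.path = hdfs://namenode/flume/events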
>>>>
>>>> But that seems like overkill in my case: the agent in step 1 can
>>>> already sink to HDFS directly. Since the data becomes available at the
>>>> socket servers much faster than the translation part can consume it, I
>>>> want to be able to add more nodes later to do the translation job. So
>>>> what is the correct setup?
>>>> Thanks,
>>>> Chen
>>>>
>>>>
>>>>
>>>> On Thu, Jan 9, 2014 at 2:38 PM, Chen Wang <chen.apache.solr@gmail.com> wrote:
>>>>
>>>>> Guys,
>>>>> In my environment, the client side is 5 socket servers, so I wrote a
>>>>> custom source that spawns 5 threads, each reading from one of them
>>>>> continuously, and the sink is HDFS (a Hive table). This works fine
>>>>> when run as a single flume-ng agent.
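>>>>>
>>>>> (For reference, the agent config is roughly of this shape; the source
>>>>> class name and the HDFS path below are placeholders:)
>>>>>
>>>>> agent.sources = socketSrc
>>>>> agent.channels = c1
>>>>> agent.sinks = hdfsSink
>>>>>
>>>>> # placeholder class name for my 5-thread socket-reading source
>>>>> agent.sources.socketSrc.type = com.example.FiveSocketSource
>>>>> agent.sources.socketSrc.channels = c1
>>>>> agent.channels.c1.type = memory
>>>>> agent.sinks.hdfsSink.type = hdfs
>>>>> agent.sinks.hdfsSink.channel = c1
>>>>> agent.sinks.hdfsSink.hdfs.path = hdfs://namenode/warehouse/my_table
>>>>>
>>>>> # started with:
>>>>> # flume-ng agent --conf conf --conf-file agent.properties --name agent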
>>>>>
>>>>> But how can I deploy this in distributed mode (as a cluster)? I am
>>>>> confused about the 3 tiers (agent, collector, storage) mentioned in
>>>>> the docs. Do they apply to my case? How can I separate my agent,
>>>>> collector and storage tiers? Apparently I can only have one agent
>>>>> running: multiple agents would result in duplicates from the socket
>>>>> servers. But I want another agent to take over if one agent dies. I
>>>>> would also like horizontal scalability for writing to HDFS. How can I
>>>>> achieve all this?
>>>>>
>>>>> thank you very much for your advice.
>>>>> Chen
>>>>>
>>>>
>>>>
>>>
>>
>>
>
>
