flume-user mailing list archives

From Joao Salcedo <joao.salc...@gmail.com>
Subject Re: seeking help on flume cluster deployment
Date Fri, 10 Jan 2014 04:58:13 GMT
Hi Chen,

Maybe it would be worth checking this
http://flume.apache.org/FlumeDeveloperGuide.html#loadbalancing-rpc-client
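
A minimal sketch of using it from the flume-ng-sdk (untested; host names
and ports below are placeholders):

import java.nio.charset.Charset;
import java.util.Properties;
import org.apache.flume.api.RpcClient;
import org.apache.flume.api.RpcClientFactory;
import org.apache.flume.event.EventBuilder;

public class LoadBalancedSender {
  public static void main(String[] args) throws Exception {
    Properties props = new Properties();
    // Spread appends across several agents running avro sources.
    props.put("client.type", "default_loadbalance");
    props.put("hosts", "h1 h2 h3");
    props.put("hosts.h1", "flume-host1:41414");
    props.put("hosts.h2", "flume-host2:41414");
    props.put("hosts.h3", "flume-host3:41414");
    props.put("host-selector", "round_robin");
    props.put("backoff", "true"); // temporarily back off from failed hosts

    RpcClient client = RpcClientFactory.getInstance(props);
    try {
      client.append(EventBuilder.withBody("hello", Charset.forName("UTF-8")));
    } finally {
      client.close();
    }
  }
}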

Regards,

Joao


On Fri, Jan 10, 2014 at 3:50 PM, Jeff Lord <jlord@cloudera.com> wrote:

> Have you taken a look at the load balancing rpc client?
>
>
> On Thu, Jan 9, 2014 at 8:43 PM, Chen Wang <chen.apache.solr@gmail.com> wrote:
>
>> Jeff,
>> I have read this ppt at the beginning, but didn't find a solution to my
>> use case. To simplify: I only have 1 data source (composed of 5 socket
>> servers), and I am looking for a fault-tolerant deployment of Flume that
>> can read from this single data source and sink to HDFS in fault-tolerant
>> mode: when one node dies, another Flume node can pick up and continue.
>> Thanks,
>> Chen
>>
>>
>> On Thu, Jan 9, 2014 at 7:49 PM, Jeff Lord <jlord@cloudera.com> wrote:
>>
>>> Chen,
>>>
>>> Have you taken a look at this presentation on Planning and Deploying
>>> Flume from ApacheCon?
>>>
>>>
>>> http://archive.apachecon.com/na2013/presentations/27-Wednesday/Big_Data/11:45-Mastering_Sqoop_for_Data_Transfer_for_Big_Data-Arvind_Prabhakar/Arvind%20Prabhakar%20-%20Planning%20and%20Deploying%20Apache%20Flume.pdf
>>>
>>> It may have the answers you need.
>>>
>>> Best,
>>>
>>> Jeff
>>>
>>>
>>> On Thu, Jan 9, 2014 at 7:24 PM, Chen Wang <chen.apache.solr@gmail.com> wrote:
>>>
>>>> Thanks Saurabh.
>>>> If that is the case, I am actually thinking about using a Storm spout
>>>> to talk to our socket servers, so that the Storm cluster takes care of
>>>> the socket-reading part. Then on each Storm node, start a Flume agent
>>>> that listens on an RPC port and writes to HDFS (with failover). The
>>>> Storm bolt then simply sends the data over RPC so that Flume can pick
>>>> it up. What do you think of this setup? It takes care of failover both
>>>> on the source (by Storm) and on the sink (by Flume), but it looks a
>>>> little complicated to me.
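>>>>
>>>> Untested sketch of the bolt side (the Flume host/port and tuple field
>>>> are made up):
>>>>
>>>> import java.util.Map;
>>>> import backtype.storm.task.OutputCollector;
>>>> import backtype.storm.task.TopologyContext;
>>>> import backtype.storm.topology.OutputFieldsDeclarer;
>>>> import backtype.storm.topology.base.BaseRichBolt;
>>>> import backtype.storm.tuple.Tuple;
>>>> import org.apache.flume.api.RpcClient;
>>>> import org.apache.flume.api.RpcClientFactory;
>>>> import org.apache.flume.event.EventBuilder;
>>>>
>>>> public class FlumeRpcBolt extends BaseRichBolt {
>>>>   private transient RpcClient client;
>>>>   private OutputCollector collector;
>>>>
>>>>   @Override
>>>>   public void prepare(Map conf, TopologyContext ctx, OutputCollector out) {
>>>>     collector = out;
>>>>     // One avro endpoint; the load-balancing client would also work here.
>>>>     client = RpcClientFactory.getDefaultInstance("flume-host1", 41414);
>>>>   }
>>>>
>>>>   @Override
>>>>   public void execute(Tuple tuple) {
>>>>     try {
>>>>       client.append(EventBuilder.withBody(
>>>>           tuple.getString(0).getBytes("UTF-8")));
>>>>       collector.ack(tuple);
>>>>     } catch (Exception e) {
>>>>       collector.fail(tuple); // let Storm replay the tuple
>>>>     }
>>>>   }
>>>>
>>>>   @Override
>>>>   public void declareOutputFields(OutputFieldsDeclarer declarer) {}
>>>>
>>>>   @Override
>>>>   public void cleanup() {
>>>>     if (client != null) client.close();
>>>>   }
>>>> }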
>>>> Chen
>>>>
>>>>
>>>> On Thu, Jan 9, 2014 at 7:18 PM, Saurabh B <qna.list.141211@gmail.com> wrote:
>>>>
>>>>> Hi Chen,
>>>>>
>>>>> I think Flume doesn't have a way to configure multiple sources
>>>>> pointing to the same data source. Of course you can do that, but you
>>>>> will end up with duplicate data. Flume offers failover at the sink level.
>>>>>
>>>>> On Thu, Jan 9, 2014 at 6:56 PM, Chen Wang <chen.apache.solr@gmail.com> wrote:
>>>>>
>>>>>> OK, so after more researching :) it seems that what I need is
>>>>>> failover for the agent source (not failover for the sink):
>>>>>> if one agent dies, another agent of the same kind starts running.
>>>>>> Does Flume support this scenario?
>>>>>> Thanks,
>>>>>> Chen
>>>>>>
>>>>>>
>>>>>> On Thu, Jan 9, 2014 at 3:12 PM, Chen Wang <chen.apache.solr@gmail.com> wrote:
>>>>>>
>>>>>>> After reading more docs, it seems that if I want to achieve my
>>>>>>> goal, I have to do the following (rough config sketch below; class
>>>>>>> names, hosts, and ports are made up):
>>>>>>> 1. Have one agent with the custom source running on one node. This
>>>>>>> agent reads from those 5 socket servers and writes to some kind of
>>>>>>> sink (maybe another socket?).
>>>>>>> 2. On one or more other machines, set up collectors that read from
>>>>>>> the agent's sink in 1 and sink to HDFS.
>>>>>>> 3. Have a master node managing the nodes in 1 and 2.
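>>>>>>>
>>>>>>> # rough two-tier sketch; all names and ports are made up
>>>>>>> # --- tier 1: agent with the custom source, forwarding over avro ---
>>>>>>> agent1.sources = socketSrc
>>>>>>> agent1.channels = c1
>>>>>>> agent1.sinks = avroSink
>>>>>>> agent1.sources.socketSrc.type = com.example.SocketServerSource
>>>>>>> agent1.sources.socketSrc.channels = c1
>>>>>>> agent1.channels.c1.type = memory
>>>>>>> agent1.sinks.avroSink.type = avro
>>>>>>> agent1.sinks.avroSink.channel = c1
>>>>>>> agent1.sinks.avroSink.hostname = collector1
>>>>>>> agent1.sinks.avroSink.port = 41414
>>>>>>>
>>>>>>> # --- tier 2: collector receiving avro and writing to HDFS ---
>>>>>>> coll1.sources = avroSrc
>>>>>>> coll1.channels = c1
>>>>>>> coll1.sinks = hdfsSink
>>>>>>> coll1.sources.avroSrc.type = avro
>>>>>>> coll1.sources.avroSrc.bind = 0.0.0.0
>>>>>>> coll1.sources.avroSrc.port = 41414
>>>>>>> coll1.sources.avroSrc.channels = c1
>>>>>>> coll1.channels.c1.type = memory
>>>>>>> coll1.sinks.hdfsSink.type = hdfs
>>>>>>> coll1.sinks.hdfsSink.channel = c1
>>>>>>> coll1.sinks.hdfsSink.hdfs.path = hdfs://namenode:8020/flume/events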
>>>>>>>
>>>>>>> But it seems to be overkill in my case: in 1, I can already sink
>>>>>>> to HDFS. Since the data arrives at the socket servers much faster
>>>>>>> than the translation part can process it, I want to be able to
>>>>>>> later add more nodes to do the translation job. So what is the
>>>>>>> correct setup?
>>>>>>> Thanks,
>>>>>>> Chen
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Jan 9, 2014 at 2:38 PM, Chen Wang <chen.apache.solr@gmail.com> wrote:
>>>>>>>
>>>>>>>> Guys,
>>>>>>>> In my environment, the client is 5 socket servers, so I wrote a
>>>>>>>> custom source that spawns 5 threads, each reading from one of them
>>>>>>>> continuously; the sink is HDFS (a Hive table). This works fine
>>>>>>>> when run with the flume-ng agent command.
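>>>>>>>>
>>>>>>>> The launch command is roughly this (the conf path and agent name
>>>>>>>> are placeholders):
>>>>>>>>
>>>>>>>>   flume-ng agent --conf conf --conf-file conf/agent1.conf --name agent1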
>>>>>>>>
>>>>>>>> But how can I deploy this in distributed mode (a cluster)? I am
>>>>>>>> confused about the 3 tiers (agent, collector, storage) mentioned
>>>>>>>> in the doc. Do they apply to my case? How can I separate my
>>>>>>>> agent/collector/storage? Apparently I can only have one agent
>>>>>>>> running: multiple agents would result in duplicates from the
>>>>>>>> socket servers. But I want that if one agent dies, another agent
>>>>>>>> can take over. I would also like horizontal scalability for
>>>>>>>> writing to HDFS. How can I achieve all this?
>>>>>>>>
>>>>>>>> Thank you very much for your advice.
>>>>>>>> Chen
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Mailing List Archives,
>>>>> QnaList.com
>>>>>
>>>>
>>>>
>>>
>>
>
