flume-user mailing list archives

From Chen Wang <chen.apache.s...@gmail.com>
Subject Re: seeking help on flume cluster deployment
Date Fri, 10 Jan 2014 05:20:02 GMT
Jeff, Joao,
Thanks for the pointer!
I think I am getting close here:
1. Set up a cluster of Flume agents with redundancy, with an Avro source and
an HDFS sink.
2. Use Storm (not strictly necessary) to read from our socket servers; then,
in the bolt, use the Flume load-balancing RPC client to send the events to
the agents set up in step 1.

That way I get the benefits of both Storm and Flume. Does this setup look
right to you?
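
For concreteness, here is a rough, untested sketch of the bolt side of step
2, using the load-balancing RPC client from the Flume SDK (the class name,
host names, and ports below are just placeholders):

import java.nio.charset.Charset;
import java.util.Properties;

import org.apache.flume.Event;
import org.apache.flume.EventDeliveryException;
import org.apache.flume.api.RpcClient;
import org.apache.flume.api.RpcClientFactory;
import org.apache.flume.event.EventBuilder;

public class BoltToFlume {
    public static void main(String[] args) throws EventDeliveryException {
        // The load-balancing RPC client spreads appends across the
        // Avro-source agents set up in step 1.
        Properties props = new Properties();
        props.setProperty("client.type", "default_loadbalance");
        props.setProperty("hosts", "h1 h2");
        props.setProperty("hosts.h1", "agent1.example.com:41414"); // placeholder agents
        props.setProperty("hosts.h2", "agent2.example.com:41414");
        props.setProperty("backoff", "true");  // temporarily skip failed agents
        props.setProperty("host-selector", "round_robin");

        RpcClient client = RpcClientFactory.getInstance(props);
        try {
            // In a real Storm bolt, this append would live in execute().
            Event event = EventBuilder.withBody("one line from the socket server",
                    Charset.forName("UTF-8"));
            client.append(event);
        } finally {
            client.close();
        }
    }
}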
Thank you very much,
Chen


On Thu, Jan 9, 2014 at 8:58 PM, Joao Salcedo <joao.salcedo@gmail.com> wrote:

> Hi Chen,
>
> Maybe it would be worth checking this
> http://flume.apache.org/FlumeDeveloperGuide.html#loadbalancing-rpc-client
>
> Regards,
>
> Joao
>
>
> On Fri, Jan 10, 2014 at 3:50 PM, Jeff Lord <jlord@cloudera.com> wrote:
>
>> Have you taken a look at the load balancing rpc client?
>>
>>
>> On Thu, Jan 9, 2014 at 8:43 PM, Chen Wang <chen.apache.solr@gmail.com> wrote:
>>
>>> Jeff,
>>> I read this presentation at the beginning, but didn't find a solution to
>>> my use case. To simplify: I only have one data source (composed of 5
>>> socket servers), and I am looking for a fault-tolerant deployment of
>>> Flume that can read from this single data source and sink to HDFS in a
>>> fault-tolerant mode: when one node dies, another Flume node picks up and
>>> continues.
>>> Thanks,
>>> Chen
>>>
>>>
>>> On Thu, Jan 9, 2014 at 7:49 PM, Jeff Lord <jlord@cloudera.com> wrote:
>>>
>>>> Chen,
>>>>
>>>> Have you taken a look at this presentation on Planning and Deploying
>>>> Flume from ApacheCon?
>>>>
>>>>
>>>> http://archive.apachecon.com/na2013/presentations/27-Wednesday/Big_Data/11:45-Mastering_Sqoop_for_Data_Transfer_for_Big_Data-Arvind_Prabhakar/Arvind%20Prabhakar%20-%20Planning%20and%20Deploying%20Apache%20Flume.pdf
>>>>
>>>> It may have the answers you need.
>>>>
>>>> Best,
>>>>
>>>> Jeff
>>>>
>>>>
>>>> On Thu, Jan 9, 2014 at 7:24 PM, Chen Wang <chen.apache.solr@gmail.com> wrote:
>>>>
>>>>> Thanks Saurabh.
>>>>> If that is the case, I am actually thinking about using a Storm spout
>>>>> to talk to our socket servers, so that the Storm cluster takes care of
>>>>> the socket-reading part. Then, on each Storm node, start a Flume agent
>>>>> listening on an RPC port and writing to HDFS (with failover). In the
>>>>> Storm bolt, simply send the data over RPC so that Flume can pick it up.
>>>>> What do you think of this setup? It takes care of failover on both the
>>>>> source (by Storm) and the sink (by Flume), but it looks a little
>>>>> complicated to me.
>>>>> Chen
>>>>>
>>>>>
>>>>> On Thu, Jan 9, 2014 at 7:18 PM, Saurabh B <qna.list.141211@gmail.com> wrote:
>>>>>
>>>>>> Hi Chen,
>>>>>>
>>>>>> I don't think Flume has a way to configure multiple sources pointing
>>>>>> to the same data source. Of course you can do that, but you will end
>>>>>> up with duplicate data. Flume offers failover at the sink level.
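>>>>>>
>>>>>> For example, sink-level failover is configured through a sink group;
>>>>>> a minimal sketch (the agent name a1 and sink names k1/k2 here are
>>>>>> placeholders):
>>>>>>
>>>>>> # route events to k1 while it is healthy, fall back to k2 on failure
>>>>>> a1.sinkgroups = g1
>>>>>> a1.sinkgroups.g1.sinks = k1 k2
>>>>>> a1.sinkgroups.g1.processor.type = failover
>>>>>> a1.sinkgroups.g1.processor.priority.k1 = 10
>>>>>> a1.sinkgroups.g1.processor.priority.k2 = 5
>>>>>> a1.sinkgroups.g1.processor.maxpenalty = 10000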
>>>>>>
>>>>>> On Thu, Jan 9, 2014 at 6:56 PM, Chen Wang <chen.apache.solr@gmail.com> wrote:
>>>>>>
>>>>>>> OK, so after more research :) It seems that what I need is failover
>>>>>>> for the agent source (not failover for the sink):
>>>>>>> if one agent dies, another agent of the same kind starts running.
>>>>>>> Does Flume support this scenario?
>>>>>>> Thanks,
>>>>>>> Chen
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Jan 9, 2014 at 3:12 PM, Chen Wang <chen.apache.solr@gmail.com> wrote:
>>>>>>>
>>>>>>>> After reading more docs, it seems that if I want to achieve my
>>>>>>>> goal, I have to do the following:
>>>>>>>> 1. Run one agent with the custom source on one node. This agent
>>>>>>>> reads from those 5 socket servers and writes to some kind of
>>>>>>>> sink (maybe another socket?).
>>>>>>>> 2. On one or more other machines, set up collectors that read
>>>>>>>> from the agent's sink in step 1 and sink to HDFS (rough config
>>>>>>>> sketch below).
>>>>>>>> 3. Have a master node managing the nodes in steps 1 and 2.
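>>>>>>>>
>>>>>>>> In config terms, I imagine tiers 1 and 2 would look roughly like
>>>>>>>> this (agent names, hosts, and paths are only placeholders):
>>>>>>>>
>>>>>>>> # tier 1 (agent): custom source -> avro sink
>>>>>>>> agent1.sinks.k1.type = avro
>>>>>>>> agent1.sinks.k1.hostname = collector1.example.com
>>>>>>>> agent1.sinks.k1.port = 41414
>>>>>>>> # tier 2 (collector): avro source -> hdfs sink
>>>>>>>> collector1.sources.r1.type = avro
>>>>>>>> collector1.sources.r1.bind = 0.0.0.0
>>>>>>>> collector1.sources.r1.port = 41414
>>>>>>>> collector1.sinks.k1.type = hdfs
>>>>>>>> collector1.sinks.k1.hdfs.path = hdfs://namenode/flume/events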
>>>>>>>>
>>>>>>>> But this seems like overkill in my case: in step 1, I can already
>>>>>>>> sink to HDFS. Since data arrive at the socket servers much faster
>>>>>>>> than the translation part can process them, I want to be able to
>>>>>>>> add more nodes later to do the translation job. So what is the
>>>>>>>> correct setup?
>>>>>>>> Thanks,
>>>>>>>> Chen
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, Jan 9, 2014 at 2:38 PM, Chen Wang <chen.apache.solr@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Guys,
>>>>>>>>> In my environment, the client is 5 socket servers, so I wrote a
>>>>>>>>> custom source spawning 5 threads, each reading from one of them
>>>>>>>>> indefinitely, and the sink is HDFS (a Hive table). This works
>>>>>>>>> fine when running a flume-ng agent.
>>>>>>>>>
>>>>>>>>> But how can I deploy this in distributed mode (as a cluster)?
>>>>>>>>> I am confused about the 3 tiers (agent, collector, storage)
>>>>>>>>> mentioned in the docs. Do they apply to my case? How can I
>>>>>>>>> separate my agent/collector/storage? Apparently I can only have
>>>>>>>>> one agent running: multiple agents would result in duplicates
>>>>>>>>> from the socket servers. But I want another agent to take over
>>>>>>>>> if one agent dies. I would also like horizontal scalability for
>>>>>>>>> writing to HDFS. How can I achieve all this?
>>>>>>>>>
>>>>>>>>> Thank you very much for your advice.
>>>>>>>>> Chen
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>
