flume-user mailing list archives

From 周梦想 <abloz...@gmail.com>
Subject Re: flume tail source problem and performance
Date Wed, 06 Feb 2013 02:47:12 GMT
Thanks to JS and GuoWei, I'll try this.

Best Regards,
Andy Zhou

2013/2/4 GuoWei <wei.guo@wbkit.com>

>
>  Exactly. It's very easy to write a custom source and sink in Flume. You
> can just follow the default sources and sinks as examples, export your code
> as a jar file into Flume's lib folder, and then configure your own source
> or sink in the Flume conf file.
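As a rough illustration of the steps GuoWei describes (the plugin class and source name here are hypothetical, not an existing plugin), in Flume 0.9.x you would register the plugin class in flume-site.xml and then reference it in the node configuration:

```xml
<!-- flume-site.xml: register the (hypothetical) plugin class
     after copying its jar into Flume's lib folder -->
<property>
  <name>flume.plugin.classes</name>
  <value>com.example.flume.MyCustomSource</value>
</property>

<!-- then, in the Flume shell / node spec, the new source can be used
     like any built-in one:
     config [ag1, myCustomSource("..."), agentDFOSink("hadoop48",35853)] -->
```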
>
>
> Best Regards
> Guo Wei
>
>
> On 2013-2-4, at 4:13 PM, Jeong-shik Jang <jsjang@gruter.com> wrote:
>
>  Yes, you can; the Flume plugin framework provides an easy way to
> implement and apply your own source, decorator and sink.
>
> -JS
>
> On 2/4/13 5:07 PM, 周梦想 wrote:
>
> Hi JS,
>
>  Thank you for your reply. So there is a significant shortcoming in
> collecting logs with Flume. Can I write my own agent to send logs via the
> Thrift protocol directly to the collector server?
>
>  Best Regards,
> Andy Zhou
>
>
> 2013/2/4 Jeong-shik Jang <jsjang@gruter.com>
>
>>  Hi Andy,
>>
>> 1. "startFromEnd=true" in your source configuration means data loss can
>> happen at restart on the tail side, because Flume will ignore any events
>> generated during the restart and always start from the end of the file.
>> 2. With agentSink, data duplication can happen due to ack delay from the
>> master or at agent restart.
>>
>> I think that is why Flume NG no longer supports tail and instead lets the
>> user handle it with a script or program; tailing is a tricky job.
>>
>> My suggestion is to use agentBEChain in the agent tier and DFO in the
>> collector tier; you can still lose some data during failover when a
>> failure occurs.
>> To minimize loss and duplication, implementing a checkpoint function in
>> tail can also help.
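The checkpoint idea JS mentions can be sketched outside Flume roughly like this (a minimal Python illustration; the offset-file scheme is an assumption for the sketch, not Flume behavior):

```python
import os

def read_new_lines(log_path, offset_path):
    """Return lines appended to log_path since the last checkpoint,
    persisting the new byte offset in offset_path."""
    offset = 0
    if os.path.exists(offset_path):
        with open(offset_path) as f:
            offset = int(f.read().strip() or 0)
    with open(log_path, "rb") as f:
        f.seek(0, os.SEEK_END)
        end = f.tell()
        if offset > end:  # log was rotated or truncated; start from the beginning
            offset = 0
        f.seek(offset)
        data = f.read()
    # persist the checkpoint only after a successful read
    with open(offset_path, "w") as f:
        f.write(str(offset + len(data)))
    return data.decode("utf-8", errors="replace").splitlines()
```

On restart, an agent using this resumes from the stored offset instead of resending the whole file; to also avoid duplicates after a crash, the offset would ideally be persisted only after the collector acknowledges the batch.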
>>
>> Having a monitoring system to detect failures is very important as well,
>> so that you can notice a failure and take recovery action quickly.
>>
>> -JS
>>
>>
>> On 2/4/13 4:27 PM, 周梦想 wrote:
>>
>> Hi JS,
>> We can't accept agentBESink. Because these logs are important for data
>> analysis, we can't tolerate any errors in the data; losing data and
>> duplicating it are both unacceptable.
>> One agent's configuration is: tail("H:/game.log", startFromEnd=true)
>> agentSink("hadoop48", 35853)
>> Every time this Windows agent restarts, it resends all the data to the
>> collector server.
>> If for some reason we restart the agent node, we can't get a mark of how
>> far in the log the agent has already sent.
>>
>>
>> 2013/1/29 Jeong-shik Jang <jsjang@gruter.com>
>>
>>> Hi Andy,
>>>
>>> As you set the startFromEnd option to true, the resend might be caused
>>> by the DFO mechanism (agentDFOSink); when you restart a Flume node in DFO
>>> mode, all events in intermediate stages (logged, writing, sending and so
>>> on) roll back to the logged stage, which means resending and duplication.
>>>
>>> Also, for better performance, you may want to use agentBESink instead of
>>> agentDFOSink.
>>> I recommend using agentBEChain for failover in case of a failure in the
>>> collector tier, if you have multiple collectors.
>>>
>>> -JS
>>>
>>>
>>> On 1/29/13 4:29 PM, Alexander Alten-Lorenz wrote:
>>>
>>>> Hi,
>>>>
>>>> you could use tail -F, but this depends on the external source, which
>>>> Flume has no control over. You can write your own script and include it.
>>>>
>>>> What's the content of the /tmp/flume/agent/agent*.*/ directories? Are
>>>> the sent and sending directories clean?
>>>>
>>>> - Alex
>>>>
>>>> On Jan 29, 2013, at 8:24 AM, 周梦想 <ablozhou@gmail.com> wrote:
>>>>
>>>>> hello,
>>>>> 1. I want to tail a log source and write it to HDFS. Below is the
>>>>> configuration:
>>>>> config [ag1, tail("/home/zhouhh/game.log",startFromEnd=true),
>>>>> agentDFOSink("hadoop48",35853) ;]
>>>>> config [ag2, tail("/home/zhouhh/game.log",startFromEnd=true),
>>>>> agentDFOSink("hadoop48",35853) ;]
>>>>> config [co1, collectorSource( 35853 ),  [collectorSink(
>>>>> "hdfs://hadoop48:54310/user/flume/%y%m/%d
>>>>> ","%{host}-",5000,raw),collectorSink(
>>>>> "hdfs://hadoop48:54310/user/flume/%y%m","%{host}-",10000,raw)]]
>>>>>
>>>>>
>>>>> I found that if I restart the agent node, it resends the content of
>>>>> game.log to the collector. Is there a way to send only the logs that
>>>>> haven't been sent before? Or do I have to make a mark myself, or
>>>>> remove the logs manually, when restarting the agent node?
>>>>>
>>>>> 2. I tested the performance of Flume and found it a bit slow.
>>>>> With the configuration above, I get only about 50MB/minute.
>>>>> I changed the configuration to the following:
>>>>> ag1:tail("/home/zhouhh/game.log",startFromEnd=true)|batch(1000) gzip
>>>>> agentDFOSink("hadoop48",35853);
>>>>>
>>>>> config [co1, collectorSource( 35853 ), [collectorSink(
>>>>> "hdfs://hadoop48:54310/user/flume/%y%m/%d
>>>>> ","%{host}-",5000,raw),collectorSink(
>>>>> "hdfs://hadoop48:54310/user/flume/%y%m","%{host}-",10000,raw)]]
>>>>>
>>>>> I sent 300MB of logs and it took about 3 minutes, so that's about
>>>>> 100MB/minute.
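The effect of the batch(1000) gzip decorators in the configuration above can be approximated outside Flume (a rough sketch; the sample log lines and the use of Python's gzip defaults are assumptions):

```python
import gzip

def batch_and_compress(events, batch_size=1000):
    """Group events into fixed-size batches and gzip each batch,
    mimicking a batch(1000) gzip decorator chain."""
    batches = [events[i:i + batch_size]
               for i in range(0, len(events), batch_size)]
    return [gzip.compress("\n".join(batch).encode("utf-8"))
            for batch in batches]

# Repetitive log lines compress well, so far fewer bytes cross the network.
events = ["2013-01-29 12:00:00 game.log player login id=%06d" % i
          for i in range(5000)]
raw_bytes = sum(len(e) for e in events)
compressed_bytes = sum(len(b) for b in batch_and_compress(events))
```

Compression trades agent-side CPU for network bandwidth, and larger batches amortize per-event overhead, which is why the batched, gzipped configuration roughly doubled the observed throughput.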
>>>>>
>>>>> When I send the log from ag1 to co1 via scp, I get about 30MB/second.
>>>>>
>>>>> Can anyone give me some ideas?
>>>>>
>>>>> thanks!
>>>>>
>>>>> Andy
>>>>>
>>>> --
>>>> Alexander Alten-Lorenz
>>>> http://mapredit.blogspot.com
>>>> German Hadoop LinkedIn Group: http://goo.gl/N8pCF
>>>>
>>>>
>>>>
>>>>
>>>
>>>   --
>>> Jeong-shik Jang / jsjang@gruter.com
>>> Gruter, Inc., R&D Team Leader
>>> www.gruter.com
>>> Enjoy Connecting
>>>
>>>
>>
>>
>> --
>> Jeong-shik Jang / jsjang@gruter.com
>> Gruter, Inc., R&D Team Leader
>> www.gruter.com
>> Enjoy Connecting
>>
>>
>
>
> --
> Jeong-shik Jang / jsjang@gruter.com
> Gruter, Inc., R&D Team Leader
> www.gruter.com
> Enjoy Connecting
>
>
>
