flume-user mailing list archives

From Alexander Alten-Lorenz <wget.n...@gmail.com>
Subject Re: flume tail source problem and performance
Date Mon, 04 Feb 2013 07:39:25 GMT
Hi Andy,

I meant writing your own program / script to parse the data (instead of using
tail -*) so that you have some control over the contents. Note that when a
Flume agent is restarted, the marker for tail is lost as well. This comes from
tail itself; Flume has no control over it.

- Alex

On Feb 4, 2013, at 8:33 AM, 周梦想 <ablozhou@gmail.com> wrote:

> Hi Alex,
> 
> You mean I should write a script to check the directories?
> [zhouhh@Hadoop46 ag1]$ pwd
> /tmp/flume-zhouhh/agent/ag1
> [zhouhh@Hadoop46 ag1]$ ls
> dfo_error  dfo_import  dfo_logged  dfo_sending  dfo_writing  done  error
> import  logged  sending  sent  writing
> 
> How should I check them to avoid losing data while preventing resends? Should I clean the sending dir?
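The check Andy is asking about could be sketched like this, using the state directories from his listing (the base path is illustrative; adjust it to your agent's log directory). A non-empty writing or sending directory means events are still in flight:

```shell
#!/bin/sh
# Count files left in each Flume DFO state directory for one agent.
# Nothing pending in writing/sending means the agent has drained.
check_dfo() {
  base="$1"
  for d in writing sending sent dfo_writing dfo_sending; do
    n=$(ls "$base/$d" 2>/dev/null | wc -l | tr -d ' ')
    echo "$d: $n file(s)"
  done
}

check_dfo /tmp/flume-zhouhh/agent/ag1
```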
> 
> thanks!
> Andy
> 
> 2013/1/29 Alexander Alten-Lorenz <wget.null@gmail.com>
> 
>> Hi,
>> 
>> You could use tail -F, but this depends on the external source; Flume
>> has no control over it. You can write your own script and include it.
>> 
>> What is the content of the /tmp/flume/agent/agent*.*/ directories?
>> Are the sent and sending directories clean?
>> 
>> - Alex
>> 
>> On Jan 29, 2013, at 8:24 AM, 周梦想 <ablozhou@gmail.com> wrote:
>> 
>>> hello,
>>> 1. I want to tail a log source and write it to HDFS. Below is the configuration:
>>> config [ag1, tail("/home/zhouhh/game.log",startFromEnd=true),
>>> agentDFOSink("hadoop48",35853) ;]
>>> config [ag2, tail("/home/zhouhh/game.log",startFromEnd=true),
>>> agentDFOSink("hadoop48",35853) ;]
>>> config [co1, collectorSource( 35853 ), [collectorSink(
>>> "hdfs://hadoop48:54310/user/flume/%y%m/%d","%{host}-",5000,raw),collectorSink(
>>> "hdfs://hadoop48:54310/user/flume/%y%m","%{host}-",10000,raw)]]
>>> 
>>> 
>>> I found that if I restart the agent node, it resends the content of
>>> game.log to the collector. Is there a way to send only the logs that
>>> haven't been sent before? Or do I have to keep a marker myself, or remove
>>> the logs manually, when restarting the agent node?
>>> 
>>> 2. I tested the performance of Flume and found it a bit slow.
>>> With the configuration above, throughput is only about 50 MB/minute.
>>> I changed the configuration to the following:
>>> ag1:tail("/home/zhouhh/game.log",startFromEnd=true)|batch(1000) gzip
>>> agentDFOSink("hadoop48",35853);
>>> 
>>> config [co1, collectorSource( 35853 ), [collectorSink(
>>> "hdfs://hadoop48:54310/user/flume/%y%m/%d","%{host}-",5000,raw),collectorSink(
>>> "hdfs://hadoop48:54310/user/flume/%y%m","%{host}-",10000,raw)]]
>>> 
>>> I sent a 300 MB log and it took about 3 minutes, so that's about
>>> 100 MB/minute.
>>> 
>>> When I send the same log from ag1 to co1 via scp, it's about 30 MB/second.
>>> 
>>> Can anyone give me any ideas?
>>> 
>>> thanks!
>>> 
>>> Andy
>> 
>> --
>> Alexander Alten-Lorenz
>> http://mapredit.blogspot.com
>> German Hadoop LinkedIn Group: http://goo.gl/N8pCF
>> 
>> 

--
Alexander Alten-Lorenz
http://mapredit.blogspot.com
German Hadoop LinkedIn Group: http://goo.gl/N8pCF

