flume-user mailing list archives

From Ossi <los...@gmail.com>
Subject Re: Reliability of Flume (autoE2EChain)
Date Mon, 12 Dec 2011 11:31:55 GMT
Hi,

I'll continue this monologue of mine.

It seems that autoDFOChain is more reliable than autoE2EChain; could this
be correct?
Over the past few days we have added 1 node/agent with 4 flows using
autoDFOChain. We also changed 1 existing node/agent to use autoDFOChain.

Both of those have been working fine for 3-4 days.

The last node/agent still using autoE2EChain has been having problems with
1 flow.

Today I tried to reconfigure that agent with autoE2EChain, without success.
As soon as I switched it to use autoDFOChain, it started to work.

Oddly, no errors of any kind are visible on either the master or the
collector.
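
One thing worth checking before concluding that nothing was logged: the
delivery failures quoted further down in this thread were written at INFO
level, so filtering for ERROR would miss them. Below is a minimal Python
sketch of a scan for those messages. The two patterns are modelled on the
agent log lines quoted below; the function name, and the assumption that
each entry sits on a single line in the real log file, are illustrative
only.

```python
import re
from collections import Counter

# Patterns modelled on the agent log lines quoted later in this thread.
# Assumption: each log entry is on a single line in the real log file.
OPEN_FAILED = re.compile(
    r"open attempt \d+ failed, backoff \(\d+ms\): "
    r"Failed to open thrift event sink to (?P<host>[\w.-]+):(?P<port>\d+)"
)
APPEND_FAILED = re.compile(r"StubbornAppendSink: append failed on event")


def scan_agent_log(lines):
    """Count failure messages regardless of the level they were logged at."""
    open_failures = Counter()  # (collector host, port) -> failed open attempts
    append_failures = 0
    for line in lines:
        match = OPEN_FAILED.search(line)
        if match:
            key = (match.group("host"), int(match.group("port")))
            open_failures[key] += 1
        elif APPEND_FAILED.search(line):
            append_failures += 1
    return open_failures, append_failures


if __name__ == "__main__":
    # Shortened copies of the lines quoted below (event body elided).
    sample = [
        "2011-12-01 17:27:19,407 INFO "
        "com.cloudera.flume.handlers.debug.StubbornAppendSink: append failed "
        "on event '...' with error: Append failed "
        "java.net.SocketException: Connection reset",
        "2011-12-01 17:27:19,409 INFO "
        "com.cloudera.flume.handlers.debug.InsistentOpenDecorator: open "
        "attempt 0 failed, backoff (1000ms): Failed to open thrift event "
        "sink to server6:35858 : java.net.ConnectException: Connection refused",
    ]
    opens, appends = scan_agent_log(sample)
    for (host, port), count in opens.items():
        print(f"{count} failed open attempt(s) to {host}:{port}")
    print(f"{appends} failed append(s)")
```

Running it over each agent's log would show whether a flow keeps retrying a
collector port that is no longer open.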

This is how I did the reconfiguration:

exec unconfig ff-agent-http-error-fe-1
exec unconfig ff-agent-http-error-fe-2
exec unconfig ff-collector-http-error-fe

exec unmap server3 ff-agent-http-error-fe-1
exec unmap server3 ff-agent-http-error-fe-2
exec unmap server6 ff-collector-http-error-fe

exec decommission ff-agent-http-error-fe-1
exec decommission ff-agent-http-error-fe-2
exec decommission ff-collector-http-error-fe

exec purge ff-agent-http-error-fe-1
exec purge ff-agent-http-error-fe-2
exec purge ff-collector-http-error-fe

exec refreshAll

exec map server3 ff-agent-http-error-fe-1
exec map server3 ff-agent-http-error-fe-2
exec map server6 ff-collector-http-error-fe

exec config ff-agent-http-error-fe-1 ff-flow-http-error-fe 'tailDir("/logs/ff/httpd-fe-1/", "ff_error_log-\\d{4}-\\d{2}-\\d{2}$", true)' autoDFOChain
exec config ff-agent-http-error-fe-2 ff-flow-http-error-fe 'tailDir("/logs/ff/httpd-fe-2/", "ff_error_log-\\d{4}-\\d{2}-\\d{2}$", true)' autoDFOChain
exec config ff-collector-http-error-fe ff-flow-http-error-fe autoCollectorSource 'collectorSink("hdfs://namenode:8020/flume/ff/httpd-fe/%Y-%m-%d/", "%{host}-error-")'

waitForNodesActive 0 ff-agent-http-error-fe-1 ff-agent-http-error-fe-2 ff-collector-http-error-fe

exec refreshAll
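
Since the same teardown and rebuild sequence has to be repeated for every
flow, it could be generated rather than typed by hand. The Python sketch
below emits the commands in the order used above (unconfig, unmap,
decommission, purge, refreshAll, map, config, waitForNodesActive,
refreshAll). It only formats text for "flume shell -s"; the function and
parameter names are made up for illustration, and it uses no Flume API.

```python
def rebuild_script(nodes, configs):
    """Build a flume shell script that tears down and recreates logical nodes.

    nodes:   list of (physical_host, logical_node) pairs
    configs: dict mapping logical_node -> (flow, source, sink)
    """
    names = [name for _, name in nodes]
    lines = []
    # Teardown, one command type at a time, as in the script above.
    lines += [f"exec unconfig {name}" for name in names]
    lines += [f"exec unmap {host} {name}" for host, name in nodes]
    lines += [f"exec decommission {name}" for name in names]
    lines += [f"exec purge {name}" for name in names]
    lines.append("exec refreshAll")
    # Rebuild: map logical nodes to hosts, then configure each one.
    lines += [f"exec map {host} {name}" for host, name in nodes]
    for name in names:
        flow, source, sink = configs[name]
        lines.append(f"exec config {name} {flow} {source} {sink}")
    lines.append("waitForNodesActive 0 " + " ".join(names))
    lines.append("exec refreshAll")
    return "\n".join(lines)


if __name__ == "__main__":
    flow = "ff-flow-http-error-fe"
    nodes = [
        ("server3", "ff-agent-http-error-fe-1"),
        ("server3", "ff-agent-http-error-fe-2"),
        ("server6", "ff-collector-http-error-fe"),
    ]
    pattern = r"ff_error_log-\\d{4}-\\d{2}-\\d{2}$"  # escaped as in the script
    configs = {
        f"ff-agent-http-error-fe-{i}": (
            flow,
            f"""'tailDir("/logs/ff/httpd-fe-{i}/", "{pattern}", true)'""",
            "autoDFOChain",
        )
        for i in (1, 2)
    }
    configs["ff-collector-http-error-fe"] = (
        flow,
        "autoCollectorSource",
        "'collectorSink(\"hdfs://namenode:8020/flume/ff/httpd-fe/%Y-%m-%d/\", "
        "\"%{host}-error-\")'",
    )
    print(rebuild_script(nodes, configs))
```

The output can be saved to a file and passed to the flume shell with -s,
the same way flume-aa.txt is used in the first message of this thread.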

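A side note on the tailDir patterns: they are anchored at the end with $
but not at the start. In the original aa setup quoted below,
aa_access_log-... and ssl-aa_access_log-... files live in the same
directory, so it matters whether the pattern is applied to the whole file
name or to any part of it. Which of the two tailDir does is not established
in this thread. The Python sketch below only illustrates the difference
between the two modes; the function name is illustrative.

```python
import re

# Pattern from the aa access-log config quoted below, shell escaping removed.
PATTERN = re.compile(r"aa_access_log-\d{4}-\d{2}-\d{2}$")


def match_modes(filename):
    """Return (whole_name_match, substring_match) for a file name."""
    whole = PATTERN.fullmatch(filename) is not None  # like Java's Matcher.matches()
    partial = PATTERN.search(filename) is not None   # like Java's Matcher.find()
    return whole, partial


if __name__ == "__main__":
    for name in ("aa_access_log-2011-12-01", "ssl-aa_access_log-2011-12-01"):
        print(name, match_modes(name))
```

If substring matching were in effect, the HTTP flow would also pick up the
ssl- files; prefixing the pattern with ^ would rule that out in either mode.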

Regards,
Ossi


On Fri, Dec 2, 2011 at 3:22 PM, Ossi <lossil@gmail.com> wrote:

> And one more thing: the collector's Jetty is unresponsive again.
> It serves the front page with this content:
> Flume Administration
>
>     Flume's Agent
>
> But it neither redirects to nor serves flumeagent.jsp.
>
> br,
> Ossi
>
>
> On Fri, Dec 2, 2011 at 3:12 PM, Ossi <lossil@gmail.com> wrote:
>
>> hi!
>>
>> Unfortunately this happened again:
>>
>> The collector stopped writing one flow to HDFS. Other flows from the same
>> host seem to work fine.
>>
>> Here is its last entry in HDFS:
>> -rw-r--r--   3 flume supergroup        260 2011-12-01 16:27
>> /flume/aa/httpd-fe/2011-12-01/server2-ssl-access-20111201-172714555+0100.4251339811775625.00000225
>>
>> And logs from collector:
>> 2011-12-01 17:27:15,523 INFO
>> com.cloudera.flume.handlers.thrift.ThriftEventSource: Closed server on port
>> 35858...
>> ....
>> 2011-12-01 17:27:44,798 INFO
>> com.cloudera.flume.handlers.hdfs.EscapedCustomDfsSink: Closing
>> hdfs://hadoop:8020/flume/aa/httpd-fe/2011-12-01/server2-ssl-access-20111201-172714555+0100.4251339811775625.00000225
>> 2011-12-01 17:27:44,798 INFO
>> com.cloudera.flume.handlers.hdfs.CustomDfsSink: Closing HDFS file:
>> hdfs://hq-priv-01:8020/flume/aa/httpd-fe/2011-12-01/server2-ssl-access-20111201-172714555+0100.4251339811775625.00000225.tmp
>> 2011-12-01 17:27:44,798 INFO
>> com.cloudera.flume.handlers.hdfs.CustomDfsSink: done writing raw file to
>> hdfs
>>
>> On the agent there are related errors (at INFO level, why?):
>> 2011-12-01 17:27:19,407 INFO
>> com.cloudera.flume.handlers.debug.StubbornAppendSink: append failed on
>> event 'server2 [INFO Thu Dec 01 17:27:13 CET 2011] { AckChecksum :
>> (long)1085259347  (string) '^@^@^@^@@��S' (double)5.3618936E-315 } {
>> AckTag : 20111201-172709320+0100.2013898777141269.00000115 } { AckType :
>> msg } { tailSrcFile : ssl-aa_access_log-2011-12-01 } 1.2.3.4
>> [01/Dec/2011:17:27:13 +0100] www.foo.bar \"GET / HTTP/1.1\" 400 226 Age:-
>> \"-\" \"-\"' with error: Append failed java.net.SocketException: Connection
>> reset
>> 2011-12-01 17:27:19,408 INFO
>> com.cloudera.flume.handlers.thrift.ThriftEventSink: ThriftEventSink on port
>> 35858 closed
>> 2011-12-01 17:27:19,409 INFO
>> com.cloudera.flume.handlers.debug.InsistentOpenDecorator: open attempt 0
>> failed, backoff (1000ms): Failed to open thrift event sink to server6:35858
>> : java.net.ConnectException: Connection refused
>>
>> So to me it looks like the flow was (and still is) trying to use the Thrift
>> server on port 35858 on the collector (server6), which was closed for
>> some reason.
>>
>> Any ideas why this has happened?
>> To me this looks like a bug, unless it is a known issue.
>>
>> br,
>>
>> Ossi
>>
>>
>> On Wed, Nov 30, 2011 at 9:25 AM, Ossi <lossil@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> I'm new on the list and I do hope that some of you can help me. :)
>>>
>>> We are testing Flume in a fully distributed configuration with isolated
>>> flows.
>>> Setup:
>>> 1 master server (server5)
>>> 1 collector (server6)
>>> 2 agents (server2 and server3)
>>>
>>> Both agent servers have 8 logical nodes collecting Apache httpd logs.
>>> There are 2 Apache instances running, and we want to collect HTTP and
>>> HTTPS logs, with access and error logs kept separate.
>>>
>>> Suddenly Flume stopped writing some, but not all, of the files from one
>>> of the agent servers to HDFS.
>>> First it stopped with aa_error_log... (it wrote that for only a few
>>> moments), and later, after running fine for several hours, it stopped
>>> writing aa_access_logs.
>>>
>>> There aren't any error messages in the master, collector or agent logs.
>>> From the agent's point of view it seemed to have been delivering those
>>> files all the time (I'm not sure how to read those logs).
>>> It seems like the collector just suddenly stopped delivering those files
>>> to HDFS.
>>>
>>> The collector seemed to be in bad shape somehow, since its Jetty
>>> didn't function too well either:
>>> it opened http://localhost:35862/, but stalled while trying to get the
>>> flumeagent.jsp file.
>>>
>>> After a restart of the collector (the next day) it continued to write
>>> files to HDFS, but it missed all the files from the past 8 hours. The
>>> web interface worked fine as well.
>>>
>>> Unfortunately we don't have any logs available, since we lost them due
>>> to bug https://issues.cloudera.org/browse/FLUME-631.
>>>
>>> So, does anybody have any idea what could have caused this, or do we
>>> need to wait and see if it happens again?
>>>
>>>
>>> Log collection was configured like this (for both aa and bb) using
>>> "flume shell -c server5 -s flume-aa.txt":
>>>
>>> cat flume-aa.txt
>>> exec map server3 aa-agent-http-fe-1
>>> exec map server3 aa-agent-http-fe-2
>>> exec map server3 aa-agent-https-fe-1
>>> exec map server3 aa-agent-https-fe-2
>>> exec map server3 aa-agent-http-error-fe-1
>>> exec map server3 aa-agent-http-error-fe-2
>>> exec map server3 aa-agent-https-error-fe-1
>>> exec map server3 aa-agent-https-error-fe-2
>>>
>>> exec map server6 aa-collector-http-fe
>>> exec map server6 aa-collector-https-fe
>>> exec map server6 aa-collector-http-error-fe
>>> exec map server6 aa-collector-https-error-fe
>>>
>>>
>>> # HTTP
>>> exec config aa-agent-http-fe-1 aa-flow-http-fe
>>> 'tailDir("/logs/aa/httpd-fe-1/", "aa_access_log-\\d{4}-\\d{2}-\\d{2}$",
>>> true)' autoE2EChain
>>> exec config aa-agent-http-fe-2 aa-flow-http-fe
>>> 'tailDir("/logs/aa/httpd-fe-2/", "aa_access_log-\\d{4}-\\d{2}-\\d{2}$",
>>> true)' autoE2EChain
>>>
>>> exec config aa-collector-http-fe aa-flow-http-fe autoCollectorSource
>>> 'collectorSink("hdfs://hdfs-server:8020/flume/aa/httpd-fe/%Y-%m-%d/",
>>> "%{host}-access-")'
>>>
>>> # HTTPS
>>> exec config aa-agent-https-fe-1 aa-flow-https-fe
>>> 'tailDir("/logs/aa/httpd-fe-1/", "ssl-aa_access_log-\\d{4}-\\d{2}-\\d{2}$",
>>> true)' autoE2EChain
>>> exec config aa-agent-https-fe-2 aa-flow-https-fe
>>> 'tailDir("/logs/aa/httpd-fe-2/", "ssl-aa_access_log-\\d{4}-\\d{2}-\\d{2}$",
>>> true)' autoE2EChain
>>>
>>> exec config aa-collector-https-fe aa-flow-https-fe autoCollectorSource
>>> 'collectorSink("hdfs://hdfs-server:8020/flume/aa/httpd-fe/%Y-%m-%d/",
>>> "%{host}-ssl-access-")'
>>>
>>> # HTTP ERROR
>>> exec config aa-agent-http-error-fe-1 aa-flow-http-error-fe
>>> 'tailDir("/logs/aa/httpd-fe-1/", "aa_error_log-\\d{4}-\\d{2}-\\d{2}$",
>>> true)' autoE2EChain
>>> exec config aa-agent-http-error-fe-2 aa-flow-http-error-fe
>>> 'tailDir("/logs/aa/httpd-fe-2/", "aa_error_log-\\d{4}-\\d{2}-\\d{2}$",
>>> true)' autoE2EChain
>>> exec config aa-collector-http-error-fe aa-flow-http-error-fe
>>> autoCollectorSource
>>> 'collectorSink("hdfs://hdfs-server:8020/flume/aa/httpd-fe/%Y-%m-%d/",
>>> "%{host}-error-")'
>>>
>>> # HTTPS ERROR
>>> exec config aa-agent-https-error-fe-1 aa-flow-https-error-fe
>>> 'tailDir("/logs/aa/httpd-fe-1/", "ssl-aa_error_log-\\d{4}-\\d{2}-\\d{2}$",
>>> true)' autoE2EChain
>>> exec config aa-agent-https-error-fe-2 aa-flow-https-error-fe
>>> 'tailDir("/logs/aa/httpd-fe-2/", "ssl-aa_error_log-\\d{4}-\\d{2}-\\d{2}$",
>>> true)' autoE2EChain
>>> exec config aa-collector-https-error-fe aa-flow-https-error-fe
>>> autoCollectorSource
>>> 'collectorSink("hdfs://hdfs-server:8020/flume/aa/httpd-fe/%Y-%m-%d/",
>>> "%{host}-ssl-error-")'
>>>
>>> waitForNodesActive 0 aa-agent-http-fe-1 aa-agent-http-fe-2
>>> aa-agent-https-fe-1 aa-agent-https-fe-2 aa-agent-http-error-fe-1
>>> aa-agent-http-error-fe-2 aa-agent-https-error-fe-1
>>> aa-agent-https-error-fe-2 aa-collector-http-fe aa-collector-https-fe
>>> aa-collector-http-error-fe aa-collector-https-error-fe
>>>
>>> exec refreshAll
>>>
>>>
>>>
>>
>
