flume-user mailing list archives

From: Will McQueen <w...@cloudera.com>
Subject: Re: Newbee question about flume 1.2 set up
Date: Wed, 20 Jun 2012 17:12:49 GMT
Hi Bhaskar,

Are you requesting that the RollingFileSink be able to construct the file name from escape
sequences, as can be done with the HDFSEventSink? Please open a ticket with your request.

Cheers,
Will
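
For context, the escape-sequence support being discussed looks roughly like this on the HDFS
side (a minimal sketch, not from this thread; the agent, channel, and path names are made up):

a1.sinks.hdfs_sink.type = hdfs
a1.sinks.hdfs_sink.channel = c1
# %{host} is taken from an event header, %Y/%m/%d from the event's timestamp header
a1.sinks.hdfs_sink.hdfs.path = hdfs://namenode/flume/%{host}/%Y-%m-%d

HDFSEventSink expands those escapes when it builds the output path; the request above is for
RollingFileSink's sink.directory to do the same.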

On Jun 20, 2012, at 9:04 AM, Bhaskar <bmarthi@gmail.com> wrote:

> I wish a similar patch got applied to RollingFileSink as well. Currently RollingFileSink
> does not reconstruct the path based on placeholder values. This patch was done to HDFSEventSink.
> Do you think I should request that through Jira?
> 
> Bhaskar
> 
> On Tue, Jun 19, 2012 at 10:59 PM, Mike Percy <mpercy@cloudera.com> wrote:
> Will just contributed an Interceptor to provide this out of the box:
> 
> https://issues.apache.org/jira/browse/FLUME-1284
> 
> Regards
> Mike
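
To sketch what that interceptor gives you (an illustration only; the property names follow the
Flume 1.x user guide, and the source used here is the "tail" exec source from the agent3
configuration quoted further down): attaching it to the sending agent's source stamps every event
with a "host" header that downstream components can read.

# hypothetical wiring of the host interceptor onto the exec source
agent3.sources.tail.interceptors = host_int
agent3.sources.tail.interceptors.host_int.type = host
agent3.sources.tail.interceptors.host_int.useIP = false
agent3.sources.tail.interceptors.host_int.hostHeader = host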
> 
> 
> On Tuesday, June 19, 2012 at 2:54 PM, Mohammad Tariq wrote:
> 
> > There is no problem from your side. Have a look at this:
> > http://mail-archives.apache.org/mod_mbox/incubator-flume-user/201206.mbox/%3CCAGPLoJKLthyoecEYnJRscahe8q6i4kKH1ADsL3qoQCAQo=ig9g@mail.gmail.com%3E
> >
> > Regards,
> > Mohammad Tariq
> >
> >
> > On Wed, Jun 20, 2012 at 1:42 AM, Bhaskar <bmarthi@gmail.com> wrote:
> > > Unfortunately, that part is not working as expected. Must be my mistake
> > > somewhere in the configuration. Here is my sink configuration.
> > >
> > > agent4.sinks.svc_0_sink.type = FILE_ROLL
> > > agent4.sinks.svc_0_sink.sink.directory=/var/logs/agent4/%{host}
> > > agent4.sinks.svc_0_sink.rollInterval=5400
> > > agent4.sinks.svc_0_sink.channel = MemoryChannel-2
> > >
> > > Any thoughts on how to define host specific directory/file?
> > >
> > > Bhaskar
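
One possible workaround with the pieces already in this thread (a sketch only, with hypothetical
channel, sink, and host names): since FILE_ROLL takes sink.directory literally, events carrying a
"host" header (for example from the host interceptor mentioned at the top of this thread) can be
routed to per-host channels with a multiplexing channel selector, each drained by its own
FILE_ROLL sink pointing at a fixed directory.

agent4.channels = MemoryChannel-A MemoryChannel-B
agent4.sinks = svc_a_sink svc_b_sink

agent4.channels.MemoryChannel-A.type = memory
agent4.channels.MemoryChannel-B.type = memory

# route on the "host" header; unmapped hosts fall back to MemoryChannel-A
agent4.sources.avro-collection-source.channels = MemoryChannel-A MemoryChannel-B
agent4.sources.avro-collection-source.selector.type = multiplexing
agent4.sources.avro-collection-source.selector.header = host
agent4.sources.avro-collection-source.selector.mapping.hostA = MemoryChannel-A
agent4.sources.avro-collection-source.selector.mapping.hostB = MemoryChannel-B
agent4.sources.avro-collection-source.selector.default = MemoryChannel-A

agent4.sinks.svc_a_sink.type = FILE_ROLL
agent4.sinks.svc_a_sink.sink.directory = /var/logs/agent4/hostA
agent4.sinks.svc_a_sink.channel = MemoryChannel-A

agent4.sinks.svc_b_sink.type = FILE_ROLL
agent4.sinks.svc_b_sink.sink.directory = /var/logs/agent4/hostB
agent4.sinks.svc_b_sink.channel = MemoryChannel-B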
> > >
> > > On Tue, Jun 19, 2012 at 10:01 AM, Mohammad Tariq <dontariq@gmail.com> wrote:
> > > >
> > > > Hello Bhaskar,
> > > >
> > > > That's great... And we can use "%{host}" as the escape
> > > > sequence to prefix our filenames (am I getting you correctly?). And I
> > > > am waiting anxiously for your guide, as I am still a newbie. :-)
> > > >
> > > > Regards,
> > > > Mohammad Tariq
> > > >
> > > >
> > > > On Tue, Jun 19, 2012 at 7:16 PM, Bhaskar <bmarthi@gmail.com> wrote:
> > > > > Thank you guys for the responses. I actually was able to get around this
> > > > > problem by tinkering around with my settings. I finally ended up with a
> > > > > capacity of 10000 and commented out transactionCapacity (I originally set it
> > > > > to 10) and it started working. Thanks for the insight. It took me a bit of
> > > > > time to figure out the inner workings of AVRO to get it to send data in the
> > > > > correct format. So, I got over that hump :-). Here is my flow for the POC.
> > > > >
> > > > > Host A agent --> Source tail exec --> AVRO Client Sink --> jdbc channel
> > > > > (flume-ng avro-client -H <<Host>> -p <<port>> -F <<file to read>> --conf ../conf/)
> > > > > Host B agent --> Source tail exec --> AVRO Client Sink --> jdbc channel
> > > > > (flume-ng avro-client -H <<Host>> -p <<port>> -F <<file to read>> --conf ../conf/)
> > > > > Host C agent --> avro-collector source --> file sink to local directory --
> > > > > Memory channel
> > > > >
> > > > > The issue I am running into is that I am unable to uniquely identify the
> > > > > source of the log in the sink (meaning the log events from Host A and Host B
> > > > > are combined into the same log on disk and mixed up). Is there a way to
> > > > > provide a unique identifier from the source so that we can track the origin
> > > > > of the log? I am hoping to see in my sink log:
> > > > >
> > > > > Host A -- some log entry
> > > > > Host B -- some log entry, etc.
> > > > >
> > > > > Is this feasible, or are there any alternative mechanisms to achieve this? I
> > > > > am putting together a newbie guide that might help answer some of these
> > > > > questions for others as I explore this architecture.
> > > > >
> > > > > As always, thanks for your assistance,
> > > > > Bhaskar
> > > > >
> > > > >
> > > > > On Tue, Jun 19, 2012 at 2:59 AM, Juhani Connolly <juhani_connolly@cyberagent.co.jp> wrote:
> > > > > >
> > > > > > Hello Bhaskar,
> > > > > >
> > > > > > Using Avro is generally the recommended method to handle multi-hop flows,
> > > > > > so no concerns there.
> > > > > >
> > > > > > Have you tried this setup using memory channels instead of jdbc? Last time
> > > > > > I tested it, the JDBC channel had poor throughput, so you may be getting a
> > > > > > logjam somewhere. How much data is getting entered into your logfile? Try
> > > > > > raising the capacity on your jdbc channel by a lot (10000?). With a capacity
> > > > > > of 10, if the reading side (Host B) isn't polling frequently enough, there
> > > > > > are going to be problems. This is probably why you get the "failed to
> > > > > > persist event". As far as FLUME-1259 is concerned, that should only be
> > > > > > happening if bad data is being sent. You're not sending anything else to the
> > > > > > same port, are you? Make sure that only the source and sink are set to that
> > > > > > port and that nothing else is.
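
In configuration terms, that suggestion would be something like the following change to the jdbc
channel defined further down in this thread (a sketch; the value is just the one floated above):

agent3.channels.jdbc-channel.maximum.capacity = 10000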
> > > > > >
> > > > > > If the problem continues, please post a chunk of the logs leading up to
> > > > > > the OOM error (the full trace for the cause should be enough).
> > > > > >
> > > > > >
> > > > > > On 06/16/2012 12:01 AM, Bhaskar wrote:
> > > > > >
> > > > > > Sorry to be a pest with a stream of questions. I think I am going two
> > > > > > steps forward and four steps back :-). After my first successful attempt, I
> > > > > > tried running flume with the following flow:
> > > > > >
> > > > > > 1. HostA
> > > > > > -- Source is tail web server log
> > > > > > -- channel jdbc
> > > > > > -- sink is AVRO collection on Host B
> > > > > > Configuration:
> > > > > > agent3.sources = tail
> > > > > > agent3.sinks = avro-forward-sink
> > > > > > agent3.channels = jdbc-channel
> > > > > >
> > > > > > # Define source flow
> > > > > > agent3.sources.tail.type = exec
> > > > > > agent3.sources.tail.command = tail -f /common/log/access.log
> > > > > > agent3.sources.tail.channels = jdbc-channel
> > > > > >
> > > > > > # define the flow
> > > > > > agent3.sinks.avro-forward-sink.channel = jdbc-channel
> > > > > >
> > > > > > # avro sink properties
> > > > > > agent3.sources.avro-forward-sink.type = avro
> > > > > > agent3.sources.avro-forward-sink.hostname = <<IP Address>>
> > > > > > agent3.sources.avro-forward-sink.port = <<PORT>>
> > > > > >
> > > > > > # Define channels
> > > > > > agent3.channels.jdbc-channel.type = jdbc
> > > > > > agent3.channels.jdbc-channel.maximum.capacity = 10
> > > > > > agent3.channels.jdbc-channel.maximum.connections = 2
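
Note that the avro sink properties in the block above are keyed under agent3.sources; a sketch of
that sink section keyed under agent3.sinks instead (same placeholders as the original) would be:

agent3.sinks.avro-forward-sink.type = avro
agent3.sinks.avro-forward-sink.hostname = <<IP Address>>
agent3.sinks.avro-forward-sink.port = <<PORT>>
agent3.sinks.avro-forward-sink.channel = jdbc-channel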
> > > > > >
> > > > > >
> > > > > > 2. HostB
> > > > > > -- Source is AVRO collection
> > > > > > -- channel is memory
> > > > > > -- sink is local file system
> > > > > >
> > > > > > Configuration:
> > > > > > # list sources, sinks and channels in the agent4
> > > > > > agent4.sources = avro-collection-source
> > > > > > agent4.sinks = svc_0_sink
> > > > > > agent4.channels = MemoryChannel-2
> > > > > >
> > > > > > # define the flow
> > > > > > agent4.sources.avro-collection-source.channels = MemoryChannel-2
> > > > > > agent4.sinks.svc_0_sink.channel = MemoryChannel-2
> > > > > >
> > > > > > # avro sink properties
> > > > > > agent4.sources.avro-collection-source.type = avro
> > > > > > agent4.sources.avro-collection-source.bind = <<IP Address>>
> > > > > > agent4.sources.avro-collection-source.port = <<PORT>>
> > > > > >
> > > > > > agent4.sinks.svc_0_sink.type = FILE_ROLL
> > > > > > agent4.sinks.svc_0_sink.sink.directory=/logs/agent4
> > > > > > agent4.sinks.svc_0_sink.rollInterval=600
> > > > > > agent4.sinks.svc_0_sink.channel = MemoryChannel-2
> > > > > >
> > > > > > agent4.channels.MemoryChannel-2.type = memory
> > > > > > agent4.channels.MemoryChannel-2.capacity = 100
> > > > > > agent4.channels.MemoryChannel-2.transactionCapacity = 10
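
For reference, the fix described at the top of this thread ("a capacity of 10000 and commented out
transactionCapacity") would amount to something like the following change to the channel section
above (a sketch; which channel's capacity was actually raised isn't spelled out in the thread):

agent4.channels.MemoryChannel-2.type = memory
agent4.channels.MemoryChannel-2.capacity = 10000
# transactionCapacity omitted, leaving it at its default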
> > > > > >
> > > > > >
> > > > > > Basically I am trying to tail a file on one host and stream it to another
> > > > > > host running the sink. During the trial run, the configuration is loaded
> > > > > > fine and I see the channels created fine. I see an exception from the jdbc
> > > > > > channel first (Failed to persist event). I am getting a Java heap space OOM
> > > > > > exception from Host B when Host A attempts to write.
> > > > > >
> > > > > > 2012-06-15 10:31:44,503 WARN ipc.NettyServer: Unexpected exception from downstream.
> > > > > > java.lang.OutOfMemoryError: Java heap space
> > > > > >
> > > > > > This issue was already reported in https://issues.apache.org/jira/browse/FLUME-1259,
> > > > > > but I am not sure if there is a workaround to this problem. I have a couple of questions:
> > > > > >
> > > > > > 1. Am I force-fitting a wrong solution here by using AVRO?
> > > > > > 2. If so, what would be the right solution for streaming data from Host A
> > > > > > to Host B (or through intermediaries)?
> > > > > >
> > > > > > Thanks,
> > > > > > Bhaskar
> > > > > >
> > > > > > On Thu, Jun 14, 2012 at 4:31 PM, Mohammad Tariq <dontariq@gmail.com> wrote:
> > > > > > >
> > > > > > > Since you are thinking of using a multi-hop flow, I would suggest going
> > > > > > > for the "JDBC Channel", as there is a higher chance of error than in a
> > > > > > > single-hop flow, and in the JDBC Channel events are stored in persistent
> > > > > > > storage that's backed by a database. For detailed guidelines you can refer
> > > > > > > to the Flume 1.x User Guide at:
> > > > > > >
> > > > > > > https://people.apache.org/~mpercy/flume/flume-1.2.0-incubating-SNAPSHOT/docs/FlumeUserGuide.html
> > > > > > >
> > > > > > > Regards,
> > > > > > > Mohammad Tariq
> > > > > > >
> > > > > > >
> > > > > > > On Fri, Jun 15, 2012 at 12:46 AM, Bhaskar <bmarthi@gmail.com> wrote:
> > > > > > > > Hi Mohammad,
> > > > > > > > Thanks for the pointer there. Do you think using a message queue (like
> > > > > > > > RabbitMQ) would be a good choice of communication channel between each
> > > > > > > > hop? I am struggling to get a handle on how I need to configure my sink
> > > > > > > > in the intermediary hops of a multi-hop flow. I'd appreciate any
> > > > > > > > guidance/examples.
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > > Bhaskar
> > > > > > > >
> > > > > > > >
> > > > > > > > On Thu, Jun 14, 2012 at 1:57 PM, Mohammad Tariq <dontariq@gmail.com> wrote:
> > > > > > > > >
> > > > > > > > > Hello Bhaskar,
> > > > > > > > >
> > > > > > > > > That's great. The best approach to stream logs depends upon the type
> > > > > > > > > of source you want to watch. Looking at your use case, I would suggest
> > > > > > > > > going for "multi-hop" flows, where events travel through multiple
> > > > > > > > > agents before reaching the final destination.
> > > > > > > > >
> > > > > > > > > Regards,
> > > > > > > > > Mohammad Tariq
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > On Thu, Jun 14, 2012 at 10:48 PM, Bhaskar <bmarthi@gmail.com> wrote:
> > > > > > > > > > I know what I am missing :-) I missed connecting the sink with the
> > > > > > > > > > channel. My small POC works now and I am able to view the streamed
> > > > > > > > > > logs. Thank you all for the guidance and patience in answering all my
> > > > > > > > > > questions. So, what's the best approach to stream logs from other
> > > > > > > > > > hosts? Basically my next task would be to set up a collector (sort of)
> > > > > > > > > > model to stream logs to an intermediary and then stream from the
> > > > > > > > > > collector to a sink location. I'd appreciate any thoughts/guidance in
> > > > > > > > > > this regard.
> > > > > > > > > >
> > > > > > > > > > Bhaskar
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > On Thu, Jun 14, 2012 at 12:52 PM, Bhaskar <bmarthi@gmail.com> wrote:
> > > > > > > > > > >
> > > > > > > > > > > For testing purposes, I tried the following configuration without
> > > > > > > > > > > much luck. I see that the process started fine, but it just does not
> > > > > > > > > > > write anything to the sink. I guess I am missing something here. Can
> > > > > > > > > > > one of you gurus take a look and suggest what I am doing wrong?
> > > > > > > > > > >
> > > > > > > > > > > Thanks,
> > > > > > > > > > > Bhaskar
> > > > > > > > > > >
> > > > > > > > > > > agent1.sources = tail
> > > > > > > > > > > agent1.channels = MemoryChannel-2
> > > > > > > > > > > agent1.sinks = svc_0_sink
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > agent1.sources.tail.type = exec
> > > > > > > > > > > agent1.sources.tail.command = tail -f /var/log/access.log
> > > > > > > > > > > agent1.sources.tail.channels = MemoryChannel-2
> > > > > > > > > > >
> > > > > > > > > > > agent1.sinks.svc_0_sink.type = FILE_ROLL
> > > > > > > > > > > agent1.sinks.svc_0_sink.sink.directory=/flume_runtime/logs
> > > > > > > > > > > agent1.sinks.svc_0_sink.rollInterval=0
> > > > > > > > > > >
> > > > > > > > > > > agent1.channels.MemoryChannel-2.type = memory
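
The missing piece here, which the reply further up this thread also points to ("I missed
connecting the sink with the channel"), is the one line binding the sink to the channel:

agent1.sinks.svc_0_sink.channel = MemoryChannel-2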
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > On Thu, Jun 14, 2012 at 4:26 AM, Guillaume Polaert <gpolaert@cyres.fr> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > Hi Bhaskar,
> > > > > > > > > > > >
> > > > > > > > > > > > This is the flume.conf (http://pastebin.com/WULgUuaf) that I'm using.
> > > > > > > > > > > > I have an avro server on the hadoop-m host and one agent per node
> > > > > > > > > > > > (the slave hosts). Each agent sends the output of an exec command to
> > > > > > > > > > > > the avro server.
> > > > > > > > > > > >
> > > > > > > > > > > > Host1 : exec -> memory -> avro (sink)
> > > > > > > > > > > > Host2 : exec -> memory -> avro
> > > > > > > > > > > >     >>>>>  MainHost : avro (source) -> memory -> rolling file (local FS)
> > > > > > > > > > > >     ...
> > > > > > > > > > > > Host3 : exec -> memory -> avro
> > > > > > > > > > > >
> > > > > > > > > > > > Use your own exec command to read the Apache log.
> > > > > > > > > > > >
> > > > > > > > > > > > Guillaume Polaert | Cyrès Conseil
> > > > > > > > > > > >
> > > > > > > > > > > > From: Bhaskar [mailto:bmarthi@gmail.com]
> > > > > > > > > > > > Sent: Wednesday, June 13, 2012 19:16
> > > > > > > > > > > > To: flume-user@incubator.apache.org
> > > > > > > > > > > > Subject: Newbee question about flume 1.2 set up
> > > > > > > > > > > >
> > > > > > > > > > > > Good Afternoon,
> > > > > > > > > > > > I am a newbie to Flume and have read through the limited
> > > > > > > > > > > > documentation available. I would like to set up the following to
> > > > > > > > > > > > test it out:
> > > > > > > > > > > >
> > > > > > > > > > > > 1. Read Apache access logs (as the source)
> > > > > > > > > > > > 2. Use a memory channel
> > > > > > > > > > > > 3. Write to an NFS (or even local) file system
> > > > > > > > > > > >
> > > > > > > > > > > > Can someone help me with the necessary configuration? I am having a
> > > > > > > > > > > > difficult time gleaning that information from the available
> > > > > > > > > > > > documentation. I am sure someone has done such a test before, and
> > > > > > > > > > > > I'd appreciate it if you can pass on that information. Secondly, I
> > > > > > > > > > > > would also like to stream the logs to a remote server. Is that a
> > > > > > > > > > > > log4j configuration, or do I need to run an agent on each host to do
> > > > > > > > > > > > so? Any configuration examples would be of great help.
> > > > > > > > > > > >
> > > > > > > > > > > > Thanks,
> > > > > > > > > > > > Bhaskar
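
On the log4j part of that question, one option (a rough sketch, not something covered in this
thread; the class and property names are taken from the flume-ng-log4jappender module as
documented in the 1.x user guide, and the host/port are placeholders) is to point a log4j
appender directly at an agent's Avro source:

log4j.appender.flume = org.apache.flume.clients.log4jappender.Log4jAppender
log4j.appender.flume.Hostname = <<collector host>>
log4j.appender.flume.Port = <<avro source port>>
log4j.rootLogger = INFO, flume

The other option, which the rest of this thread follows, is to run a small agent on each host with
an exec/tail source and an Avro sink.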
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> 
