flume-user mailing list archives

From Sebastiano Di Paola <sebastiano.dipa...@gmail.com>
Subject Re: performances tuning...
Date Wed, 03 Sep 2014 07:36:22 GMT
Hi Paul,
thanks for your answer.
As I'm a Flume newbie: how can I attach multiple sinks to the same
channel? (Do they read data from the memory channel in a round-robin
fashion?)
(And does this create multiple files on HDFS? That is not what I'm
expecting: I have a single 500 MB data file at the source and I would like
to end up with only one file on HDFS.)
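
Would something like the following be the right way to attach a second sink?
(Just my untested guess from reading the docs; the sink name s2 and the file
prefixes are invented by me.)

# two HDFS sinks draining the same memory channel
test.sinks = s1 s2
test.sinks.s1.channel = c1
test.sinks.s2.channel = c1

test.sinks.s1.type = hdfs
test.sinks.s1.hdfs.path = hdfs://mynodemanager:9000/flume/events/
test.sinks.s1.hdfs.filePrefix = log-data-1

test.sinks.s2.type = hdfs
test.sinks.s2.hdfs.path = hdfs://mynodemanager:9000/flume/events/
test.sinks.s2.hdfs.filePrefix = log-data-2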

I can't believe that I cannot achieve better performance with a single
sink; I'm pretty sure it's a configuration issue!
Besides this, how should I tune the batchSize parameter? (Of course I have
already tried setting it to 10 times the value in my config, but saw no
relevant improvement.)
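(If I understood the docs correctly, the sink's batchSize should not exceed
the channel's transactionCapacity, so when I raised it I raised both together,
e.g.:

test.channels.c1.transactionCapacity = 100000
test.sinks.s1.hdfs.batchSize = 100000

but throughput stayed the same.)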
Regards.
Seba


On Wed, Sep 3, 2014 at 9:11 AM, Paul Chavez <pchavez@ntent.com> wrote:

>  Start adding additional HDFS sinks attached to the same channel. You can
> also tune batch sizes when writing to HDFS to increase per sink performance.
>
> On Sep 2, 2014, at 11:54 PM, "Sebastiano Di Paola" <
> sebastiano.dipaola@gmail.com> wrote:
>
>   Hi there,
> I'm a complete newbie with Flume, so I have probably made a mistake in my
> configuration, but I cannot pin it down.
> I want to achieve maximum transfer performance.
> My Flume machine has 16 GB of RAM and 8 cores.
> I'm using a very simple Flume architecture:
> Source -> Memory Channel -> Sink
> Source is of type netcat
> and Sink is hdfs
> The machine has a 1 Gb/s Ethernet link directly connected to the switch of
> the Hadoop cluster.
> The point is that Flume is very slow at loading the data into my HDFS
> filesystem.
> (i.e. running hdfs dfs -copyFromLocal myfile /flume/events/myfile from
> the same machine I reach approx. 250 Mb/s, while transferring the same
> file through this Flume setup runs at 2-3 Mb/s.)
> (The cluster is composed of 10 machines and was totally idle while I ran
> this test, so it was not under stress; the traffic rate was measured on the
> Flume machine's output interface in both experiments.)
> (myfile has 10 million lines averaging 150 bytes each.)
>
>  From what I have understood so far, it doesn't seem to be a source issue,
> as the memory channel tends to fill up if I decrease the channel capacity
> (yet making it very big does not improve sink performance), so it seems to
> me the problem is on the sink side.
> To test this point I also tried changing the source to the "exec" type,
> simply executing "cat myfile", but the result didn't change.
>
>
>  Here's the config I used...
>
>   # list the sources, sinks and channels for the agent
> test.sources = r1
> test.channels = c1
>  test.sinks = s1
>
>  # exec attempt
> test.sources.r1.type = exec
> test.sources.r1.command = cat /tmp/myfile
>
>  # my netcat attempt
> #test.sources.r1.type = netcat
> #test.sources.r1.bind = localhost
> #test.sources.r1.port = 6666
>
>  # my file channel attempt
> #test.channels.c1.type = file
>
> #my memory channel attempt
> test.channels.c1.type = memory
> test.channels.c1.capacity = 1000000
> test.channels.c1.transactionCapacity = 10000
>
>  # How do I properly set these parameters? Even if I enable them, nothing
> # changes in my performance. (What is the buffer percentage used for?)
> #test.channels.c1.byteCapacityBufferPercentage = 50
> #test.channels.c1.byteCapacity = 100000000
>
>  # set channel for source
> test.sources.r1.channels = c1
> # set channel for sink
> test.sinks.s1.channel = c1
>
>  test.sinks.s1.type = hdfs
> test.sinks.s1.hdfs.useLocalTimeStamp = true
>
>  test.sinks.s1.hdfs.path = hdfs://mynodemanager:9000/flume/events/
> test.sinks.s1.hdfs.filePrefix = log-data
> test.sinks.s1.hdfs.inUseSuffix = .dat
>
>  # How should I set this parameter? (I basically want to send as much data
> # as I can.)
> test.sinks.s1.hdfs.batchSize = 10000
>
> #test.sinks.s1.hdfs.round = true
> #test.sinks.s1.hdfs.roundValue = 5
> #test.sinks.s1.hdfs.roundUnit = minute
>
> test.sinks.s1.hdfs.rollSize = 0
> test.sinks.s1.hdfs.rollCount = 0
> test.sinks.s1.hdfs.rollInterval = 0
>
> # compression attempt
> #test.sinks.s1.hdfs.fileType = CompressedStream
> #test.sinks.s1.hdfs.codeC=gzip
> #test.sinks.s1.hdfs.codeC=BZip2Codec
> #test.sinks.s1.hdfs.callTimeout = 120000
>
>  Can someone show me how to find this bottleneck / configuration mistake?
> (I can't believe those are the real Flume performance numbers on my machine.)
>
>  Thanks a lot if you can help me
> Regards.
> Sebastiano
>
