flume-user mailing list archives

From Sandeep Khurana <skhurana...@gmail.com>
Subject Re: performances tuning...
Date Wed, 03 Sep 2014 07:38:49 GMT
I see that you have the settings below set to zero. Don't you want rolling to
HDFS to happen based on any of size, count, or time interval?

test.sinks.s1.hdfs.rollSize = 0
test.sinks.s1.hdfs.rollCount = 0
test.sinks.s1.hdfs.rollInterval = 0
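
(For reference, a sketch of what size-based rolling would look like; the 128 MB
value is just illustrative, not taken from your config. With rollSize non-zero
and the other two left at 0, the sink rolls a file only once it reaches that
size:)

# illustrative only: roll roughly every 128 MB, never by event count or time
test.sinks.s1.hdfs.rollSize = 134217728
test.sinks.s1.hdfs.rollCount = 0
test.sinks.s1.hdfs.rollInterval = 0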


On Wed, Sep 3, 2014 at 1:06 PM, Sebastiano Di Paola <
sebastiano.dipaola@gmail.com> wrote:

> Hi Paul,
> thanks for your answer.
> As I'm a Flume newbie, how can I attach multiple sinks to the same
> channel? (Do they read data in a round-robin fashion from the memory
> channel?)
> (Does this create multiple files on HDFS? That is not what I'm expecting:
> I have a 500 MB data file at the source and I would like to end up with
> only one file on HDFS.)
>
> I can't believe that I cannot achieve such performance with a single
> sink. I'm pretty sure it's a configuration issue!
> Besides this, how do I tune the batchSize parameter? (Of course I have
> already tried setting it to something like 10 times the value in my config,
> but with no relevant improvement.)
> Regards.
> Seba
>
>
> On Wed, Sep 3, 2014 at 9:11 AM, Paul Chavez <pchavez@ntent.com> wrote:
>
>>  Start adding additional HDFS sinks attached to the same channel. You
>> can also tune batch sizes when writing to HDFS to increase per-sink
>> performance.
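>>
>> A rough sketch of what that could look like (the second sink name, file
>> prefixes and batch sizes here are illustrative, not a recommendation):
>> each sink drains the same channel independently and writes its own files,
>> so give them distinct file prefixes:
>>
>> test.sinks = s1 s2
>> test.sinks.s1.channel = c1
>> test.sinks.s2.channel = c1
>> test.sinks.s1.type = hdfs
>> test.sinks.s2.type = hdfs
>> test.sinks.s1.hdfs.path = hdfs://mynodemanager:9000/flume/events/
>> test.sinks.s2.hdfs.path = hdfs://mynodemanager:9000/flume/events/
>> test.sinks.s1.hdfs.filePrefix = log-data-1
>> test.sinks.s2.hdfs.filePrefix = log-data-2
>> test.sinks.s1.hdfs.batchSize = 10000
>> test.sinks.s2.hdfs.batchSize = 10000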
>>
>> On Sep 2, 2014, at 11:54 PM, "Sebastiano Di Paola" <
>> sebastiano.dipaola@gmail.com> wrote:
>>
>>   Hi there,
>> I'm a complete newbie with Flume, so I probably made a mistake in my
>> configuration, but I cannot pin it down.
>> I want to achieve maximum transfer performance.
>> My Flume machine has 16 GB RAM and 8 cores.
>> I'm using a very simple Flume architecture:
>> Source -> Memory Channel -> Sink
>> Source is of type netcat
>> and Sink is hdfs
>> The machine has 1Gb ethernet directly connected to the switch of the
>> hadoop cluster.
>> The point is that Flume is very slow at loading the data into my HDFS
>> filesystem.
>> (i.e. using hdfs dfs -copyFromLocal myfile /flume/events/myfile from
>> the same machine I reach a transfer rate of approximately 250 Mb/s, while
>> transferring the same file with this Flume architecture is more like 2-3 Mb/s).
>> (The cluster is composed of 10 machines and was totally idle while I did
>> this test, so it was not under stress; the traffic rate was measured on the
>> Flume machine's output interface in both experiments.)
>> (myfile has 10 million lines with an average size of 150 bytes each.)
>>
>>  From what I have understood so far, it doesn't seem to be a source issue,
>> as the memory channel tends to fill up if I decrease the channel capacity
>> (and even making it very big does not affect sink performance), so it seems
>> to me that the problem is related to the sink.
>> To test this point I've also tried changing the source to the "exec" type
>> and simply executing "cat myfile", but the result hasn't changed.
>>
>>
>>  Here's the config I'm using...
>>
>>   # list the sources, sinks and channels for the agent
>> test.sources = r1
>> test.channels = c1
>>  test.sinks = s1
>>
>>  # exec attempt
>> test.sources.r1.type = exec
>> test.sources.r1.command = cat /tmp/myfile
>>
>>  # my netcat attempt
>> #test.sources.r1.type = netcat
>> #test.sources.r1.bind = localhost
>> #test.sources.r1.port = 6666
>>
>>  # my file channel attempt
>> #test.channels.c1.type = file
>>
>> #my memory channel attempt
>> test.channels.c1.type = memory
>> test.channels.c1.capacity = 1000000
>> test.channels.c1.transactionCapacity = 10000
>>
>>  # How do I properly set these parameters? Even when I enable them, nothing
>> # changes in my performance. (What is the buffer percentage used for?)
>> #test.channels.c1.byteCapacityBufferPercentage = 50
>> #test.channels.c1.byteCapacity = 100000000
>>
>>  # set channel for source
>> test.sources.r1.channels = c1
>> # set channel for sink
>> test.sinks.s1.channel = c1
>>
>>  test.sinks.s1.type = hdfs
>> test.sinks.s1.hdfs.useLocalTimeStamp = true
>>
>>  test.sinks.s1.hdfs.path = hdfs://mynodemanager:9000/flume/events/
>> test.sinks.s1.hdfs.filePrefix = log-data
>> test.sinks.s1.hdfs.inUseSuffix = .dat
>>
>>  # How should I set this parameter? (I basically want to send as much data
>> # as I can; see the note at the end of this config.)
>> test.sinks.s1.hdfs.batchSize = 10000
>>
>> #test.sinks.s1.hdfs.round = true
>> #test.sinks.s1.hdfs.roundValue = 5
>> #test.sinks.s1.hdfs.roundUnit = minute
>>
>> test.sinks.s1.hdfs.rollSize = 0
>> test.sinks.s1.hdfs.rollCount = 0
>> test.sinks.s1.hdfs.rollInterval = 0
>>
>> # compression attempt
>> #test.sinks.s1.hdfs.fileType = CompressedStream
>> #test.sinks.s1.hdfs.codeC=gzip
>> #test.sinks.s1.hdfs.codeC=BZip2Codec
>> #test.sinks.s1.hdfs.callTimeout = 120000
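>>
>> # A general tuning note (an assumption about typical Flume tuning, not
>> # something verified on this setup): the HDFS sink takes up to batchSize
>> # events from the channel in a single transaction, so batchSize should stay
>> # at or below the channel's transactionCapacity. Raising both together is
>> # one way to move more data per transaction, e.g.:
>> #test.channels.c1.transactionCapacity = 100000
>> #test.sinks.s1.hdfs.batchSize = 100000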
>>
>>  Can someone show me how to find this bottleneck / configuration mistake?
>> (I can't believe this is the performance Flume gives on my machine.)
>>
>>  Thanks a lot if you can help me
>> Regards.
>> Sebastiano
>>
>>
>>
>


-- 
Thanks and regards
Sandeep Khurana
