flume-user mailing list archives

From Sebastiano Di Paola <sebastiano.dipa...@gmail.com>
Subject Re: performances tuning...
Date Wed, 03 Sep 2014 09:18:21 GMT
I raised batchSize by a factor of 100, added more heap space, and the speed
increased...
Still not as fast as "hdfs dfs -copyFromLocal", but I'm
pretty sure it's a tuning problem.
Thanks a lot for your hint.
Regards
Seba
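
P.S. For the record, the changes amounted to roughly the following (values
are illustrative, not a definitive recipe; note that the channel's
transactionCapacity must be at least as large as the sink's batchSize):

```properties
# flume.conf: HDFS sink batch size raised by a factor of 100
test.sinks.s1.hdfs.batchSize = 1000000
# the channel must accept transactions at least that large
test.channels.c1.transactionCapacity = 1000000
```

plus a larger heap for the agent JVM (e.g. `-Xmx` raised in JAVA_OPTS in
flume-env.sh; the exact size depends on the channel capacity).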


On Wed, Sep 3, 2014 at 9:55 AM, Sandeep Khurana <skhurana333@gmail.com>
wrote:

> Since you mentioned "average size of 150 bytes each" is the size of each
> record, I would try increasing the batch size to a higher value.
>
>
> "HDFS batch size determines the number of events to take from the channel and
> send in one go."
>
> So in one shot you are sending 10,000 events × ~150 bytes ≈ 1.5 MB to HDFS.
>
>
> On Wed, Sep 3, 2014 at 1:18 PM, Sebastiano Di Paola <
> sebastiano.dipaola@gmail.com> wrote:
>
>> In my experiment I just want to transfer a single file... just to test
>> what performance I can achieve...
>> so rolling files on HDFS is not vital at this point.
>> Anyway, I did some tests rolling the file every 300 seconds.
>> What I can't explain to myself is the "slow" output from the sink... the
>> memory channel overflows if it's not big enough, so it seems that the
>> source is able to produce data at a higher rate than the sink is able to
>> process and send to HDFS.
>> I'm not sure if it helps to pinpoint my "configuration mistake", but
>> I'm using Flume 1.5.0.1 (I also tried Flume 1.5.0).
>> Regards.
>> Seba
>>
>>
>> On Wed, Sep 3, 2014 at 9:38 AM, Sandeep Khurana <skhurana333@gmail.com>
>> wrote:
>>
>>> I see that you have the settings below set to zero. Don't you want
>>> rolling to HDFS to happen based on any of size, count, or time interval?
>>>
>>> test.sinks.s1.hdfs.rollSize = 0
>>> test.sinks.s1.hdfs.rollCount = 0
>>> test.sinks.s1.hdfs.rollInterval = 0
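>>>
>>> With all three at zero, rolling is disabled entirely. For instance,
>>> time-based rolling every 300 seconds (values illustrative) would be:

```properties
# roll to a new HDFS file every 300 seconds
test.sinks.s1.hdfs.rollInterval = 300
# keep size- and count-based rolling disabled
test.sinks.s1.hdfs.rollSize = 0
test.sinks.s1.hdfs.rollCount = 0
```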
>>>
>>>
>>> On Wed, Sep 3, 2014 at 1:06 PM, Sebastiano Di Paola <
>>> sebastiano.dipaola@gmail.com> wrote:
>>>
>>>> Hi Paul,
>>>> thanks for your answer.
>>>> As I'm a Flume newbie, how can I attach multiple sinks to the same
>>>> channel? (Do they read data from the memory channel in a round-robin
>>>> fashion?)
>>>> (Does this create multiple files on HDFS? That is not what I'm
>>>> expecting to have: I have a 500 MB data file at the source and I would
>>>> like to have only one file on HDFS.)
>>>>
>>>> I can't believe that I cannot achieve such performance with a single
>>>> sink. I'm pretty sure it's a configuration issue!
>>>> Besides this, how do I tune the batchSize parameter? (Of course I have
>>>> already tried setting it to 10 times the number in my config, but with
>>>> no relevant improvement.)
>>>> Regards.
>>>> Seba
>>>>
>>>>
>>>> On Wed, Sep 3, 2014 at 9:11 AM, Paul Chavez <pchavez@ntent.com> wrote:
>>>>
>>>>>  Start adding additional HDFS sinks attached to the same channel. You
>>>>> can also tune batch sizes when writing to HDFS to increase per sink
>>>>> performance.
>>>>>
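>>>>> A sketch of what that could look like (names illustrative; each sink
>>>>> drains the shared channel independently and writes its own files):

```properties
# two HDFS sinks attached to the same channel c1
test.sinks = s1 s2

test.sinks.s1.type = hdfs
test.sinks.s1.channel = c1
test.sinks.s1.hdfs.path = hdfs://mynodemanager:9000/flume/events/
test.sinks.s1.hdfs.filePrefix = log-data-s1

test.sinks.s2.type = hdfs
test.sinks.s2.channel = c1
test.sinks.s2.hdfs.path = hdfs://mynodemanager:9000/flume/events/
test.sinks.s2.hdfs.filePrefix = log-data-s2
```

>>>>> Note that this does split the output across multiple files, one set
>>>>> per sink.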
>>>>> On Sep 2, 2014, at 11:54 PM, "Sebastiano Di Paola" <
>>>>> sebastiano.dipaola@gmail.com> wrote:
>>>>>
>>>>>   Hi there,
>>>>> I'm a complete newbie to Flume, so I probably made a mistake in my
>>>>> configuration, but I cannot pinpoint it.
>>>>> I want to achieve maximum transfer performance.
>>>>> My Flume machine has 16 GB RAM and 8 cores.
>>>>> I'm using a very simple Flume architecture:
>>>>> Source -> Memory Channel -> Sink
>>>>> The source is of type netcat
>>>>> and the sink is hdfs.
>>>>> The machine has a 1 Gb Ethernet link directly connected to the switch
>>>>> of the Hadoop cluster.
>>>>> The point is that Flume is very slow loading the data into my HDFS
>>>>> filesystem.
>>>>> (I.e. using "hdfs dfs -copyFromLocal myfile /flume/events/myfile"
>>>>> from the same machine I reach approx. 250 Mb/s of transfer rate, while
>>>>> transferring the same file with this Flume architecture runs at
>>>>> 2-3 Mb/s.)
>>>>> (The cluster is composed of 10 machines and was totally idle while I
>>>>> did this test, so it was not under stress.) (The traffic rate was
>>>>> measured on the flume machine's output interface in both experiments.)
>>>>> (myfile has 10 million lines with an average size of 150 bytes each.)
>>>>>
>>>>>  From what I understand so far it doesn't seem to be a source issue,
>>>>> as the memory channel tends to fill up if I decrease the channel
>>>>> capacity (but even making it very big does not affect sink
>>>>> performance), so it seems to me that the problem is related to the sink.
>>>>> To test this point I also tried changing the source to the "exec"
>>>>> type, simply executing "cat myfile", but the result hasn't changed...
>>>>>
>>>>>
>>>>>  Here's my used config...
>>>>>
>>>>>   # list the sources, sinks and channels for the agent
>>>>> test.sources = r1
>>>>> test.channels = c1
>>>>>  test.sinks = s1
>>>>>
>>>>>  # exec attempt
>>>>> test.sources.r1.type = exec
>>>>> test.sources.r1.command = cat /tmp/myfile
>>>>>
>>>>>  # my netcat attempt
>>>>> #test.sources.r1.type = netcat
>>>>> #test.sources.r1.bind = localhost
>>>>> #test.sources.r1.port = 6666
>>>>>
>>>>>  # my file channel attempt
>>>>> #test.channels.c1.type = file
>>>>>
>>>>> #my memory channel attempt
>>>>> test.channels.c1.type = memory
>>>>> test.channels.c1.capacity = 1000000
>>>>> test.channels.c1.transactionCapacity = 10000
>>>>>
>>>>>  # how do I properly set these parameters? even if I enable them,
>>>>> nothing changes in my performance
>>>>> # (what is the buffer percentage used for?)
>>>>> #test.channels.c1.byteCapacityBufferPercentage = 50
>>>>> #test.channels.c1.byteCapacity = 100000000
>>>>>
>>>>>  # set channel for source
>>>>> test.sources.r1.channels = c1
>>>>> # set channel for sink
>>>>> test.sinks.s1.channel = c1
>>>>>
>>>>>  test.sinks.s1.type = hdfs
>>>>> test.sinks.s1.hdfs.useLocalTimeStamp = true
>>>>>
>>>>>  test.sinks.s1.hdfs.path = hdfs://mynodemanager:9000/flume/events/
>>>>> test.sinks.s1.hdfs.filePrefix = log-data
>>>>> test.sinks.s1.hdfs.inUseSuffix = .dat
>>>>>
>>>>>  # how do I set this parameter? (I basically want to send as much
>>>>> data as I can)
>>>>> test.sinks.s1.hdfs.batchSize = 10000
>>>>>
>>>>> #test.sinks.s1.hdfs.round = true
>>>>> #test.sinks.s1.hdfs.roundValue = 5
>>>>> #test.sinks.s1.hdfs.roundUnit = minute
>>>>>
>>>>> test.sinks.s1.hdfs.rollSize = 0
>>>>> test.sinks.s1.hdfs.rollCount = 0
>>>>> test.sinks.s1.hdfs.rollInterval = 0
>>>>>
>>>>> # compression attempt
>>>>> #test.sinks.s1.hdfs.fileType = CompressedStream
>>>>> #test.sinks.s1.hdfs.codeC=gzip
>>>>> #test.sinks.s1.hdfs.codeC=BZip2Codec
>>>>> #test.sinks.s1.hdfs.callTimeout = 120000
>>>>>
>>>>>  Can someone show me how to find this bottleneck / configuration
>>>>> mistake? (I can't believe this is Flume's performance on my machine.)
>>>>>
>>>>>  Thanks a lot if you can help me
>>>>> Regards.
>>>>> Sebastiano
>>>>>
>>>>>
>>>>>
>>>>
>>>
>>>
>>> --
>>> Thanks and regards
>>> Sandeep Khurana
>>>
>>
>>
>
>
> --
> Thanks and regards
> Sandeep Khurana
>
