Subject: Re: performances tuning...
From: Sebastiano Di Paola <sebastiano.dipaola@gmail.com>
To: user@flume.apache.org
Date: Wed, 3 Sep 2014 09:36:22 +0200

Hi Paul,
thanks for your answer.
As I'm a newbie to Flume, how can I attach multiple sinks to the same
channel? (Do they read data from the memory channel in a round-robin
fashion?)
(Does this create multiple files on HDFS? That is not what I'm expecting
to get: I have a 500 MB data file at the source and I would like to end
up with only one file on HDFS.)

I can't believe that I cannot achieve such performance with a single
sink. I'm pretty sure it's a configuration issue!
Besides this, how do I tune the batchSize parameter? (Of course I have
already tried setting it to about 10 times the value in my config, but
with no relevant improvement.)
Regards.
Seba

On Wed, Sep 3, 2014 at 9:11 AM, Paul Chavez <pchavez@ntent.com> wrote:

> Start adding additional HDFS sinks attached to the same channel. You can
> also tune batch sizes when writing to HDFS to increase per-sink
> performance.
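A minimal sketch of what Paul's suggestion could look like in an agent
file, reusing the agent, channel, and sink names from the config quoted
below (the second sink s2, the per-sink filePrefix values, and the batch
sizes are illustrative, not from the thread):

test.sinks = s1 s2

# Both sinks drain the same channel. Each event is taken from the
# channel by exactly one sink, so the sinks run in parallel and no
# data is duplicated.
test.sinks.s1.channel = c1
test.sinks.s2.channel = c1

test.sinks.s1.type = hdfs
test.sinks.s1.hdfs.path = hdfs://mynodemanager:9000/flume/events/
test.sinks.s1.hdfs.filePrefix = log-data-s1
test.sinks.s1.hdfs.batchSize = 10000

test.sinks.s2.type = hdfs
test.sinks.s2.hdfs.path = hdfs://mynodemanager:9000/flume/events/
test.sinks.s2.hdfs.filePrefix = log-data-s2
test.sinks.s2.hdfs.batchSize = 10000

Note that each HDFS sink writes its own files, so two sinks mean at
least two output files; the distinct filePrefix values keep their names
from colliding. If a single output file is a hard requirement, the parts
would have to be merged afterwards (e.g. with hadoop fs -cat or a
downstream job).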
> On Sep 2, 2014, at 11:54 PM, "Sebastiano Di Paola" <
> sebastiano.dipaola@gmail.com> wrote:
>
> Hi there,
> I'm a complete newbie to Flume, so I have probably made a mistake in my
> configuration, but I cannot spot it.
> I want to achieve maximum transfer performance.
> My Flume machine has 16 GB RAM and 8 cores.
> I'm using a very simple Flume architecture:
> Source -> Memory Channel -> Sink
> The source is of type netcat and the sink is hdfs.
> The machine has a 1 Gb Ethernet link directly connected to the switch
> of the Hadoop cluster.
> The point is that Flume is very slow at loading the data into my HDFS
> filesystem.
> (I.e. using hdfs dfs -copyFromLocal myfile /flume/events/myfile from
> the same machine I reach approx. 250 Mb/s, while transferring the same
> file through this Flume architecture runs at 2-3 Mb/s.)
> (The cluster is composed of 10 machines and was totally idle while I
> did this test, so it was not under stress.) (The traffic rate was
> measured on the Flume machine's output interface in both experiments.)
> (myfile has 10 million lines averaging 150 bytes each.)
>
> From what I understand so far, it doesn't seem to be a source issue, as
> the memory channel tends to fill up if I decrease the channel capacity
> (but even making it very big does not affect sink performance), so it
> seems to me that the problem is related to the sink.
> To test this point I've also tried changing the source to the "exec"
> type, simply executing "cat myfile", but the result hasn't changed...
>
> Here's the config I used:
>
> # list the sources, sinks and channels for the agent
> test.sources = r1
> test.channels = c1
> test.sinks = s1
>
> # exec attempt
> test.sources.r1.type = exec
> test.sources.r1.command = cat /tmp/myfile
>
> # my netcat attempt
> #test.sources.r1.type = netcat
> #test.sources.r1.bind = localhost
> #test.sources.r1.port = 6666
>
> # my file channel attempt
> #test.channels.c1.type = file
>
> # my memory channel attempt
> test.channels.c1.type = memory
> test.channels.c1.capacity = 1000000
> test.channels.c1.transactionCapacity = 10000
>
> # how to properly set these parameters? even if I enable them, nothing
> # changes in my performance (what is the buffer percentage used for?)
> #test.channels.c1.byteCapacityBufferPercentage = 50
> #test.channels.c1.byteCapacity = 100000000
>
> # set channel for source
> test.sources.r1.channels = c1
> # set channel for sink
> test.sinks.s1.channel = c1
>
> test.sinks.s1.type = hdfs
> test.sinks.s1.hdfs.useLocalTimeStamp = true
>
> test.sinks.s1.hdfs.path = hdfs://mynodemanager:9000/flume/events/
> test.sinks.s1.hdfs.filePrefix = log-data
> test.sinks.s1.hdfs.inUseSuffix = .dat
>
> # how to set this parameter? (I basically want to send as much data as
> # I can)
> test.sinks.s1.hdfs.batchSize = 10000
>
> #test.sinks.s1.hdfs.round = true
> #test.sinks.s1.hdfs.roundValue = 5
> #test.sinks.s1.hdfs.roundUnit = minute
>
> test.sinks.s1.hdfs.rollSize = 0
> test.sinks.s1.hdfs.rollCount = 0
> test.sinks.s1.hdfs.rollInterval = 0
>
> # compression attempt
> #test.sinks.s1.hdfs.fileType = CompressedStream
> #test.sinks.s1.hdfs.codeC = gzip
> #test.sinks.s1.hdfs.codeC = BZip2Codec
> #test.sinks.s1.hdfs.callTimeout = 120000
>
> Can someone show me how to find this bottleneck / configuration
> mistake? (I can't believe this is Flume's performance on my machine.)
>
> Thanks a lot if you can help me.
> Regards.
> Sebastiano
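On the byteCapacity questions embedded in the quoted config: as I
understand the memory channel, byteCapacity caps the total bytes of
event bodies the channel may hold, and byteCapacityBufferPercentage
reserves a percentage of that cap as headroom for event headers, which
are not counted directly (the default is 20). A sketch with
illustrative values:

test.channels.c1.type = memory
test.channels.c1.capacity = 1000000
test.channels.c1.transactionCapacity = 10000
# Cap event-body memory at ~100 MB. With a 20% buffer percentage,
# roughly 80 MB is usable for event bodies and the remaining 20 MB is
# reserved as headroom for per-event header overhead.
test.channels.c1.byteCapacity = 100000000
test.channels.c1.byteCapacityBufferPercentage = 20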