Subject: Re: performances tuning...
From: Sebastiano Di Paola <sebastiano.dipaola@gmail.com>
To: user@flume.apache.org
Date: Wed, 3 Sep 2014 09:36:22 +0200

Hi Paul,
thanks for your answer.
As I'm a newbie to Flume, how can I attach multiple sinks to the same
channel? (Do they read data from the memory channel in a round-robin
fashion?)
(Does this create multiple files on HDFS? That is not what I'm expecting
to get: I have a 500 MB data file at the source and I would like to end
up with only one file on HDFS.)

I can't believe that I cannot achieve such performance with a single
sink. I'm pretty sure it's a configuration issue!
Besides this, how do I tune the batchSize parameter? (Of course I have
already tried setting it to about 10 times the value in my config, but
with no relevant improvement.)
Regards.
Seba

On Wed, Sep 3, 2014 at 9:11 AM, Paul Chavez <pchavez@ntent.com> wrote:

> Start adding additional HDFS sinks attached to the same channel. You can
> also tune batch sizes when writing to HDFS to increase per-sink
> performance.
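A minimal sketch of what Paul's suggestion could look like in an agent
file, reusing the agent, channel, and sink names from the config quoted
below (the second sink s2, the per-sink filePrefix values, and the batch
sizes are illustrative, not from the thread):

test.sinks = s1 s2

# Both sinks drain the same channel. Each event is taken from the
# channel by exactly one sink, so the sinks run in parallel and no
# data is duplicated.
test.sinks.s1.channel = c1
test.sinks.s2.channel = c1

test.sinks.s1.type = hdfs
test.sinks.s1.hdfs.path = hdfs://mynodemanager:9000/flume/events/
test.sinks.s1.hdfs.filePrefix = log-data-s1
test.sinks.s1.hdfs.batchSize = 10000

test.sinks.s2.type = hdfs
test.sinks.s2.hdfs.path = hdfs://mynodemanager:9000/flume/events/
test.sinks.s2.hdfs.filePrefix = log-data-s2
test.sinks.s2.hdfs.batchSize = 10000

Note that each HDFS sink writes its own files, so two sinks mean at
least two output files; the distinct filePrefix values keep their names
from colliding. If a single output file is a hard requirement, the parts
would have to be merged afterwards (e.g. with hadoop fs -cat or a
downstream job).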
> On Sep 2, 2014, at 11:54 PM, "Sebastiano Di Paola" <
> sebastiano.dipaola@gmail.com> wrote:
>
> Hi there,
> I'm a complete newbie to Flume, so I have probably made a mistake in my
> configuration, but I cannot spot it.
> I want to achieve maximum transfer performance.
> My Flume machine has 16 GB RAM and 8 cores.
> I'm using a very simple Flume architecture:
> Source -> Memory Channel -> Sink
> The source is of type netcat and the sink is hdfs.
> The machine has a 1 Gb Ethernet link directly connected to the switch
> of the Hadoop cluster.
> The point is that Flume is very slow at loading the data into my HDFS
> filesystem.
> (I.e. using hdfs dfs -copyFromLocal myfile /flume/events/myfile from
> the same machine I reach approx. 250 Mb/s, while transferring the same
> file through this Flume architecture runs at 2-3 Mb/s.)
> (The cluster is composed of 10 machines and was totally idle while I
> did this test, so it was not under stress.) (The traffic rate was
> measured on the Flume machine's output interface in both experiments.)
> (myfile has 10 million lines averaging 150 bytes each.)
>
> From what I understand so far, it doesn't seem to be a source issue, as
> the memory channel tends to fill up if I decrease the channel capacity
> (but even making it very big does not affect sink performance), so it
> seems to me that the problem is related to the sink.
> To test this point I've also tried changing the source to the "exec"
> type, simply executing "cat myfile", but the result hasn't changed...
>
> Here's the config I used:
>
> # list the sources, sinks and channels for the agent
> test.sources = r1
> test.channels = c1
> test.sinks = s1
>
> # exec attempt
> test.sources.r1.type = exec
> test.sources.r1.command = cat /tmp/myfile
>
> # my netcat attempt
> #test.sources.r1.type = netcat
> #test.sources.r1.bind = localhost
> #test.sources.r1.port = 6666
>
> # my file channel attempt
> #test.channels.c1.type = file
>
> # my memory channel attempt
> test.channels.c1.type = memory
> test.channels.c1.capacity = 1000000
> test.channels.c1.transactionCapacity = 10000
>
> # how to properly set these parameters? even if I enable them, nothing
> # changes in my performance (what is the buffer percentage used for?)
> #test.channels.c1.byteCapacityBufferPercentage = 50
> #test.channels.c1.byteCapacity = 100000000
>
> # set channel for source
> test.sources.r1.channels = c1
> # set channel for sink
> test.sinks.s1.channel = c1
>
> test.sinks.s1.type = hdfs
> test.sinks.s1.hdfs.useLocalTimeStamp = true
>
> test.sinks.s1.hdfs.path = hdfs://mynodemanager:9000/flume/events/
> test.sinks.s1.hdfs.filePrefix = log-data
> test.sinks.s1.hdfs.inUseSuffix = .dat
>
> # how to set this parameter? (I basically want to send as much data as
> # I can)
> test.sinks.s1.hdfs.batchSize = 10000
>
> #test.sinks.s1.hdfs.round = true
> #test.sinks.s1.hdfs.roundValue = 5
> #test.sinks.s1.hdfs.roundUnit = minute
>
> test.sinks.s1.hdfs.rollSize = 0
> test.sinks.s1.hdfs.rollCount = 0
> test.sinks.s1.hdfs.rollInterval = 0
>
> # compression attempt
> #test.sinks.s1.hdfs.fileType = CompressedStream
> #test.sinks.s1.hdfs.codeC = gzip
> #test.sinks.s1.hdfs.codeC = BZip2Codec
> #test.sinks.s1.hdfs.callTimeout = 120000
>
> Can someone show me how to find this bottleneck / configuration
> mistake? (I can't believe this is Flume's performance on my machine.)
>
> Thanks a lot if you can help me.
> Regards.
> Sebastiano
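On the byteCapacity questions embedded in the quoted config: as I
understand the memory channel, byteCapacity caps the total bytes of
event bodies the channel may hold, and byteCapacityBufferPercentage
reserves a percentage of that cap as headroom for event headers, which
are not counted directly (the default is 20). A sketch with
illustrative values:

test.channels.c1.type = memory
test.channels.c1.capacity = 1000000
test.channels.c1.transactionCapacity = 10000
# Cap event-body memory at ~100 MB. With a 20% buffer percentage,
# roughly 80 MB is usable for event bodies and the remaining 20 MB is
# reserved as headroom for per-event header overhead.
test.channels.c1.byteCapacity = 100000000
test.channels.c1.byteCapacityBufferPercentage = 20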