flume-user mailing list archives

From Jagadish Bihani <jagadish.bih...@pubmatic.com>
Subject Re: File Channel performance and fsync
Date Mon, 22 Oct 2012 13:18:49 GMT
Hi

This is the simple configuration with which I am getting lower
performance.
Even with a 2-tier architecture (cat source - avro sink - avro source -
HDFS sink)
I get similar performance with the file channel.

Configuration:
=========
adServerAgent.sources = avro-collection-source
adServerAgent.channels = fileChannel
adServerAgent.sinks = hdfsSink fileSink

# For each one of the sources, the type is defined
adServerAgent.sources.avro-collection-source.type=exec
adServerAgent.sources.avro-collection-source.command = cat /home/hadoop/file.tsf

# The channel can be defined as follows.
adServerAgent.sources.avro-collection-source.channels = fileChannel

#Define file sink
adServerAgent.sinks.fileSink.type = file_roll
adServerAgent.sinks.fileSink.sink.directory = /home/hadoop/flume_sink
adServerAgent.sinks.fileSink.channel = fileChannel
adServerAgent.channels.fileChannel.type=file
adServerAgent.channels.fileChannel.dataDirs=/home/hadoop/flume/channel/dataDir5
adServerAgent.channels.fileChannel.checkpointDir=/home/hadoop/flume/channel/checkpointDir5
adServerAgent.channels.fileChannel.maxFileSize=4000000000

And it is run with :
JAVA_OPTS = -Xms500m -Xmx700m -Dcom.sun.management.jmxremote 
-XX:MaxDirectMemorySize=2g
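(A side note on the batch-size suggestion further down in the thread: in
properties form it would look something like the sketch below. The property
names are from the Flume user guide; the values are only illustrative, not
settings I have verified on this setup.)

```
# Larger batches mean fewer fsync calls per event on the file channel.
adServerAgent.sinks.hdfsSink.hdfs.batchSize = 1000
# The channel transaction must be able to hold a full batch.
adServerAgent.channels.fileChannel.transactionCapacity = 1000
adServerAgent.channels.fileChannel.capacity = 1000000
```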

Regards,
Jagadish

On 10/22/2012 05:42 PM, Brock Noland wrote:
> Hi,
>
> I'll respond in more depth later, but it would help if you posted your 
> configuration file and the version of flume you are using.
>
> Brock
>
> On Mon, Oct 22, 2012 at 6:48 AM, Jagadish Bihani
> <jagadish.bihani@pubmatic.com> wrote:
>
>     Hi
>
>     I am writing this on top of another thread where there was
>     discussion of "fsync lies" and of the fact that only the file
>     channel uses fsync, not the file sink:
>
>     -- I tested fsync performance on 2 machines using the following
>     code. (On 1 machine I was getting very good throughput using the
>     file channel; on the other it was almost 100 times slower with
>     almost the same hardware configuration.)
>
>
>     #include <stdio.h>
>     #include <fcntl.h>
>     #include <unistd.h>
>     #include <sys/time.h>
>
>     #define PAGESIZE 4096
>
>     int main(int argc, char *argv[])
>     {
>             char my_read_str[PAGESIZE];
>             char *read_filename = argv[1];
>             int readfd, writefd;
>
>             readfd = open(read_filename, O_RDONLY);
>             writefd = open("written_file", O_WRONLY|O_CREAT, 0644);
>             int len = lseek(readfd, 0, SEEK_END);
>             lseek(readfd, 0, SEEK_SET);
>             int iterations = len/PAGESIZE;
>             int i;
>             struct timeval t0, t1;
>
>             for (i = 0; i < iterations; i++)
>             {
>                     read(readfd, my_read_str, PAGESIZE);
>                     write(writefd, my_read_str, PAGESIZE);
>                     gettimeofday(&t0, 0);
>                     fsync(writefd);           /* time only the fsync */
>                     gettimeofday(&t1, 0);
>                     long elapsed = (t1.tv_sec - t0.tv_sec)*1000000 +
>                                    t1.tv_usec - t0.tv_usec;
>                     printf("Elapsed time is= %ld \n", elapsed);
>             }
>             close(readfd);
>             close(writefd);
>             return 0;
>     }
>
>
>     -- As expected, fsync typically takes about 50000 microseconds to
>     complete on one machine; on the other machine it took about
>     200-290 microseconds on average. So is the machine with higher
>     performance doing an 'fsync lie'?
>     -- If I have understood it correctly, an "fsync lie" means the
>     data is not actually written to disk but sits in some
>     disk/controller buffer. I) Now if the disk loses power due to a
>     shutdown or some other disaster, that data will be lost. II) Can
>     data be lost even without that? (e.g. if the disk keeps data in
>     some buffer while fsync is invoked continuously, can that data
>     also be lost?) If only part I is true, that can be acceptable,
>     because the probability of a power loss is usually low in a
>     production environment. But if II is also true, then there is a
>     problem.
>
>     -- But on the machine where the disk doesn't lie, the performance
>     of flume using the file channel is very low (I have seen at most
>     100 KB/sec, even with sufficient DirectMemory allocation). Does
>     anybody have stats about the throughput of the file channel? Is
>     anybody getting better performance with the file channel (without
>     fsync lies)? What is its recommended usage for an average
>     scenario? (Transferring files of a few MB to an HDFS sink
>     continuously on typical hardware: 16-core processors, 16 GB RAM,
>     etc.)
>
>
>     Regards,
>     Jagadish
>
>     On 10/10/2012 11:30 PM, Brock Noland wrote:
>>     Hi,
>>
>>     On Wed, Oct 10, 2012 at 11:22 AM, Jagadish Bihani
>>     <jagadish.bihani@pubmatic.com> wrote:
>>>     Hi Brock
>>>
>>>     I will surely look into 'fsync lies'.
>>>
>>>     But as per my experiments I think the "file channel" is causing
>>>     the issue. Because on those 2 machines (one with higher
>>>     throughput and the other with lower) I did the following
>>>     experiment:
>>>
>>>     cat source - memory channel - file sink.
>>>
>>>     Now with this setup I got the same throughput on both machines
>>>     (around 3 MB/sec). Now as I have used the "File sink", it should
>>>     also do an "fsync" at some point.
>>>     'File Sink' and 'File Channel' both do disk writes. So if there
>>>     is a difference in disk behaviour, it should be visible even
>>>     with the 'File Sink'.
>>>
>>>     Am I missing something here?
>>     File sink does not call fsync.
>>
>>>     Regards,
>>>     Jagadish
>>>
>>>
>>>
>>>     On 10/10/2012 09:35 PM, Brock Noland wrote:
>>>>     OK your disk that is giving you 40KB/second is telling you the truth
>>>>     and the faster disk is lying to you. Look up "fsync lies" to see what
>>>>     I am referring to.
>>>>
>>>>     A spinning disk can do 100 fsync operations per second (this is done
>>>>     at the end of every batch). That is how I estimated your event size,
>>>>     40KB/second is doing 40KB / 100 =  409 bytes.
>>>>
>>>>     Once again, if you want increased performance, you should increase the
>>>>     batch size.
>>>>
>>>>     Brock
>>>>
>>>>     On Wed, Oct 10, 2012 at 11:00 AM, Jagadish Bihani
>>>>     <jagadish.bihani@pubmatic.com> wrote:
>>>>>     Hi
>>>>>
>>>>>     Yes. It is around 480 - 500 bytes.
>>>>>
>>>>>
>>>>>     On 10/10/2012 09:24 PM, Brock Noland wrote:
>>>>>>     How big are your events? Average about 400 bytes?
>>>>>>
>>>>>>     Brock
>>>>>>
>>>>>>     On Wed, Oct 10, 2012 at 5:11 AM, Jagadish Bihani
>>>>>>     <jagadish.bihani@pubmatic.com> wrote:
>>>>>>>     Hi
>>>>>>>
>>>>>>>     Thanks for the inputs Brock. After doing several
>>>>>>>     experiments, the problem eventually boiled down to disks.
>>>>>>>
>>>>>>>     -- But I had used the same configuration on all 3 machines
>>>>>>>     (so all software components are the same on all of them).
>>>>>>>     -- In the User Guide it is written that if multiple file
>>>>>>>     channel instances are active on the same agent, then
>>>>>>>     different disks are preferable. But in my case only one
>>>>>>>     file channel is active per agent.
>>>>>>>     -- The only pattern I observed is that the machines where I
>>>>>>>     got better performance have multiple disks. But I don't
>>>>>>>     understand how that helps if I have only 1 active file
>>>>>>>     channel.
>>>>>>>     -- What is the impact of the type of disk/disk device
>>>>>>>     driver on performance? I mean, I don't understand why with
>>>>>>>     1 disk I am getting 40 KB/sec and with another 2 MB/sec.
>>>>>>>
>>>>>>>     Could you please elaborate on the correlation between the
>>>>>>>     file channel and disks.
>>>>>>>
>>>>>>>     Regards,
>>>>>>>     Jagadish
>>>>>>>
>>>>>>>
>>>>>>>     On 10/09/2012 08:01 PM, Brock Noland wrote:
>>>>>>>
>>>>>>>     Hi,
>>>>>>>
>>>>>>>     Using the file channel, in terms of performance, the
>>>>>>>     number and type of disks is going to be much more
>>>>>>>     predictive of performance than CPU or RAM. Note that
>>>>>>>     consumer-level drives/controllers will give you much
>>>>>>>     "better" performance because they lie to you about when
>>>>>>>     your data is actually written to the drive. If you search
>>>>>>>     for "fsync lies" you'll find more information on this.
>>>>>>>
>>>>>>>     You probably want to increase the batch size to get better
>>>>>>>     performance.
>>>>>>>
>>>>>>>     Brock
>>>>>>>
>>>>>>>     On Tue, Oct 9, 2012 at 2:46 AM, Jagadish Bihani
>>>>>>>     <jagadish.bihani@pubmatic.com> wrote:
>>>>>>>
>>>>>>>     Hi
>>>>>>>
>>>>>>>     My flume setup is:
>>>>>>>
>>>>>>>     Source Agent : cat source - File Channel - Avro Sink
>>>>>>>     Dest Agent :     avro source - File Channel - HDFS Sink.
>>>>>>>
>>>>>>>     There is only 1 source agent and 1 destination agent.
>>>>>>>
>>>>>>>     I measure throughput as the amount of data written to HDFS
>>>>>>>     per second. (I have a rolling interval of 30 sec; so if a
>>>>>>>     60 MB file is generated in 30 sec, the throughput is
>>>>>>>     2 MB/sec.)
>>>>>>>
>>>>>>>     I have run the source agent on various machines with
>>>>>>>     different hardware configurations.
>>>>>>>     (In all cases I run the flume agent with JAVA_OPTS as
>>>>>>>     "-Xms500m -Xmx1g -Dcom.sun.management.jmxremote
>>>>>>>     -XX:MaxDirectMemorySize=2g")
>>>>>>>
>>>>>>>     The JDK is 32 bit.
>>>>>>>
>>>>>>>     Experiment 1:
>>>>>>>     =====
>>>>>>>     RAM : 16 GB
>>>>>>>     Processor: Intel Xeon E5620 @ 2.40 GHz (16 cores).
>>>>>>>     64 bit Processor with 64 bit Kernel.
>>>>>>>     Throughput: 2 MB/sec
>>>>>>>
>>>>>>>     Experiment 2:
>>>>>>>     ======
>>>>>>>     RAM : 4 GB
>>>>>>>     Processor: Intel Xeon E5504 @ 2.00GHz (4 cores).
>>>>>>>     64 bit Processor with 32 bit Kernel.
>>>>>>>     Throughput : 30 KB/sec
>>>>>>>
>>>>>>>     Experiment 3:
>>>>>>>     ======
>>>>>>>     RAM : 8 GB
>>>>>>>     Processor: Intel Xeon E5520 @ 2.27 GHz (16 cores).
>>>>>>>     64 bit Processor with 32 bit Kernel.
>>>>>>>     Throughput : 80 KB/sec
>>>>>>>
>>>>>>>     -- So as can be seen, there is a huge difference in
>>>>>>>     throughput with the same configuration but different
>>>>>>>     hardware.
>>>>>>>     -- In the first case, where throughput is higher, RES is
>>>>>>>     around 160 MB; in the other cases it is in the range of
>>>>>>>     40 MB - 50 MB.
>>>>>>>
>>>>>>>     Can anybody please give insights into why there is this
>>>>>>>     huge difference in throughput?
>>>>>>>     What is the correlation between RAM and file channel/HDFS
>>>>>>>     sink performance, and with a 32 bit/64 bit kernel?
>>>>>>>
>>>>>>>     Regards,
>>>>>>>     Jagadish
>>>>>>>
>>>>>>>
>>>>>>>
>
>
>
>
> -- 
> Apache MRUnit - Unit testing MapReduce - 
> http://incubator.apache.org/mrunit/

