flume-user mailing list archives

From lohit <lohit.vijayar...@gmail.com>
Subject Re: HDFS Sink performance
Date Thu, 16 Jul 2015 05:43:12 GMT
Thanks for the information, Roshan. I was able to find your email.
From your experiment, the best you could get was 538K messages for a single
agent, which you mentioned was about 250MB/s. Do you know what the
compression ratio was? Also, how much memory did you give the agent?
These numbers are similar to what we are seeing. With 2 sinks we see about
50K events per second (1K messages), so ~50MB/s.

2015-07-15 13:45 GMT-07:00 Roshan Naik <roshan@hortonworks.com>:

>  Yes.. My bad.. Been meaning to do it… will try to do it this week.
> -roshan
>
>   From: Hari Shreedharan <hshreedharan@cloudera.com>
> Reply-To: "user@flume.apache.org" <user@flume.apache.org>
> Date: Wednesday, July 15, 2015 1:41 PM
>
> To: "user@flume.apache.org" <user@flume.apache.org>
> Subject: Re: HDFS Sink performance
>
>   Roshan - how about posting that on the Flume wiki?
>
>
> Thanks,
> Hari
>
> On Wed, Jul 15, 2015 at 1:07 PM, Roshan Naik <roshan@hortonworks.com>
> wrote:
>
>>  Lohit,
>> You may want to search the mailing list for 'Flume perf measurements' .
>> You should find the recent measurements I posted.
>> -roshan
>>
>>   From: lohit <lohit.vijayarenu@gmail.com>
>> Reply-To: "user@flume.apache.org" <user@flume.apache.org>
>> Date: Wednesday, July 15, 2015 11:19 AM
>> To: "user@flume.apache.org" <user@flume.apache.org>
>> Subject: Re: HDFS Sink performance
>>
>>   Thanks for the reply, Hari. Multiple sinks make sense, but this would
>> also mean there are a lot more files on HDFS. I will try multiple sinks and
>> see how fast this can go.
>> Given that a single HDFS stream can do much higher throughput, maybe there
>> is a way to have a thread pool for SinkRunner-PollingRunner-DefaultSinkProcessor
>> instead of a single thread per sink.
>>
>> 2015-07-15 11:11 GMT-07:00 Hari Shreedharan <hshreedharan@cloudera.com>:
>>
>>> Hi Lohit,
>>>
>>>  HDFS sinks (in fact, most sinks) are single-threaded by design. This
>>> is meant to make writing the sinks easier, but all channels can handle
>>> multiple sinks reading from them. So to improve throughput, you
>>> basically configure several sinks which read off the same channel. Make
>>> sure, though, that each sink writes to files with different HDFS paths or
>>> different file prefixes (otherwise the HDFS client API will complain about leases).
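
A minimal sketch of the multi-sink setup described above, reusing the agent and channel names from the configuration quoted further down. The sink names (hdfsSink1, hdfsSink2) and the prefix values are hypothetical, and the remaining hdfs.* settings from the original sink would need to be repeated for each sink:

```properties
# Two HDFS sinks draining the same memory channel.
# Sink names and prefix values are illustrative assumptions.
agent.sinks = hdfsSink1 hdfsSink2

agent.sinks.hdfsSink1.type = hdfs
agent.sinks.hdfsSink1.channel = memoryChannel
agent.sinks.hdfsSink1.hdfs.path = /tmp/lohit
# A distinct file prefix keeps the two sinks from writing the same file.
agent.sinks.hdfsSink1.hdfs.filePrefix = sink1

agent.sinks.hdfsSink2.type = hdfs
agent.sinks.hdfsSink2.channel = memoryChannel
agent.sinks.hdfsSink2.hdfs.path = /tmp/lohit
agent.sinks.hdfsSink2.hdfs.filePrefix = sink2
```

Each sink runs in its own thread and rolls its own files, so this trades higher throughput for more files on HDFS.
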
>>>
>>>
>>> Thanks,
>>> Hari
>>>
>>> On Wed, Jul 15, 2015 at 9:10 AM, lohit <lohit.vijayarenu@gmail.com>
>>> wrote:
>>>
>>>> Hello,
>>>>
>>>>  Does anyone have numbers they can share on HDFS sink
>>>> performance? In our testing, a single sink writing to HDFS
>>>> (CompressedStream) and reading from a MemoryChannel can only do about 35000
>>>> events per second (each event is about 1K in size). After compression this
>>>> comes to a ~10MB/s write stream to the HDFS file, which is pretty low. Our
>>>> configuration looks like this:
>>>>
>>>>  agent.sinks.hdfsSink.type = hdfs
>>>> agent.sinks.hdfsSink.channel = memoryChannel
>>>> agent.sinks.hdfsSink.hdfs.path = /tmp/lohit
>>>> agent.sinks.hdfsSink.hdfs.codeC = lzo
>>>> agent.sinks.hdfsSink.hdfs.fileType = CompressedStream
>>>> agent.sinks.hdfsSink.hdfs.writeFormat = Writable
>>>> agent.sinks.hdfsSink.hdfs.rollInterval = 3600
>>>> agent.sinks.hdfsSink.hdfs.rollSize = 1073741824
>>>> agent.sinks.hdfsSink.hdfs.rollCount = 0
>>>> agent.sinks.hdfsSink.hdfs.batchSize = 10000
>>>> agent.sinks.hdfsSink.hdfs.txnEventMax = 10000
>>>>
>>>>  agent.channels.memoryChannel.type = memory
>>>>
>>>>  agent.channels.memoryChannel.capacity = 3000000
>>>> agent.channels.memoryChannel.transactionCapacity = 10000
>>>>
>>>>  --
>>>> Have a Nice Day!
>>>> Lohit
>>>>
>>>
>>>
>>
>>
>>  --
>> Have a Nice Day!
>> Lohit
>>
>
>


-- 
Have a Nice Day!
Lohit
