flume-user mailing list archives

From David Sinclair <dsincl...@chariotsolutions.com>
Subject Re: Roll based on date
Date Wed, 23 Oct 2013 12:48:26 GMT
You can set all of the time- and size-based rolling policies to zero and set
an idle timeout on the sink. The config below uses a 15-minute timeout:

agent.sinks.sink.hdfs.fileSuffix = FlumeData.%Y-%m-%d
agent.sinks.sink.hdfs.fileType = DataStream
agent.sinks.sink.hdfs.rollInterval = 0
agent.sinks.sink.hdfs.rollSize = 0
agent.sinks.sink.hdfs.batchSize = 100
agent.sinks.sink.hdfs.rollCount = 0
agent.sinks.sink.hdfs.idleTimeout = 900
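
If you also want each day's files to land in their own directory, you can
point the path at a date pattern as well. A sketch along the same lines - the
bucket and path here are just placeholders:

agent.sinks.sink.hdfs.path = s3n://your-bucket/logs/%Y-%m-%d/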



On Tue, Oct 22, 2013 at 10:17 PM, Martinus m <martinus787@gmail.com> wrote:

> Hi David,
>
> The requirement is actually just to roll once per day.
>
> Hi Devin,
>
> Thanks for sharing your experience. I also tried setting the config as
> follows:
>
> agent.sinks.sink.hdfs.fileSuffix = FlumeData.%Y-%m-%d
> agent.sinks.sink.hdfs.fileType = DataStream
> agent.sinks.sink.hdfs.rollInterval = 0
> agent.sinks.sink.hdfs.rollSize = 0
> agent.sinks.sink.hdfs.batchSize = 15000
> agent.sinks.sink.hdfs.rollCount = 0
>
> But I didn't see anything in the S3 bucket, so I guess I need to change
> the rollInterval to 86400. In my understanding, rollInterval = 86400 will
> roll the file after 24 hours like you said, but it will not start a new
> file when the day changes before the full 24-hour interval has elapsed
> (unless we put the date pattern in the prefix/suffix as above).
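>
> A minimal sketch of what I mean, reusing the same sink name (and assuming
> the source sets the timestamp header so the escape sequences resolve):
>
> agent.sinks.sink.hdfs.filePrefix = FlumeData.%Y-%m-%d
> agent.sinks.sink.hdfs.rollInterval = 86400
> agent.sinks.sink.hdfs.rollSize = 0
> agent.sinks.sink.hdfs.rollCount = 0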
>
> Thanks to both of you.
>
> Best regards,
>
> Martinus
>
>
> On Tue, Oct 22, 2013 at 11:16 PM, DSuiter RDX <dsuiter@rdx.com> wrote:
>
>> Martinus, you have to set all the other roll options to 0 explicitly in
>> the configuration if you want the sink to roll on only one parameter;
>> otherwise it will roll on the first trigger it meets. If you want it to
>> roll once a day, you have to specifically disable all the other roll
>> triggers - they all take default settings unless told otherwise. When I
>> was experimenting, for example, it kept rolling every 30 seconds even
>> though I had hdfs.rollSize set to 64MB (our test data is generated
>> slowly). So I ended up with a pile of small (0.2KB - ~19KB) files in a
>> bunch of directories sorted by timestamp at ten-minute intervals.
>>
>> So, maybe a conf like this:
>>
>> agent.sinks.sink.type = hdfs
>> agent.sinks.sink.channel = channel
>> agent.sinks.sink.hdfs.path = (desired path string, yours looks fine)
>> agent.sinks.sink.hdfs.fileSuffix = .avro
>> agent.sinks.sink.serializer = avro_event
>> agent.sinks.sink.hdfs.fileType = DataStream
>> agent.sinks.sink.hdfs.rollInterval = 86400
>> agent.sinks.sink.hdfs.rollSize = 134217728
>> agent.sinks.sink.hdfs.batchSize = 15000
>> agent.sinks.sink.hdfs.rollCount = 0
>>
>> This one will roll the file in HDFS at 24-hour intervals or when it
>> reaches 128MB, and will flush to HDFS every 15000 events. But if the
>> hdfs.rollCount line were not set to "0" (or some higher value - I
>> probably could have set it to 15000 to match hdfs.batchSize for the same
>> result), the file would roll as soon as the default of only 10 events had
>> been written to it.
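>>
>> The alternative I mention would just swap that one line - a sketch, not
>> something I tested side by side:
>>
>> agent.sinks.sink.hdfs.rollCount = 15000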
>>
>> Are you using a 1-tier or 2-tier design for this? We collect with a
>> syslogTCP source from the remote host. It then goes to an avro sink to
>> aggregate the small event entries into larger avro files. Then a second
>> tier collects that with an avro source and writes it out with an hdfs
>> sink. So we get everything as individual events streamed into an avro
>> container, and the avro container is put into HDFS every 24 hours or when
>> it hits 128 MB. We were getting many small files because of the low
>> velocity of our sample set, and we did not want to clutter up the
>> FSImage. The avro serializer and the DataStream file type are also
>> necessary, because the default behavior of the HDFS sink is to write
>> SequenceFile format.
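>>
>> A rough sketch of the two tiers (agent names, hostnames, and ports here
>> are made up for illustration; channel definitions omitted):
>>
>> # tier 1: collect syslog, forward over avro
>> tier1.sources.syslog.type = syslogtcp
>> tier1.sources.syslog.port = 5140
>> tier1.sources.syslog.channels = ch1
>> tier1.sinks.avro.type = avro
>> tier1.sinks.avro.hostname = collector.example.com
>> tier1.sinks.avro.port = 4141
>> tier1.sinks.avro.channel = ch1
>>
>> # tier 2: receive avro, write to HDFS with the roll settings above
>> tier2.sources.avro.type = avro
>> tier2.sources.avro.bind = 0.0.0.0
>> tier2.sources.avro.port = 4141
>> tier2.sources.avro.channels = ch1
>> tier2.sinks.hdfs.type = hdfs
>> tier2.sinks.hdfs.channel = ch1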
>>
>> Hope this helps you out.
>>
>> Sincerely,
>> *Devin Suiter*
>> Jr. Data Solutions Software Engineer
>> 100 Sandusky Street | 2nd Floor | Pittsburgh, PA 15212
>> Google Voice: 412-256-8556 | www.rdx.com
>>
>>
>> On Tue, Oct 22, 2013 at 10:07 AM, David Sinclair <
>> dsinclair@chariotsolutions.com> wrote:
>>
>>> Do you need to roll based on size as well? Can you tell me the
>>> requirements?
>>>
>>>
>>>> On Tue, Oct 22, 2013 at 2:15 AM, Martinus m <martinus787@gmail.com> wrote:
>>>
>>>> Hi David,
>>>>
>>>> Thanks for your answer. I already did that, but using %Y-%m-%d. But
>>>> since rolling based on size still happens, it keeps generating two or
>>>> more FlumeData.%Y-%m-%d files with different postfixes.
>>>>
>>>> Thanks.
>>>>
>>>> Martinus
>>>>
>>>>
>>>> On Fri, Oct 18, 2013 at 10:35 PM, David Sinclair <
>>>> dsinclair@chariotsolutions.com> wrote:
>>>>
>>>>> The SyslogTcpSource will put a header on the flume event named
>>>>> 'timestamp'. This timestamp will be from the syslog entry. You could
>>>>> then set the filePrefix in the sink to grab this out.
>>>>> For example
>>>>>
>>>>> tier1.sinks.hdfsSink.hdfs.filePrefix = FlumeData.%{timestamp}
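>>>>>
>>>>> If you want the date formatted rather than the raw epoch value, the
>>>>> same escape sequences should also work off that header - a variant I
>>>>> haven't tested here:
>>>>>
>>>>> tier1.sinks.hdfsSink.hdfs.filePrefix = FlumeData.%Y-%m-%d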
>>>>>
>>>>> dave
>>>>>
>>>>>
>>>>> On Thu, Oct 17, 2013 at 10:23 PM, Martinus m <martinus787@gmail.com> wrote:
>>>>>
>>>>>> Hi David,
>>>>>>
>>>>>> It's syslogtcp.
>>>>>>
>>>>>> Thanks.
>>>>>>
>>>>>> Martinus
>>>>>>
>>>>>>
>>>>>> On Thu, Oct 17, 2013 at 9:17 PM, David Sinclair <
>>>>>> dsinclair@chariotsolutions.com> wrote:
>>>>>>
>>>>>>> What type of source are you using?
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Oct 16, 2013 at 9:56 PM, Martinus m <martinus787@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> Is there any option in the HDFS sink that makes it start rolling a
>>>>>>>> new file whenever the date in the log changes? For example, I got
>>>>>>>> the logs below:
>>>>>>>>
>>>>>>>> Oct 16 23:58:56 test-host : just test
>>>>>>>> Oct 16 23:59:51 test-host : test again
>>>>>>>> Oct 17 00:00:56 test-host : just test
>>>>>>>> Oct 17 00:00:56 test-host : test again
>>>>>>>>
>>>>>>>> Then I want it to make a file in the S3 bucket with a result like
>>>>>>>> this:
>>>>>>>>
>>>>>>>> FlumeData.2013-10-16.1381916293017 <-- all the logs from Oct 16,
>>>>>>>> 2013 will go here, and when it reaches Oct 17, 2013, it will start
>>>>>>>> to sink into a new file below:
>>>>>>>>
>>>>>>>> FlumeData.2013-10-17.1381940047117
>>>>>>>>
>>>>>>>> Thanks.
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
