flume-user mailing list archives

From David Sinclair <dsincl...@chariotsolutions.com>
Subject Re: Roll based on date
Date Fri, 25 Oct 2013 14:20:21 GMT
Does the metrics endpoint show that events are still coming into this sink?

http://<hostname of agent>:41414/metrics

Also, can you post the rest of the config?
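
If the metrics HTTP server isn't already running, it can be enabled when the
agent is started and then queried with curl. A rough sketch (the agent and
sink names here are placeholders for whatever your config uses):

flume-ng agent -n agent -c conf -f flume.conf \
    -Dflume.monitoring.type=http -Dflume.monitoring.port=41414

curl http://<hostname of agent>:41414/metrics
# look for SINK.<sink name>.EventDrainSuccessCount increasing over time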


On Thu, Oct 24, 2013 at 10:09 PM, Martinus m <martinus787@gmail.com> wrote:

> Hi David,
>
> Almost every few seconds.
>
> Thanks.
>
> Martinus
>
>
> On Thu, Oct 24, 2013 at 9:49 PM, David Sinclair <
> dsinclair@chariotsolutions.com> wrote:
>
>> How often are your events coming in?
>>
>>
>> On Thu, Oct 24, 2013 at 2:21 AM, Martinus m <martinus787@gmail.com>wrote:
>>
>>> Hi David,
>>>
>>> Thanks for the example. I have set it just like above, but it only
>>> generates files for the first 15 minutes. After waiting for more than one
>>> hour, there is no update at all in the S3 bucket.
>>>
>>> Thanks.
>>>
>>> Martinus
>>>
>>>
>>> On Wed, Oct 23, 2013 at 8:48 PM, David Sinclair <
>>> dsinclair@chariotsolutions.com> wrote:
>>>
>>>> You can set all of the time/size-based rolling policies to zero and set
>>>> an idle timeout on the sink. The config below uses a 15-minute timeout.
>>>>
>>>> agent.sinks.sink.hdfs.fileSuffix = FlumeData.%Y-%m-%d
>>>> agent.sinks.sink.hdfs.fileType = DataStream
>>>> agent.sinks.sink.hdfs.rollInterval = 0
>>>> agent.sinks.sink.hdfs.rollSize = 0
>>>> agent.sinks.sink.hdfs.batchSize = 0
>>>> agent.sinks.sink.hdfs.rollCount = 0
>>>> agent.sinks.sink.hdfs.idleTimeout = 900
>>>>
>>>>
>>>>
>>>> On Tue, Oct 22, 2013 at 10:17 PM, Martinus m <martinus787@gmail.com>wrote:
>>>>
>>>>> Hi David,
>>>>>
>>>>> The requirement is actually only to roll per day.
>>>>>
>>>>> Hi Devin,
>>>>>
>>>>> Thanks for sharing your experience. I also tried to set the config as
>>>>> follows:
>>>>>
>>>>> agent.sinks.sink.hdfs.fileSuffix = FlumeData.%Y-%m-%d
>>>>> agent.sinks.sink.hdfs.fileType = DataStream
>>>>> agent.sinks.sink.hdfs.rollInterval = 0
>>>>> agent.sinks.sink.hdfs.rollSize = 0
>>>>> agent.sinks.sink.hdfs.batchSize = 15000
>>>>> agent.sinks.sink.hdfs.rollCount = 0
>>>>>
>>>>> But I didn't see anything in the S3 bucket, so I guess I need to change
>>>>> the rollInterval to 86400. In my understanding, rollInterval = 86400
>>>>> will roll the file after 24 hours like you said, but it will not
>>>>> generate a new file when the date changes before the 24-hour interval
>>>>> has passed (unless we put the date into the fileSuffix as above).
>>>>>
>>>>> Thanks to both of you.
>>>>>
>>>>> Best regards,
>>>>>
>>>>> Martinus
>>>>>
>>>>>
>>>>> On Tue, Oct 22, 2013 at 11:16 PM, DSuiter RDX <dsuiter@rdx.com> wrote:
>>>>>
>>>>>> Martinus, you have to set all of the other roll options to 0 explicitly
>>>>>> in the configuration if you want the sink to roll on only one
>>>>>> parameter; otherwise it rolls on whichever trigger it reaches first. If
>>>>>> you want it to roll once a day, you have to specifically disable all
>>>>>> the other roll triggers - they all take default settings unless told
>>>>>> not to. When I was experimenting, for example, it kept rolling every 30
>>>>>> seconds even though I had hdfs.rollSize set to 64MB (our test data is
>>>>>> generated slowly), so I ended up with a pile of small (0.2KB - ~19KB)
>>>>>> files in a bunch of directories sorted by timestamp in ten-minute
>>>>>> intervals.
>>>>>>
>>>>>> So, maybe a conf like this:
>>>>>>
>>>>>> agent.sinks.sink.type = hdfs
>>>>>> agent.sinks.sink.channel = channel
>>>>>> agent.sinks.sink.hdfs.path = (desired path string, yours looks fine)
>>>>>> agent.sinks.sink.hdfs.fileSuffix = .avro
>>>>>> agent.sinks.sink.serializer = avro_event
>>>>>> agent.sinks.sink.hdfs.fileType = DataStream
>>>>>> agent.sinks.sink.hdfs.rollInterval = 86400
>>>>>> agent.sinks.sink.hdfs.rollSize = 134217728
>>>>>> agent.sinks.sink.hdfs.batchSize = 15000
>>>>>> agent.sinks.sink.hdfs.rollCount = 0
>>>>>>
>>>>>> This one will roll the file in HDFS every 24 hours or when the file
>>>>>> reaches 128MB, and writes events to HDFS in batches of 15000. But if
>>>>>> the hdfs.rollCount line were not set to "0" or some higher value (I
>>>>>> probably could have set it to 15000 to match hdfs.batchSize for the
>>>>>> same result), the file would roll as soon as the default of only 10
>>>>>> events had been written to it.
>>>>>>
>>>>>> Are you using a 1-tier or 2-tier design for this? For syslog, we
>>>>>> collect with a syslogTCP source from a remote host, which then goes to
>>>>>> an Avro sink to aggregate the small event entries into larger Avro
>>>>>> files. A second tier then collects that with an Avro source and writes
>>>>>> it out with an HDFS sink. So we get all the individual events streamed
>>>>>> into an Avro container, and the Avro container is put into HDFS every
>>>>>> 24 hours or when it hits 128 MB. We were getting many small files
>>>>>> because of the lower velocity of our sample set, and we did not want to
>>>>>> clutter up the FSImage. The Avro serializer and DataStream file type
>>>>>> are also necessary, because the default behavior of the HDFS sink is to
>>>>>> write SequenceFile format.
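>>>>>>
>>>>>> Roughly, a minimal sketch of that two-tier layout could look like the
>>>>>> following - the agent names, host name, and ports here are only
>>>>>> placeholders, not our actual values:
>>>>>>
>>>>>> # Tier 1: syslogTCP source -> memory channel -> Avro sink
>>>>>> agent1.sources = syslog
>>>>>> agent1.channels = ch1
>>>>>> agent1.sinks = avroOut
>>>>>> agent1.sources.syslog.type = syslogtcp
>>>>>> agent1.sources.syslog.port = 5140
>>>>>> agent1.sources.syslog.channels = ch1
>>>>>> agent1.channels.ch1.type = memory
>>>>>> agent1.sinks.avroOut.type = avro
>>>>>> agent1.sinks.avroOut.hostname = tier2-host
>>>>>> agent1.sinks.avroOut.port = 4545
>>>>>> agent1.sinks.avroOut.channel = ch1
>>>>>>
>>>>>> # Tier 2: Avro source -> memory channel -> HDFS sink (settings as above)
>>>>>> agent2.sources = avroIn
>>>>>> agent2.channels = ch2
>>>>>> agent2.sinks = sink
>>>>>> agent2.sources.avroIn.type = avro
>>>>>> agent2.sources.avroIn.bind = 0.0.0.0
>>>>>> agent2.sources.avroIn.port = 4545
>>>>>> agent2.sources.avroIn.channels = ch2
>>>>>> agent2.channels.ch2.type = memory
>>>>>> agent2.sinks.sink.type = hdfs
>>>>>> agent2.sinks.sink.channel = ch2
>>>>>> (plus the hdfs.* and serializer settings from the config above)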
>>>>>>
>>>>>> Hope this helps you out.
>>>>>>
>>>>>> Sincerely,
>>>>>> *Devin Suiter*
>>>>>> Jr. Data Solutions Software Engineer
>>>>>> 100 Sandusky Street | 2nd Floor | Pittsburgh, PA 15212
>>>>>> Google Voice: 412-256-8556 | www.rdx.com
>>>>>>
>>>>>>
>>>>>> On Tue, Oct 22, 2013 at 10:07 AM, David Sinclair <
>>>>>> dsinclair@chariotsolutions.com> wrote:
>>>>>>
>>>>>>> Do you need to roll based on size as well? Can you tell me the
>>>>>>> requirements?
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Oct 22, 2013 at 2:15 AM, Martinus m <martinus787@gmail.com>wrote:
>>>>>>>
>>>>>>>> Hi David,
>>>>>>>>
>>>>>>>> Thanks for your answer. I already did that, but using %Y-%m-%d.
>>>>>>>> But since there is still a roll based on size, it keeps generating
>>>>>>>> two or more FlumeData.%Y-%m-%d files with different suffixes.
>>>>>>>>
>>>>>>>> Thanks.
>>>>>>>>
>>>>>>>> Martinus
>>>>>>>>
>>>>>>>>
>>>>>>>> On Fri, Oct 18, 2013 at 10:35 PM, David Sinclair <
>>>>>>>> dsinclair@chariotsolutions.com> wrote:
>>>>>>>>
>>>>>>>>> The SyslogTcpSource will put a header named 'timestamp' on the Flume
>>>>>>>>> event. This timestamp will be taken from the syslog entry. You could
>>>>>>>>> then set the filePrefix in the sink to pull this out.
>>>>>>>>> For example
>>>>>>>>>
>>>>>>>>> tier1.sinks.hdfsSink.hdfs.filePrefix = FlumeData.%{timestamp}
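>>>>>>>>>
>>>>>>>>> The date escape sequences in the sink resolve against that same
>>>>>>>>> timestamp header, so if you want a per-day name rather than the raw
>>>>>>>>> epoch value, something along these lines should also work:
>>>>>>>>>
>>>>>>>>> tier1.sinks.hdfsSink.hdfs.filePrefix = FlumeData.%Y-%m-%d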
>>>>>>>>>
>>>>>>>>> dave
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Thu, Oct 17, 2013 at 10:23 PM, Martinus m <
>>>>>>>>> martinus787@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi David,
>>>>>>>>>>
>>>>>>>>>> It's syslogtcp.
>>>>>>>>>>
>>>>>>>>>> Thanks.
>>>>>>>>>>
>>>>>>>>>> Martinus
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Thu, Oct 17, 2013 at 9:17 PM, David Sinclair <
>>>>>>>>>> dsinclair@chariotsolutions.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> What type of source are you using?
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Oct 16, 2013 at 9:56 PM, Martinus m <
>>>>>>>>>>> martinus787@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi,
>>>>>>>>>>>>
>>>>>>>>>>>> Is there any option in the HDFS sink to start rolling a new
>>>>>>>>>>>> file whenever the date in the log changes? For example, I got
>>>>>>>>>>>> the logs below:
>>>>>>>>>>>>
>>>>>>>>>>>> Oct 16 23:58:56 test-host : just test
>>>>>>>>>>>> Oct 16 23:59:51 test-host : test again
>>>>>>>>>>>> Oct 17 00:00:56 test-host : just test
>>>>>>>>>>>> Oct 17 00:00:56 test-host : test again
>>>>>>>>>>>>
>>>>>>>>>>>> Then I want it to produce a file in the S3 bucket with a result
>>>>>>>>>>>> like this:
>>>>>>>>>>>>
>>>>>>>>>>>> FlumeData.2013-10-16.1381916293017 <-- all the logs from Oct 16,
>>>>>>>>>>>> 2013 go here, and when Oct 17, 2013 is reached, it starts to
>>>>>>>>>>>> sink into a new file:
>>>>>>>>>>>>
>>>>>>>>>>>> FlumeData.2013-10-17.1381940047117
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks.
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
