flume-user mailing list archives

From Dominik Hübner <cont...@dhuebner.com>
Subject Flume timestamp partitioning overlaps
Date Wed, 08 Jul 2015 08:23:42 GMT
I am using Cloudera’s example source to collect a sample of Twitter’s stream, partitioned
by year -> month -> day -> hour:
https://github.com/cloudera/cdh-twitter-example/blob/master/flume-sources/src/main/java/com/cloudera/flume/source/TwitterSource.java

The timestamp of an event is set by:
headers.put("timestamp", String.valueOf(status.getCreatedAt().getTime()));

My agent config:
TwitterAgent.sinks.HDFS.hdfs.path=hdfs://kronos.feeb.co:8020/user/flume/tweets/%Y/%m/%d/%H/
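
To make the expected mapping concrete, here is a minimal sketch of how an epoch-millisecond "timestamp" header would resolve to a %Y/%m/%d/%H directory. It assumes the path is derived from that header alone, in one fixed time zone (UTC in the example). The class HourBucket and the method pathFor are hypothetical names for illustration and are not part of Flume.

```java
import java.time.Instant;
import java.time.ZoneId;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

// Hypothetical helper, not part of Flume: mirrors the %Y/%m/%d/%H layout
// so a header value can be checked against the directory it landed in.
public class HourBucket {
    private static final DateTimeFormatter FMT =
            DateTimeFormatter.ofPattern("yyyy/MM/dd/HH");

    // timestampHeader is the string stored under the "timestamp" header,
    // i.e. milliseconds since the epoch.
    static String pathFor(String timestampHeader, ZoneId zone) {
        long millis = Long.parseLong(timestampHeader);
        return FMT.withZone(zone).format(Instant.ofEpochMilli(millis));
    }

    public static void main(String[] args) {
        String lastSecond = String.valueOf(
                Instant.parse("2015-07-08T07:59:59Z").toEpochMilli());
        String firstSecond = String.valueOf(
                Instant.parse("2015-07-08T08:00:00Z").toEpochMilli());

        // One second apart, but on opposite sides of the hour boundary.
        System.out.println(pathFor(lastSecond, ZoneOffset.UTC));  // 2015/07/08/07
        System.out.println(pathFor(firstSecond, ZoneOffset.UTC)); // 2015/07/08/08
    }
}
```

Under these assumptions a record stamped hh:59:59 belongs in the earlier hour's directory, so comparing pathFor on a stray record's header with the directory it was actually written to shows whether the header or the path resolution is off.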

However, in almost every hour there is at least one record (more often several)
from the last second of the previous hour.

Is there any way to prevent these overlaps in the data?
They make hourly aggregation without dropping data unnecessarily messy.