incubator-chukwa-user mailing list archives

From Ariel Rabkin <>
Subject Re: duplicate data
Date Fri, 19 Mar 2010 04:54:47 GMT
The sequence ID of a chunk is, by default, the offset in the file of
its first byte.  We do some fairly complex hacks for file rotation, to
make sure that the IDs continue growing monotonically in that case.
If you start a tailer on a file and leave it running, each line will
get numbered uniquely. If you stop it and then start a new one at the
beginning of the file, you'll get duplicate data.

If you start a tailer, stop it, modify or overwrite the file, and then
start a new tailer, you'll get spurious duplicates.
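To make the mechanism concrete, here is a minimal sketch in Python (not Chukwa's actual code, which is Java; the function and field names here are invented for illustration) of deduplication keyed on the (stream name, sequence ID) pair, with the sequence ID taken from the byte offset:

```python
# Illustrative sketch only -- not Chukwa's implementation.
# Duplicate detection keys on (stream name, sequence ID);
# chunk content is never inspected.

def make_chunk(stream, data, offset):
    # By default the sequence ID is the offset in the file
    # of the chunk's first byte.
    return {"stream": stream, "seq_id": offset, "data": data}

def dedup(chunks):
    seen = set()
    unique = []
    for c in chunks:
        key = (c["stream"], c["seq_id"])
        if key not in seen:  # content plays no role in the comparison
            seen.add(key)
            unique.append(c)
    return unique

chunks = [
    make_chunk("app.log", "line one", 0),
    make_chunk("app.log", "line two", 9),
    # Re-tailing from offset 0 after the file was modified produces
    # the same key as the first chunk, so it is dropped even though
    # the content differs -- a "spurious duplicate".
    make_chunk("app.log", "LINE ONE, edited", 0),
]
print(len(dedup(chunks)))  # 2
```

This shows why restarting a tailer at the beginning of an unchanged file re-sends duplicate keys, and why a modified file re-tailed from offset 0 is silently collapsed into the earlier data.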


On Thu, Mar 18, 2010 at 9:50 PM, Corbin Hoenes <> wrote:
> So in this scenario the stream name should be the same, but how do sequence IDs get generated?
> If I tried to tail the same log file 24 hours after doing it the first time, would they have
> the same seq id?
> On Mar 18, 2010, at 11:24 AM, Ariel Rabkin wrote:
>> Howdy,
>> Chukwa does duplicate detection as follows: Each Chunk of data comes
>> with a stream name (such as the name of a log file) and a sequence ID.
>> If two chunks have the same name and ID, they're duplicates.  The
>> content isn't inspected.
>> So in your example, the former will be treated as a duplicate, not the latter.
>> --Ari
>> On Thu, Mar 18, 2010 at 8:59 AM, Corbin Hoenes <> wrote:
>>> Does anyone have more information about how Chukwa removes duplicates during
>>> demux? How does it decide what is a duplicate?  There are two cases I am thinking of...
>>> 1 - we send the same log file to chukwa 2x
>>> 2 - we have the exact same line in a log file 2x
>> --
>> Ari Rabkin
>> UC Berkeley Computer Science Department

Ari Rabkin
UC Berkeley Computer Science Department
