chukwa-user mailing list archives

From Ariel Rabkin <>
Subject Re: duplicate data
Date Thu, 18 Mar 2010 17:24:24 GMT

Chukwa does duplicate detection as follows: Each Chunk of data comes
with a stream name (such as the name of a log file) and a sequence ID.
If two chunks have the same name and ID, they're duplicates.  The
content isn't inspected.

So of your two cases, the first (the same log file sent twice) will be
treated as a duplicate; the second (the same line appearing twice in a
log file) will not.
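
To make the rule concrete, here is a minimal sketch in Python of deduplication keyed on stream name and sequence ID. It is not Chukwa's actual code: the Chunk class, the drop_duplicates function, and the sample paths and IDs are all hypothetical, and it assumes that resending a file produces chunks with the same stream name and sequence IDs as the first send.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    """Hypothetical stand-in for a chunk; not Chukwa's real class."""
    stream_name: str   # e.g. the name of the log file
    seq_id: int        # sequence ID within that stream
    data: bytes        # payload; deliberately ignored when deduplicating


def drop_duplicates(chunks):
    """Keep the first chunk seen for each (stream_name, seq_id) pair."""
    seen = set()
    unique = []
    for chunk in chunks:
        key = (chunk.stream_name, chunk.seq_id)
        if key in seen:
            continue  # same name and ID: duplicate, content never inspected
        seen.add(key)
        unique.append(chunk)
    return unique


# Case 1: the same file is sent twice (assumed: names and IDs repeat).
first_send = [Chunk("/var/log/app.log", 100, b"line A\n"),
              Chunk("/var/log/app.log", 200, b"line B\n")]
resend = list(first_send)

# Case 2: an identical line appears twice, but under different IDs.
repeated_line = [Chunk("/var/log/other.log", 100, b"same\n"),
                 Chunk("/var/log/other.log", 200, b"same\n")]

case1 = drop_duplicates(first_send + resend)
case2 = drop_duplicates(repeated_line)
```

In this sketch the resent chunks in case 1 are dropped because their keys repeat, while both chunks in case 2 survive even though their contents are byte-for-byte identical.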


On Thu, Mar 18, 2010 at 8:59 AM, Corbin Hoenes <> wrote:
> Does anyone have more information about how chukwa removes duplicates during demux? How
> does it decide what is a duplicate?  There are two cases I am thinking of...
> 1 - we send the same log file to chukwa 2x
> 2 - we have the exact same line in a log file 2x

Ari Rabkin
UC Berkeley Computer Science Department
