incubator-chukwa-user mailing list archives

From Eric Yang <ey...@yahoo-inc.com>
Subject Re: Data loss on collector side
Date Thu, 28 Oct 2010 17:02:09 GMT



On 10/28/10 7:58 AM, "Jaydeep Ayachit" <jaydeep_ayachit@persistent.co.in>
wrote:

> As per the collector design, the collector accepts multiple chunks and writes
> each chunk to HDFS. If all the chunks are written to HDFS, the collector sends
> back a 200 status to the agent.
> If an HDFS write fails in between, the collector aborts the entire processing
> and sends an exception. This could mean that the data is partially written to
> HDFS. I have a couple of questions:
>  
> 1.      The agent does not receive response 200. Does it resend the same data
> to another collector? How does checkpointing work in this case?
> 

The agent checks for an HTTP 200 response; if it doesn't receive OK status, it
sends the data to another collector from its list.  The checkpoint is updated
only after an HTTP 200 status is received.
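That send-and-failover behavior can be sketched roughly as follows (a minimal
illustration, not Chukwa's actual agent code; `post`, the collector URLs, and
the chunk format are all hypothetical stand-ins):

```python
def send_chunk(chunk, collectors, post):
    """Try each collector in turn; the caller advances the checkpoint
    only when this returns a collector that acknowledged with HTTP 200."""
    for url in collectors:
        try:
            status = post(url, chunk)   # HTTP POST of the chunk (stand-in)
        except IOError:
            continue                    # collector unreachable: try the next one
        if status == 200:
            return url                  # acknowledged; checkpoint may advance
    return None                         # all collectors failed; retry later
```

Because the checkpoint moves only after a 200, a chunk that was partially
written before the failure may be sent again, which is what produces the
duplicates discussed below.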

> 2.      If the agent sends same data to another collector and it goes to hdfs,
> there is a duplication of some records. Are those duplicates filtered when
> preprocessor runs?

It is possible to build a preprocessor filter that removes duplicate data
within a small time window.  However, it cannot guarantee removing 100% of the
duplicates, because duplicated data can be received in different batches of the
Archive/Demux process.  I recommend removing duplicates when the data is being
indexed, where a downstream program like HBase or MySQL has a view of all the
data.
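A small-window duplicate filter of the kind described could look like the
sketch below (hypothetical: the `(source, seq_id)` key and the bounded window
are assumptions for illustration, not Chukwa's real chunk schema):

```python
from collections import OrderedDict

def dedupe(chunks, window=1000):
    """Drop chunks whose (source, seq_id) key was seen within the
    last `window` distinct keys; older duplicates slip through."""
    seen = OrderedDict()
    for source, seq_id, payload in chunks:
        key = (source, seq_id)
        if key in seen:
            continue                  # duplicate inside the window: skip it
        seen[key] = True
        if len(seen) > window:
            seen.popitem(last=False)  # evict the oldest key, bounding memory
        yield (source, seq_id, payload)
```

The bounded window is exactly why this cannot catch everything: a duplicate
that arrives after its key has been evicted, e.g. in a later Archive/Demux
batch, passes through untouched.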

> In summary what data loss happens when hdfs goes down from collector
> perspective?

When HDFS goes down, the collector exits.  Hence, it is possible to lose up to
15 seconds of data if the last flush to HDFS did not reach a datanode.  In this
case the collector will not send HTTP code 200 to the agent, and the data is
resent by the agent.  There is also a localWriter, which writes data locally on
the collector node and then uploads it to HDFS.  This assumes the collector's
local disk is more reliable than HDFS; I don't think this is a common scenario.
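The localWriter pattern, stage on local disk first, then upload, can be
sketched as below (a simplified illustration; `upload_to_hdfs` is a stand-in
for the real HDFS copy step, and retry/cleanup of stranded files is not shown):

```python
import os
import tempfile

def write_via_local(chunk, upload_to_hdfs):
    """Persist the chunk on local disk before uploading, so a brief
    HDFS outage does not lose data that was already acknowledged."""
    fd, path = tempfile.mkstemp(suffix=".chukwa")
    with os.fdopen(fd, "wb") as f:
        f.write(chunk)           # durable on local disk before any upload
    upload_to_hdfs(path)         # if this raises, the file stays on disk
    os.remove(path)              # remove the staged copy only after success
```

The trade-off is exactly the one stated above: this only helps if the
collector's local disk is more likely to survive than HDFS itself.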

Regards,
Eric

>  
> Thanks,
> Jaydeep
>  
> Jaydeep Ayachit | Persistent Systems Ltd
> Cell: +91 9822393963 | Desk: +91 712 3986747
>  
> 

