As per the collector design, the collector accepts multiple chunks and writes each chunk to HDFS. Only if all the chunks are written successfully does the collector send a 200 status back to the agent.

If an HDFS write fails partway through the batch, the collector aborts the entire request and returns an exception to the agent. This means the data may be partially written to HDFS.
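To make sure I have the flow right, here is a minimal sketch of that write-then-ack behavior as I understand it (HdfsSink, Chunk, and handlePost are hypothetical stand-ins, not the actual collector classes):

    import java.io.IOException;
    import java.util.List;

    record Chunk(long seqId, byte[] data) {}

    interface HdfsSink {
        void append(Chunk chunk) throws IOException; // may fail mid-batch
    }

    class Collector {
        private final HdfsSink sink;

        Collector(HdfsSink sink) { this.sink = sink; }

        /** Returns 200 only if every chunk in the batch reached HDFS. */
        int handlePost(List<Chunk> chunks) {
            int written = 0;
            try {
                for (Chunk chunk : chunks) {
                    sink.append(chunk); // if this throws on chunk k,
                    written++;          // chunks 0..k-1 are already in HDFS
                }
                return 200; // full batch acknowledged to the agent
            } catch (IOException e) {
                // Abort: the `written` chunks are not cleaned up, so HDFS
                // is left holding a partially written batch.
                return 500;
            }
        }
    }

With that in mind, I have a couple of questions: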

1. The agent does not receive a 200 response. Does it resend the same data to another collector? How does checkpointing work in this case? (My assumption is sketched after this list.)

2. If the agent sends the same data to another collector and it is written to HDFS, some records will be duplicated. Are those duplicates filtered when the preprocessor runs? (The kind of filtering I have in mind is also sketched below.)
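For question 1, this is the agent-side behavior I am assuming: resend everything after the last acknowledged checkpoint to the next collector. Again, CollectorClient and the checkpoint handling are hypothetical names, and Chunk is the record from the sketch above:

    import java.util.List;

    interface CollectorClient {
        int post(List<Chunk> chunks); // returns the HTTP status code
    }

    class Agent {
        private long checkpointSeqId = -1; // highest seqId acknowledged with 200

        /** Tries each collector in turn; advances the checkpoint only on a 200. */
        void send(List<Chunk> buffer, List<CollectorClient> collectors) {
            List<Chunk> pending = buffer.stream()
                    .filter(c -> c.seqId() > checkpointSeqId)
                    .toList();
            if (pending.isEmpty()) return;
            for (CollectorClient collector : collectors) {
                if (collector.post(pending) == 200) {
                    checkpointSeqId = pending.get(pending.size() - 1).seqId();
                    return; // acknowledged; the checkpoint would be persisted here
                }
                // No 200: retry the same chunks on the next collector. If the
                // failed collector already wrote part of the batch to HDFS,
                // this retry is what creates the duplicates asked about in (2).
            }
        }
    }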
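And for question 2, the kind of filtering I have in mind is a simple keyed pass over the chunks; whether the preprocessor actually does anything like this is exactly what I am asking:

    import java.util.ArrayList;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    class Deduplicator {
        /** Keeps the first occurrence of each seqId and drops later copies. */
        static List<Chunk> dedupe(List<Chunk> chunks) {
            Set<Long> seen = new HashSet<>();
            List<Chunk> unique = new ArrayList<>();
            for (Chunk chunk : chunks) {
                if (seen.add(chunk.seqId())) { // add() returns false for a duplicate
                    unique.add(chunk);
                }
            }
            return unique;
        }
    }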

In summary, from the collector's perspective, what data loss can occur when HDFS goes down?

Thanks,

Jaydeep

Jaydeep Ayachit | Persistent Systems Ltd

Cell: +91 9822393963 | Desk: +91 712 3986747
