spark-user mailing list archives

From Antonio Murgia <antonio.murg...@studio.unibo.it>
Subject Best practices to handle corrupted records
Date Thu, 15 Oct 2015 12:28:41 GMT
Hello,
I looked around on the web and I couldn't find any structured way to deal with
malformed/faulty records during computation. All I was able to find was the flatMap/Some/None
technique plus logging.
I'm facing this problem because I have a processing algorithm that extracts more than one
value from each record, but it can fail to extract one of those values, and I want
to keep track of such failures. Logging is not feasible because this "warning" happens so
frequently that the logs would become overwhelming and impossible to read.
Since I have 3 different possible outcomes from my processing I modeled it with this class
hierarchy:
[inline image: class-hierarchy diagram]
The hierarchy holds the result and/or warnings. Since Result implements Traversable, it can
be used in a flatMap, discarding all warnings and failures; on the other hand, if we want to
keep track of the warnings, we can collect them and output them as needed.
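A minimal sketch of what such a hierarchy might look like (all names here are my own
assumptions, since the original diagram is an inline image; recent Scala versions deprecate
Traversable, so this sketch extends Iterable, which behaves the same way in flatMap):

```scala
import scala.util.Try

// Hypothetical three-outcome result type: a full value, a partial value
// with warning messages, or a failure that yields nothing.
sealed trait Result[+A] extends Iterable[A]

case class Success[A](value: A) extends Result[A] {
  def iterator: Iterator[A] = Iterator.single(value)
}

case class Warning[A](value: A, messages: Seq[String]) extends Result[A] {
  // A Warning still carries a usable value, so it is kept by flatMap.
  def iterator: Iterator[A] = Iterator.single(value)
}

case object Failure extends Result[Nothing] {
  // A Failure yields nothing, so flatMap silently drops it.
  def iterator: Iterator[Nothing] = Iterator.empty
}

// Hypothetical extraction step that can partially succeed.
def parse(line: String): Result[Int] =
  Try(line.toInt).toOption match {
    case Some(n) if n >= 0 => Success(n)
    case Some(n)           => Warning(n, Seq(s"negative value: $n"))
    case None              => Failure
  }

val records = Seq("3", "-1", "oops", "7")

// flatMap keeps every extracted value and drops failures...
val values = records.flatMap(parse)        // Seq(3, -1, 7)

// ...while the warnings can still be collected separately when needed.
val warnings =
  records.map(parse).collect { case Warning(_, msgs) => msgs }.flatten
```

The same flatMap works unchanged on a Spark RDD or Dataset, since only the
element type needs to be Traversable/Iterable.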

Kind Regards
#A.M.