spark-user mailing list archives

From Antonio Murgia <>
Subject Re: Best practices to handle corrupted records
Date Thu, 15 Oct 2015 15:31:17 GMT
'Either' does not cover the case where the outcome was successful but generated warnings. I
had already looked into it, and also at 'Try', which inspired my design. Thanks for pointing it
out anyway!


On 15 Oct 2015, at 16:19, Erwan ALLAIN wrote:

What about ?

On Thu, Oct 15, 2015 at 2:57 PM, Roberto Congiu wrote:
I came to a similar solution to a similar problem. I deal with a lot of CSV files from many
different sources and they are often malformed.
However, I just have success/failure. Maybe you should make SuccessWithWarnings a subclass
of success, or get rid of it altogether and make the warnings optional.
I was thinking of making this cleaning/conforming library open source if you're interested.
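
As a rough illustration of the "warnings optional" idea, here is a minimal Scala sketch with only two cases. The names `Outcome`, `Ok`, `Failed` and `sample` are assumptions made up for this example; they do not come from either library discussed in this thread.

```scala
// Hypothetical two-case model: a success carries an optional list of warnings.
sealed trait Outcome[+A]
final case class Ok[+A](value: A, warnings: Seq[String] = Nil) extends Outcome[A]
final case class Failed(reason: String) extends Outcome[Nothing]

// Illustrative data: a clean success, a failure, and a success with one warning.
val sample: Seq[Outcome[Int]] = Seq(Ok(1), Failed("bad row"), Ok(2, Seq("age missing")))

// Keep only the successful values, ignoring warnings and failures.
val kept: Seq[Int] = sample.collect { case Ok(v, _) => v }

// Gather the warnings attached to successful records.
val warned: Seq[String] = sample.collect { case Ok(_, ws) => ws }.flatten
```

With this shape a clean success is just `Ok(value)`, so callers that do not care about warnings never have to mention them.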


2015-10-15 5:28 GMT-07:00 Antonio Murgia:
I looked around on the web and couldn't find any structured way to deal with malformed/faulty
records during computation. All I was able to find was the flatMap/Some/None technique plus logging.
I'm facing this problem because my processing algorithm extracts more than one
value from each record, but it can fail to extract one of those values, and I want
to keep track of these failures. Logging is not feasible because this "warning" happens so frequently
that the logs would become overwhelming and impossible to read.
Since my processing has 3 possible outcomes, I modeled it with a class
that holds a result and/or warnings.
Since Result implements Traversable, it can be used in a flatMap, which discards all warnings and
failure results. On the other hand, if we want to keep track of warnings, we can process
them and output them as needed.
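
For illustration, a minimal Scala sketch of a three-outcome result type along these lines might look like the following. Apart from `Result` and `SuccessWithWarnings`, which are mentioned in this thread, every name here (`Success`, `Failure`, `messages`, `toOption`, `parse`, the sample lines) is an assumption made up for the example. The sketch also exposes `toOption` instead of implementing Traversable, which keeps it short and independent of the Scala collections version.

```scala
import scala.util.Try

// Hypothetical model of the three outcomes: success, success with warnings, failure.
sealed trait Result[+A] {
  // Keeps the value for both success cases and drops failures.
  def toOption: Option[A] = this match {
    case Success(v)                => Some(v)
    case SuccessWithWarnings(v, _) => Some(v)
    case Failure(_)                => None
  }
  // Warnings or errors attached to this outcome (empty for a clean success).
  def messages: Seq[String] = this match {
    case Success(_)                 => Nil
    case SuccessWithWarnings(_, ws) => ws
    case Failure(es)                => es
  }
}
final case class Success[+A](value: A) extends Result[A]
final case class SuccessWithWarnings[+A](value: A, warnings: Seq[String]) extends Result[A]
final case class Failure(errors: Seq[String]) extends Result[Nothing]

// Hypothetical parser for "id,amount" lines: a missing id is a failure,
// an unparseable amount is only a warning.
def parse(line: String): Result[(String, Option[Double])] = {
  val fields = line.split(",", -1).map(_.trim)
  val id = fields(0)
  if (id.isEmpty) Failure(Seq(s"missing id in '$line'"))
  else {
    val amount = if (fields.length > 1) Try(fields(1).toDouble).toOption else None
    amount match {
      case Some(a) => Success((id, Some(a)))
      case None    => SuccessWithWarnings((id, None), Seq(s"unparseable amount for id '$id'"))
    }
  }
}

val results = List("a,1.5", "b,oops", ",3").map(parse)

// Discards failures and warnings, keeping only the extracted values.
val values = results.flatMap(_.toOption)

// Collects warnings and errors separately instead of logging each one.
val messages = results.flatMap(_.messages)
```

The same two `flatMap` calls should carry over to an RDD of `Result` values, since an `Option` is accepted wherever Spark's `flatMap` expects a collection; that has not been tried on a real cluster here.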

Kind Regards

"Good judgment comes from experience.
Experience comes from bad judgment"
