spark-user mailing list archives

From: Roberto Congiu
Subject: Re: Best practices to handle corrupted records
Date: Thu, 15 Oct 2015 12:57:12 GMT
I arrived at a similar solution to a similar problem. I deal with a lot of CSV
files from many different sources, and they are often malformed.
However, I just have success/failure. Maybe you should make
SuccessWithWarnings a subclass of success, or get rid of it altogether and
make the warnings optional.
I was thinking of open-sourcing this cleaning/conforming library, if
you're interested.


2015-10-15 5:28 GMT-07:00 Antonio Murgia:

> Hello,
> I looked around on the web and I couldn’t find any structured way to deal
> with malformed/faulty records during computation. All I was
> able to find was the flatMap/Some/None technique + logging.
> I’m facing this problem because I have a processing algorithm that
> extracts more than one value from each record, but can fail to extract
> one of those values, and I want to keep track of such failures. Logging is
> not feasible because this “warning” happens so frequently that the logs
> would become overwhelming and impossible to read.
> Since I have 3 different possible outcomes from my processing I modeled it
> with this class hierarchy:
> That holds result and/or warnings.
> Since Result implements Traversable, it can be used in a flatMap,
> discarding all warnings and failure results. On the other hand, if we want
> to keep track of warnings, we can process them and output them if we need to.
> Kind Regards
> #A.M.
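
The class hierarchy mentioned in the quoted message is not reproduced in it. As a
rough illustration only, a three-outcome Result type along those lines might look
like the sketch below. Everything in it is an assumption: the field names, the
Person record, and the parse function are hypothetical, and it extends Iterable
rather than Traversable because Traversable is deprecated in Scala 2.13.

```scala
// Hypothetical sketch; names and fields are assumptions, not the original code.
import scala.util.Try

// A Result yields zero or one value when iterated, so it can be passed to flatMap.
sealed trait Result[+T] extends Iterable[T] {
  def warnings: Seq[String]
}

// Everything was extracted cleanly.
final case class Success[+T](value: T) extends Result[T] {
  def iterator: Iterator[T] = Iterator.single(value)
  def warnings: Seq[String] = Nil
}

// A usable value was produced, but some fields could not be extracted.
final case class SuccessWithWarnings[+T](value: T, warnings: Seq[String])
    extends Result[T] {
  def iterator: Iterator[T] = Iterator.single(value)
}

// Nothing usable could be extracted; iterating yields no value.
final case class Failure(warnings: Seq[String]) extends Result[Nothing] {
  def iterator: Iterator[Nothing] = Iterator.empty
}

// Hypothetical record type with one field that is allowed to be missing.
final case class Person(name: String, age: Option[Int])

// Hypothetical parser for "name,age" lines.
def parse(line: String): Result[Person] =
  line.split(",", -1).map(_.trim) match {
    case Array(name, age) if name.nonEmpty =>
      Try(age.toInt).toOption match {
        case Some(a) => Success(Person(name, Some(a)))
        case None =>
          SuccessWithWarnings(Person(name, None), Seq(s"unparseable age: '$age'"))
      }
    case _ => Failure(Seq(s"malformed record: '$line'"))
  }

val lines = Seq("alice,30", "bob,unknown", "garbage")

// flatMap keeps only the extracted values, dropping warnings and failures.
val people = lines.flatMap(parse)

// The warnings can still be collected separately instead of being logged.
val warnings = lines.map(parse).flatMap(_.warnings)
```

The same functions should also work on an RDD, since RDD.flatMap accepts any
function that returns a TraversableOnce, and the warnings could then be written
to their own output rather than to the logs.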

"Good judgment comes from experience.
Experience comes from bad judgment"
