spark-dev mailing list archives

From RJ Nowling <rnowl...@gmail.com>
Subject Record metadata with RDDs and DataFrames
Date Wed, 15 Jul 2015 17:31:28 GMT
Hi all,

I'm working on an ETL task with Spark.  As part of this work, I'd like to
mark records with some info such as:

1. Whether the record is good or bad (e.g., Either)
2. Originating file and line numbers

Part of my motivation is to prevent errors with individual records from
stopping the entire pipeline.  I'd also like to filter out and log bad
records at various stages.

I could use RDD[Either[E, T]] for everything, but that won't work for
DataFrames.  I was wondering whether anyone has faced a similar situation
and found an elegant way to handle it.
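
A minimal sketch of the RDD-side idea described above, for concreteness.
All names here (Source, Tagged, parseLine, input.txt) are hypothetical and
chosen only for illustration, and a plain Scala Seq stands in for the RDD so
the snippet can be pasted into a Scala REPL without a cluster; the same
map/collect calls have the same shape on an RDD.  It does not address the
DataFrame case.

```scala
// Hypothetical types for illustration; none of these are Spark APIs.

// Where a record came from.
case class Source(file: String, line: Long)

// A record plus its provenance: Left holds an error message,
// Right holds the successfully parsed value.
case class Tagged[T](source: Source, result: Either[String, T])

// Example parser: a malformed line becomes a Left instead of throwing,
// so one bad record cannot stop the whole pipeline.
def parseLine(file: String, line: Long, text: String): Tagged[Int] = {
  val result: Either[String, Int] =
    try Right(text.trim.toInt)
    catch { case _: NumberFormatException => Left(s"not an integer: '$text'") }
  Tagged(Source(file, line), result)
}

// Stand-in for reading a file; line numbers are 1-based.
val lines = Seq("1", "2", "oops", "4")
val tagged = lines.zipWithIndex.map { case (text, i) =>
  parseLine("input.txt", i + 1L, text)
}

// Keep the good values for the next stage.
val good: Seq[Int] = tagged.collect { case Tagged(_, Right(v)) => v }

// Pull out the bad records together with where they came from.
val bad: Seq[(Source, String)] =
  tagged.collect { case Tagged(src, Left(err)) => (src, err) }

// Log the bad records; prints: input.txt:3: not an integer: 'oops'
bad.foreach { case (src, err) => println(s"${src.file}:${src.line}: $err") }
```

Pattern matching on Left and Right is used instead of map or flatMap on the
Either, so the sketch does not depend on whether Either is right-biased in
the Scala version at hand.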

Thanks,
RJ
