spark-dev mailing list archives

From Reynold Xin <r...@databricks.com>
Subject Re: Record metadata with RDDs and DataFrames
Date Wed, 15 Jul 2015 17:36:35 GMT
How about just using two fields, one boolean field to mark good/bad, and
another to get the source file?
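
A minimal sketch of that two-field layout, in plain Python rather than
Spark so it runs on its own. The names TaggedRecord, parse_line, is_good,
and source_file are hypothetical, and the integer payload is only a
stand-in for whatever the real records contain. Because the fields are flat,
the same shape could in principle be carried as ordinary DataFrame columns.

```python
from typing import NamedTuple, Optional


class TaggedRecord(NamedTuple):
    """One record plus the two metadata fields (hypothetical names)."""
    is_good: bool          # boolean marker: True if parsing succeeded
    source_file: str       # file the record came from
    value: Optional[int]   # parsed payload, None when parsing failed
    raw: str               # original text, kept so bad records can be logged


def parse_line(source_file: str, line: str) -> TaggedRecord:
    """Parse one line; a bad line is tagged rather than raising."""
    try:
        return TaggedRecord(True, source_file, int(line.strip()), line)
    except ValueError:
        return TaggedRecord(False, source_file, None, line)


# Made-up input: (file name, line text) pairs.
lines = [("a.txt", "1"), ("a.txt", "oops"), ("b.txt", "3")]
records = [parse_line(name, text) for name, text in lines]

# Filter on the boolean field at whatever stage is convenient.
good = [r for r in records if r.is_good]
bad = [r for r in records if not r.is_good]

for r in bad:
    print(f"bad record in {r.source_file}: {r.raw!r}")
```

The bad record stays in the data set with its source file attached, so one
malformed line does not stop the pipeline and can be filtered out and logged
later.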


On Wed, Jul 15, 2015 at 10:31 AM, RJ Nowling <rnowling@gmail.com> wrote:

> Hi all,
>
> I'm working on an ETL task with Spark.  As part of this work, I'd like to
> mark records with some info such as:
>
> 1. Whether the record is good or bad (e.g., Either)
> 2. Originating file and lines
>
> Part of my motivation is to prevent errors with individual records from
> stopping the entire pipeline.  I'd also like to filter out and log bad
> records at various stages.
>
> I could use RDD[Either[T]] for everything but that won't work for
> DataFrames.  I was wondering if anyone has had a similar situation and if
> they found elegant ways to handle this?
>
> Thanks,
> RJ
>
