Mailing-List: contact dev-help@spark.apache.org; run by ezmlm
Precedence: bulk
MIME-Version: 1.0
In-Reply-To: 
 <CADtDQQK099=czsQMUi-YVwL=awC7Jihh0MoTKhB0rBBc-8MKFg@mail.gmail.com>
References: 
 <CADtDQQK099=czsQMUi-YVwL=awC7Jihh0MoTKhB0rBBc-8MKFg@mail.gmail.com>
From: Reynold Xin <rxin@databricks.com>
Date: Wed, 15 Jul 2015 10:36:35 -0700
Message-ID: 
 <CAPh_B=Y+-UDLZU9nRYiTOpEoLaz4ev0+xcKjboSUcZGLOUcWJA@mail.gmail.com>
Subject: Re: Record metadata with RDDs and DataFrames
To: RJ Nowling <rnowling@gmail.com>
Cc: "dev@spark.apache.org" <dev@spark.apache.org>
Content-Type: multipart/alternative; boundary=001a1147b63ca43233051aed6689

--001a1147b63ca43233051aed6689
Content-Type: text/plain; charset=UTF-8

How about just using two fields: a boolean field to mark each record as
good or bad, and a second field to hold the source file?
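
A minimal sketch of the two-field idea, using plain Scala collections so
it runs without Spark. The `Tagged` case class, the `TaggedRecords`
helpers, and the "a good line parses as an integer" rule are all
hypothetical names and assumptions for illustration; they are not part of
Spark.

```scala
// Hypothetical record wrapper: the two extra fields travel with every record.
// It is a flat case class with simple field types (an assumption made so the
// same shape could serve as a row with three columns).
final case class Tagged(value: String, isGood: Boolean, sourceFile: String)

object TaggedRecords {
  // Illustrative validity rule (an assumption): a good line is a non-empty
  // string that parses as an integer. A parse failure marks the record as
  // bad instead of throwing, so one bad record cannot stop the pipeline.
  def parseLine(line: String, sourceFile: String): Tagged = {
    val trimmed = line.trim
    val ok = trimmed.nonEmpty && scala.util.Try(trimmed.toInt).isSuccess
    Tagged(trimmed, ok, sourceFile)
  }

  // Split into (good, bad) so the bad records can be logged separately.
  def split(records: Seq[Tagged]): (Seq[Tagged], Seq[Tagged]) =
    records.partition(_.isGood)
}
```

On an RDD the same functions would be applied through `map` and `filter`
rather than on a `Seq`. Because the wrapper is flat, the good/bad flag and
the source file end up as ordinary fields that can be filtered on at any
stage.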


On Wed, Jul 15, 2015 at 10:31 AM, RJ Nowling <rnowling@gmail.com> wrote:

> Hi all,
>
> I'm working on an ETL task with Spark.  As part of this work, I'd like to
> mark records with some info such as:
>
> 1. Whether the record is good or bad (e.g., Either)
> 2. Originating file and lines
>
> Part of my motivation is to prevent errors with individual records from
> stopping the entire pipeline.  I'd also like to filter out and log bad
> records at various stages.
>
> I could use RDD[Either[T]] for everything but that won't work for
> DataFrames.  I was wondering if anyone has had a similar situation and if
> they found elegant ways to handle this?
>
> Thanks,
> RJ
>

--001a1147b63ca43233051aed6689--