From: Reynold Xin
Date: Wed, 15 Jul 2015 10:36:35 -0700
Subject: Re: Record metadata with RDDs and DataFrames
To: RJ Nowling
Cc: "dev@spark.apache.org"

How about just using two fields, one boolean field to mark good/bad, and
another to get the source file?

On Wed, Jul 15, 2015 at 10:31 AM, RJ Nowling wrote:

> Hi all,
>
> I'm working on an ETL task with Spark. As part of this work, I'd like to
> mark records with some info such as:
>
> 1. Whether the record is good or bad (e.g., Either)
> 2. Originating file and lines
>
> Part of my motivation is to prevent errors with individual records from
> stopping the entire pipeline. I'd also like to filter out and log bad
> records at various stages.
>
> I could use RDD[Either[T]] for everything but that won't work for
> DataFrames. I was wondering if anyone has had a similar situation and if
> they found elegant ways to handle this?
>
> Thanks,
> RJ
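
For illustration only, the two-field suggestion above could look roughly like the sketch below. It is plain Python with no Spark dependency, and it only shows the shape of a flat record that carries a good/bad flag and a source file instead of an Either wrapper. The names TaggedRecord, tag_lines, line_number, and raw, and the file name input.txt, are hypothetical and do not come from the thread. The integer parse is a stand-in for whatever per-record work the ETL job actually does.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional


@dataclass(frozen=True)
class TaggedRecord:
    """One flat record: the payload plus metadata fields (all names hypothetical)."""
    value: Optional[int]  # parsed payload; None when parsing failed
    is_good: bool         # boolean field marking the record good or bad
    source_file: str      # field naming the originating file
    line_number: int      # assumed extra field for the originating line
    raw: str              # original text, kept so bad records can be logged


def tag_lines(source_file: str, lines: Iterable[str]) -> Iterator[TaggedRecord]:
    """Parse each line as an int; a bad line yields a flagged record instead of raising."""
    for line_number, raw in enumerate(lines, start=1):
        text = raw.strip()
        try:
            yield TaggedRecord(int(text), True, source_file, line_number, text)
        except ValueError:
            # The failure is recorded on the record itself, so one bad line
            # does not stop the rest of the input from being processed.
            yield TaggedRecord(None, False, source_file, line_number, text)


records = list(tag_lines("input.txt", ["1", "two", "3"]))

# Filtering at any later stage is a plain predicate on the boolean field.
good = [r for r in records if r.is_good]
bad = [r for r in records if not r.is_good]

for r in bad:
    print(f"bad record at {r.source_file}:{r.line_number}: {r.raw!r}")
```

Because every field is a plain scalar, a record of this shape maps onto ordinary columns, which is presumably the reason the reply proposes it for DataFrames.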