Date: Wed, 15 Jul 2015 12:31:28 -0500
Subject: Record metadata with RDDs and DataFrames
From: RJ Nowling
To: "dev@spark.apache.org"

Hi all,

I'm working on an ETL task with Spark. As part of this work, I'd like to mark records with some info such as:

1. Whether the record is good or bad (e.g., Either)
2. Originating file and lines

Part of my motivation is to prevent errors with individual records from stopping the entire pipeline. I'd also like to filter out and log bad records at various stages.

I could use RDD[Either[T]] for everything but that won't work for DataFrames. I was wondering if anyone has had a similar situation and if they found elegant ways to handle this?

Thanks,
RJ