Date: Mon, 27 Jun 2016 06:51:52 +0000 (UTC)
From: "Eyal Allweil (JIRA)"
To: dev@datafu.incubator.apache.org
Subject: [jira] [Commented] (DATAFU-119) New UDF - TupleDiff

    [ https://issues.apache.org/jira/browse/DATAFU-119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15350489#comment-15350489 ]

Eyal Allweil commented on DATAFU-119:
-------------------------------------

I put up a [reviewboard|https://reviews.apache.org/r/49248/] for this.

After some internal discussions, I wonder whether the output is too specific for general use. I find it very convenient during development for comparing outputs, but it is heavily skewed towards human readability. To make the output easy to use in Pig, it should have a real schema rather than chararray: possibly something with the field names from the original tuples, but with boolean or int values to indicate change types.

I'd be happy to hear feedback about this.

> New UDF - TupleDiff
> -------------------
>
>                 Key: DATAFU-119
>                 URL: https://issues.apache.org/jira/browse/DATAFU-119
>             Project: DataFu
>          Issue Type: New Feature
>            Reporter: Eyal Allweil
>            Assignee: Eyal Allweil
>
> A UDF that, given two tuples, prints out the differences between them in human-readable form.
> This is not meant for production - we use it at PayPal for regression tests, to compare the results of two runs. Differences are calculated based on position, but the tuples' schemas are used, if available, for displaying more friendly results. If no schema is available, the output uses field numbers.
> Use it when you want a more fine-grained description of what has changed than [org.apache.pig.builtin.DIFF|https://pig.apache.org/docs/r0.14.0/func.html#diff] provides. Also, because DIFF takes as its input two bags to be compared, they must fit in memory. This UDF only takes one pair of tuples at a time, so it can run on large inputs.
> We use a macro much like the following in conjunction with this UDF:
> {noformat}
> DEFINE diff_macro(diff_macro_old, diff_macro_new, diff_macro_pk, diff_macro_ignored_field) returns diffs {
>     DEFINE TupleDiff datafu.pig.util.TupleDiff;
>
>     -- Keep the primary key next to each row wrapped as a single tuple
>     old = FOREACH $diff_macro_old GENERATE $diff_macro_pk, TOTUPLE(*) AS original;
>     new = FOREACH $diff_macro_new GENERATE $diff_macro_pk, TOTUPLE(*) AS original;
>
>     -- Full outer join, so rows present on only one side are kept
>     join_data = JOIN new BY $diff_macro_pk full, old BY $diff_macro_pk;
>
>     join_data = FOREACH join_data GENERATE TupleDiff(old::original, new::original, '$diff_macro_ignored_field') AS tupleDiff, old::original, new::original;
>
>     -- Keep only the rows for which a diff was produced
>     $diffs = FILTER join_data BY tupleDiff IS NOT NULL ;
> };
> {noformat}
> Currently, the output from the macro looks like this (when comma-separated):
> {noformat}
> added,,
> missing,,
> changed field2 field4,,
> {noformat}
> The UDF takes a variable number of parameters - the two tuples to be compared, and any number of field names or numbers to be ignored. We use this to ignore fields representing execution or creation time (the macro I've given as an example assumes only one ignored field).
> The current implementation "drills down" into tuples, but not bags or maps - tuple boundaries are indicated with parentheses, like this:
> {noformat}
> changed outerEmbeddedTuple(innerEmbeddedTuple(fieldNameThatIsDifferent) innerEmbeddedTuple(anotherFieldThatIsDifferent))
> {noformat}
> I have a few final things left to do and then I'll put it up on reviewboard.
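
For readers who want to see the comparison rules from the description in one place, here is a rough Python sketch of a position-based tuple diff. It is not the DataFu implementation (the UDF itself is written for Pig), and the names tuple_diff, changed_fields, field_name and sub_schema, as well as the list-based schema model, are made up for illustration. It assumes that "added" means the old tuple is absent, "missing" means the new tuple is absent, that identical tuples yield no output, and that ignored fields apply at every nesting level.

```python
_ABSENT = object()  # marks a position that exists in only one of the tuples


def field_name(schema, i):
    """Field name at position i, or the position itself when no schema is known."""
    if schema is None or i >= len(schema):
        return str(i)
    entry = schema[i]
    # Hypothetical schema model: a plain name, or (name, sub_schema) for a nested tuple.
    return entry if isinstance(entry, str) else entry[0]


def sub_schema(schema, i):
    """Schema of the nested tuple at position i, if the schema describes one."""
    if schema is None or i >= len(schema):
        return None
    entry = schema[i]
    return None if isinstance(entry, str) else entry[1]


def changed_fields(old, new, schema=None, ignored=()):
    """Compare two tuples position by position and list the fields that differ."""
    names = []
    for i in range(max(len(old), len(new))):
        name = field_name(schema, i)
        if name in ignored or i in ignored:
            continue  # e.g. execution or creation timestamps
        a = old[i] if i < len(old) else _ABSENT
        b = new[i] if i < len(new) else _ABSENT
        if isinstance(a, tuple) and isinstance(b, tuple):
            # Drill down into nested tuples; parentheses mark the tuple boundary.
            inner = changed_fields(a, b, sub_schema(schema, i), ignored)
            if inner:
                names.append(f"{name}({' '.join(inner)})")
        elif a != b:
            names.append(name)
    return names


def tuple_diff(old, new, schema=None, ignored=()):
    """Return 'added', 'missing', 'changed ...', or None when nothing differs."""
    if old is None and new is None:
        return None
    if old is None:
        return "added"
    if new is None:
        return "missing"
    names = changed_fields(old, new, schema, ignored)
    return "changed " + " ".join(names) if names else None


if __name__ == "__main__":
    schema = ["field1", "field2", "field3", "field4"]
    print(tuple_diff((1, "a", 3, "x"), (1, "b", 3, "y"), schema))
    print(tuple_diff((1, 2), (1, 3)))  # no schema, so positions are used
```

Returning None for identical tuples mirrors the macro's final FILTER on tupleDiff IS NOT NULL, which is why the sketch does not emit an "unchanged" marker.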