Date: Sun, 18 Sep 2016 10:45:20 +0000 (UTC)
From: "Eyal Allweil (JIRA)"
To: dev@datafu.incubator.apache.org
Subject: [jira] [Commented] (DATAFU-119) New UDF - TupleDiff

    [ https://issues.apache.org/jira/browse/DATAFU-119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15500764#comment-15500764 ]

Eyal Allweil commented on DATAFU-119:
-------------------------------------

I've run it on results that were in the tens of millions.

I think the main reason for using it / including it in DataFu is that if you're developing Pig code and running it on a cluster (or on any given environment), being able to stay in the Pig ecosystem is convenient for fast development cycles. If your original job can run on the given environment, a comparison job can run there efficiently, too. And there's less copying, because you leave the previous results in HDFS under a different name and compare easily.

The output is human-readable, but the expected result is that most records return null, because they're identical, and the ones that do come out are usually edge cases that turned out different.

That's the reasoning behind having "something" like this UDF. The output type, and its lack of a schema, is a different story - it would be better to have a schema.
But I'm hesitant to spend the time on it if it isn't likely that someone else will want to write a different output format for it.

> New UDF - TupleDiff
> -------------------
>
>                 Key: DATAFU-119
>                 URL: https://issues.apache.org/jira/browse/DATAFU-119
>             Project: DataFu
>          Issue Type: New Feature
>            Reporter: Eyal Allweil
>            Assignee: Eyal Allweil
>
> A UDF that, given two tuples, prints out the differences between them in human-readable form. This is not meant for production - we use it at PayPal for regression tests, to compare the results of two runs. Differences are calculated based on position, but the tuples' schemas are used, if available, to display friendlier results. If no schema is available, the output uses field numbers.
> Use it when you want a more fine-grained description of what has changed than [org.apache.pig.builtin.DIFF|https://pig.apache.org/docs/r0.14.0/func.html#diff] gives you. Also, because DIFF takes as its input the two bags to be compared, they must fit in memory. This UDF only takes one pair of tuples at a time, so it can run on large inputs.
> We use a macro much like the following in conjunction with this UDF:
> {noformat}
> DEFINE diff_macro(diff_macro_old, diff_macro_new, diff_macro_pk, diff_macro_ignored_field) returns diffs {
>     DEFINE TupleDiff datafu.pig.util.TupleDiff;
>
>     old = FOREACH $diff_macro_old GENERATE $diff_macro_pk, TOTUPLE(*) AS original;
>     new = FOREACH $diff_macro_new GENERATE $diff_macro_pk, TOTUPLE(*) AS original;
>
>     join_data = JOIN new BY $diff_macro_pk full, old BY $diff_macro_pk;
>
>     join_data = FOREACH join_data GENERATE TupleDiff(old::original, new::original, '$diff_macro_ignored_field') AS tupleDiff, old::original, new::original;
>
>     $diffs = FILTER join_data BY tupleDiff IS NOT NULL ;
> };
> {noformat}
> Currently, the output from the macro looks like this (when comma-separated):
> {noformat}
> added,,
> missing,,
> changed field2 field4,,
> {noformat}
> The UDF takes a variable number of parameters - the two tuples to be compared, and any number of field names or numbers to be ignored. We use this to ignore fields representing execution or creation time (the macro I've given as an example assumes only one ignored field).
> The current implementation "drills down" into tuples, but not bags or maps - tuple boundaries are indicated with parentheses, like this:
> {noformat}
> changed outerEmbeddedTuple(innerEmbeddedTuple(fieldNameThatIsDifferent) innerEmbeddedTuple(anotherFieldThatIsDifferent))
> {noformat}
> I have a few final things left to do and then I'll put it up on reviewboard.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
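
The comparison described in the issue (position-based, schema names when available, field numbers otherwise, ignored fields, drilling into nested tuples) can be sketched roughly as follows. This is only an illustration in Python, not the DataFu UDF itself. The function name tuple_diff, the way a schema is represented (a list of names, with a (name, nested_schema) pair for an embedded tuple), the 0-based field numbers, and the choice to apply ignored fields at every nesting level are all assumptions made for the example.

```python
_ABSENT = object()  # marks a position that exists in only one of the two tuples


def _changed_fields(old, new, schema, ignored):
    """Compare two tuples position by position and return labels of differing fields."""
    changed = []
    for pos in range(max(len(old), len(new))):
        entry = schema[pos] if schema and pos < len(schema) else None
        if isinstance(entry, tuple):      # assumed form: (name, nested_schema)
            name, nested = entry
        else:                             # plain field name, or None if no schema
            name, nested = entry, None
        label = name if name is not None else str(pos)  # fall back to field number
        if label in ignored or pos in ignored:
            continue
        a = old[pos] if pos < len(old) else _ABSENT
        b = new[pos] if pos < len(new) else _ABSENT
        if isinstance(a, tuple) and isinstance(b, tuple):
            # Drill down into embedded tuples; parentheses mark the boundary.
            inner = _changed_fields(a, b, nested, ignored)
            if inner:
                changed.append(f"{label}({' '.join(inner)})")
        elif a != b:
            changed.append(label)
    return changed


def tuple_diff(old, new, schema=None, ignored=()):
    """Return None for identical tuples, else 'added', 'missing' or 'changed ...'."""
    if old is None and new is None:
        return None
    if old is None:
        return "added"      # record exists only in the new run
    if new is None:
        return "missing"    # record exists only in the old run
    changed = _changed_fields(old, new, schema, set(ignored))
    return "changed " + " ".join(changed) if changed else None


schema = ["id", "field2", "created", "field4"]
old = (1, "a", "2016-09-17", 10)
new = (1, "b", "2016-09-18", 11)
print(tuple_diff(old, new, schema, ignored=["created"]))  # changed field2 field4
```

Returning None for identical tuples mirrors the macro's final FILTER ... IS NOT NULL step, so only differing records survive. How the real UDF labels several differences inside the same embedded tuple may differ from this sketch.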