Mailing-List: contact dev-help@datafu.incubator.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@datafu.incubator.apache.org
Date: Fri, 16 Sep 2016 18:52:20 +0000 (UTC)
From: "Matthew Hayes (JIRA)" <jira@apache.org>
To: dev@datafu.incubator.apache.org
Message-ID: <JIRA.12981314.1466508500000.592000.1474051940815@Atlassian.JIRA>
In-Reply-To: <JIRA.12981314.1466508500000@Atlassian.JIRA>
References: <JIRA.12981314.1466508500000@Atlassian.JIRA> <JIRA.12981314.1466508500414@arcas>
Subject: [jira] [Commented] (DATAFU-119) New UDF - TupleDiff
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
archived-at: Fri, 16 Sep 2016 18:52:26 -0000


    [ https://issues.apache.org/jira/browse/DATAFU-119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15497089#comment-15497089 ] 

Matthew Hayes commented on DATAFU-119:
--------------------------------------

Hey Eyal, sorry for taking so long to review this.  Before I get too deep into the code, I wanted to discuss your comment about whether it's too specific for general use.  One thing I'm curious about is how big the data you're comparing is.  This UDF is focused on human readability, which to me implies that the data you're working with is small.  If that's the case, I wonder whether a UDF makes sense for this.

Did you consider a tool that pulls down the two (perhaps small) datasets and compares them locally?  Running this as a Hadoop job might be overkill.  If I wanted to diff a small dataset, I would write a general tool that pulls the data down and compares it locally, because I wouldn't want to wait for cluster availability for something so small.  This also gives you more options for output that can help with readability.  Thoughts?
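
To make the local-comparison idea concrete, here is a minimal Python sketch, not a proposal for DataFu itself. It assumes both datasets have already been pulled down and parsed into in-memory rows, and that one column acts as a primary key. The function name `local_diff`, the `key_index` parameter, and the meaning of the labels ("added" = key only in the new data, "missing" = key only in the old data) are all assumptions made for illustration.

```python
def local_diff(old_rows, new_rows, key_index=0):
    """Compare two small in-memory datasets, matching rows on one key column.

    Returns a list of (label, old_row, new_row) triples for rows that differ.
    Labels are an assumption of this sketch: "added" means the key exists
    only in new_rows, "missing" means it exists only in old_rows.
    """
    old_by_key = {row[key_index]: row for row in old_rows}
    new_by_key = {row[key_index]: row for row in new_rows}

    results = []
    # The union of keys plays the role of a full outer join.
    for key in sorted(old_by_key.keys() | new_by_key.keys()):
        old = old_by_key.get(key)
        new = new_by_key.get(key)
        if old is None:
            results.append(("added", None, new))
        elif new is None:
            results.append(("missing", old, None))
        elif old != new:
            # Report 0-based positions that differ, including any position
            # present in only one of the two rows.
            width = max(len(old), len(new))
            changed = [
                i for i in range(width)
                if i >= len(old) or i >= len(new) or old[i] != new[i]
            ]
            label = "changed " + " ".join(str(i) for i in changed)
            results.append((label, old, new))
    return results
```

Because everything is held in dictionaries, this only works when both datasets fit comfortably in memory, which is exactly the small-data case in question.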

> New UDF - TupleDiff
> -------------------
>
>                 Key: DATAFU-119
>                 URL: https://issues.apache.org/jira/browse/DATAFU-119
>             Project: DataFu
>          Issue Type: New Feature
>            Reporter: Eyal Allweil
>            Assignee: Eyal Allweil
>
> A UDF that, given two tuples, prints out the differences between them in human-readable form. This is not meant for production - we use it at PayPal for regression tests, to compare the results of two runs. Differences are calculated based on position, but the tuples' schemas are used, if available, to display friendlier results. If no schema is available, the output uses field numbers.
> It should be used when you want a more fine-grained description of what has changed, unlike [org.apache.pig.builtin.DIFF|https://pig.apache.org/docs/r0.14.0/func.html#diff]. Also, because DIFF takes as its input two bags to be compared, they must fit in memory. This UDF only takes one pair of tuples at a time, so it can run on large inputs.
> We use a macro much like the following in conjunction with this UDF:
> {noformat}
> DEFINE diff_macro(diff_macro_old, diff_macro_new, diff_macro_pk, diff_macro_ignored_field) returns diffs {
> 	DEFINE TupleDiff datafu.pig.util.TupleDiff;
> 		
> 	old = 	FOREACH $diff_macro_old GENERATE $diff_macro_pk, TOTUPLE(*) AS original;
> 	new = 	FOREACH $diff_macro_new GENERATE $diff_macro_pk, TOTUPLE(*) AS original;
> 	
> 	join_data = JOIN new BY $diff_macro_pk full, old BY $diff_macro_pk;
> 		
> 	join_data = FOREACH join_data GENERATE TupleDiff(old::original, new::original, '$diff_macro_ignored_field') AS tupleDiff, old::original, new::original;
> 		
> 	$diffs = FILTER join_data BY tupleDiff IS NOT NULL ;
> };
> {noformat}
> Currently, the output from the macro looks like this (when comma-separated):
> {noformat}
> added,<original tuple>,
> missing,,<new tuple>
> changed field2 field4,<original tuple>,<new tuple>
> {noformat}
> The UDF takes a variable number of parameters - the two tuples to be compared, and any number of field names or numbers to be ignored. We use this to ignore fields representing execution or creation time (the example macro above assumes only one ignored field).
> The current implementation "drills down" into tuples, but not bags or maps - tuple boundaries are indicated with parentheses, like this:
> {noformat}
> changed outerEmbeddedTuple(innerEmbeddedTuple(fieldNameThatIsDifferent) innerEmbeddedTuple(anotherFieldThatIsDifferent))
> {noformat}
> I have a few final things left to do and then I'll put it up on reviewboard.
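
The positional comparison described in the quoted issue can be illustrated with a short Python sketch. This is not the UDF's implementation (which is not shown in this thread); it only mirrors the described behaviour: compare by position, label fields by schema name when one is available, skip ignored fields, and drill down into nested tuples using parentheses. The names `diff_fields` and `tuple_diff`, the schema representation (a list of names, or `(name, sub_schema)` pairs for nested tuples), and the use of 0-based positions when no schema is given are all assumptions.

```python
_MISSING = object()  # marks a position that exists in only one tuple


def diff_fields(old, new, schema=None, ignored=()):
    """Return labels for the positions at which old and new differ.

    schema is a hypothetical representation: a list whose entries are either
    a field name or a (name, sub_schema) pair for a nested tuple. In this
    sketch, ignored names or positions are applied at every nesting level.
    """
    labels = []
    for i in range(max(len(old), len(new))):
        entry = schema[i] if schema and i < len(schema) else str(i)
        name, sub_schema = entry if isinstance(entry, tuple) else (entry, None)
        if name in ignored or i in ignored:
            continue
        a = old[i] if i < len(old) else _MISSING
        b = new[i] if i < len(new) else _MISSING
        if isinstance(a, tuple) and isinstance(b, tuple):
            # Drill down into nested tuples; bags and maps are not handled.
            inner = diff_fields(a, b, sub_schema, ignored)
            if inner:
                labels.append(f"{name}({' '.join(inner)})")
        elif a != b:
            labels.append(name)
    return labels


def tuple_diff(old, new, schema=None, ignored=()):
    """Return 'changed <fields>' or None when no difference is found."""
    labels = diff_fields(old, new, schema, ignored)
    return "changed " + " ".join(labels) if labels else None


schema = ["id", "name", ("address", ["city", "zip"]), "updated_at"]
old = (1, "ann", ("paris", "75001"), "2016-01-01")
new = (1, "anne", ("paris", "75002"), "2016-09-16")
print(tuple_diff(old, new, schema, ignored=("updated_at",)))
```

Each call looks at a single pair of tuples, so memory use does not grow with the size of the input relation, which is the property the description contrasts with DIFF.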

