datafu-dev mailing list archives

From "Eyal Allweil (JIRA)" <j...@apache.org>
Subject [jira] [Created] (DATAFU-129) New macro - dedup
Date Tue, 12 Sep 2017 12:50:01 GMT
Eyal Allweil created DATAFU-129:
-----------------------------------

             Summary: New macro - dedup
                 Key: DATAFU-129
                 URL: https://issues.apache.org/jira/browse/DATAFU-129
             Project: DataFu
          Issue Type: New Feature
            Reporter: Eyal Allweil
            Assignee: Eyal Allweil


A macro to dedup (de-duplicate) a table based on one or more keys and an ordering field
(typically a date-updated field), keeping the most recent record for each key.

One thing to consider: the implementation relies on the ExtremalTupleByNthField UDF in PiggyBank.
I've added it to the test dependencies so that the test can run. While I feel that anyone
using Pig typically has PiggyBank on the classpath, this might not always be true. Do we have an
alternative? (Maybe adding it to the jarjar?)

The macro's definition looks as follows:

DEFINE dedup(relation, row_key, order_field) returns out {

relation - the relation to dedup
row_key - the field(s) to group by
order_field - the field to order by (used to find the most recent record)
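
To make the intended semantics concrete, here is a minimal sketch in plain Python (not Pig, and not the macro's actual implementation). It assumes that "most recent" means the largest order_field value within each row_key group, and that ties keep the first row seen; the sample field names (id, name, updated) are hypothetical.

```python
def dedup(rows, row_key, order_field):
    """Keep one row per distinct row_key value: the one with the largest order_field.

    rows        - iterable of dicts (stands in for the Pig relation)
    row_key     - list of field names to group by
    order_field - field name used for ordering (assumed: largest value wins)
    """
    latest = {}
    for row in rows:
        key = tuple(row[k] for k in row_key)
        # Replace the stored row only if this one is strictly newer,
        # so ties keep the first row seen (an assumption of this sketch).
        if key not in latest or row[order_field] > latest[key][order_field]:
            latest[key] = row
    return list(latest.values())


# Hypothetical sample data: two versions of id 1, one version of id 2.
rows = [
    {"id": 1, "name": "a", "updated": "2017-09-01"},
    {"id": 1, "name": "b", "updated": "2017-09-10"},
    {"id": 2, "name": "c", "updated": "2017-09-05"},
]
result = dedup(rows, ["id"], "updated")
```

With this sample input, the sketch keeps the "b" row for id 1 (the later update) and the only row for id 2.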




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
