datafu-dev mailing list archives

From "Eyal Allweil (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (DATAFU-129) New macro - dedup
Date Thu, 11 Oct 2018 13:12:00 GMT

     [ https://issues.apache.org/jira/browse/DATAFU-129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Eyal Allweil updated DATAFU-129:
--------------------------------
    Attachment: DATAFU-129-2.patch

> New macro - dedup
> -----------------
>
>                 Key: DATAFU-129
>                 URL: https://issues.apache.org/jira/browse/DATAFU-129
>             Project: DataFu
>          Issue Type: New Feature
>            Reporter: Eyal Allweil
>            Assignee: Eyal Allweil
>            Priority: Major
>              Labels: macro
>         Attachments: DATAFU-129-2.patch, DATAFU-129-bad.patch, DATAFU-129.patch
>
>
> Macro used to dedup (de-duplicate) a table, based on a key or keys and an ordering
> field (typically a date-updated field).
> One thing to consider: the implementation relies on the ExtremalTupleByNthField UDF
> in PiggyBank. I've added it to the test dependencies so that the test can run. While I
> feel that anyone using Pig typically has PiggyBank on the classpath, this might not
> always be true. Do we have an alternative? (Maybe adding it to the jarjar?)
> The macro's definition looks as follows:
> DEFINE dedup(relation, row_key, order_field) returns out {
> relation - the relation to dedup
> row_key - the field(s) to group by
> order_field - the field to order by (used to find the most recent record)
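
The behaviour described above (group by row_key, keep the most recent record per group) can be illustrated with a minimal Python sketch. This is not the macro's Pig implementation, which the message does not include and which relies on ExtremalTupleByNthField. The function below only mirrors the three parameter names from the definition. It assumes that "most recent" means the largest order_field value and that ties keep the first row seen; both are assumptions, as are the sample field names id, name and updated.

```python
from typing import Any, Dict, List, Sequence, Tuple

Row = Dict[str, Any]


def dedup(relation: List[Row], row_key: Sequence[str], order_field: str) -> List[Row]:
    """Keep one row per row_key value: the one with the largest order_field.

    Illustrative only. Assumes "most recent" means the maximum order_field
    and that, on ties, the first row encountered is kept.
    """
    latest: Dict[Tuple[Any, ...], Row] = {}
    for row in relation:
        # Build the grouping key from one or more fields.
        key = tuple(row[field] for field in row_key)
        kept = latest.get(key)
        # Replace the kept row only when this one is strictly newer.
        if kept is None or row[order_field] > kept[order_field]:
            latest[key] = row
    return list(latest.values())


# Hypothetical sample data: two versions of id 1, one version of id 2.
rows = [
    {"id": 1, "name": "a", "updated": "2018-10-01"},
    {"id": 1, "name": "b", "updated": "2018-10-11"},
    {"id": 2, "name": "c", "updated": "2018-09-30"},
]
deduped = dedup(rows, ["id"], "updated")
```

Here deduped holds one row per id, and the row kept for id 1 is the one updated on 2018-10-11. ISO-formatted date strings compare correctly as plain strings, which is why no date parsing is needed in this sketch.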



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
