datafu-dev mailing list archives

From "Matthew Hayes (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (DATAFU-129) New macro - dedup
Date Thu, 11 Oct 2018 01:01:00 GMT

    [ https://issues.apache.org/jira/browse/DATAFU-129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16645791#comment-16645791 ]

Matthew Hayes commented on DATAFU-129:
--------------------------------------

Okay, I see.  My suggestion is to just copy and paste ExtremalTupleByNthField, as you attempted
and verified as working.  Let's just make sure that the namespace is {{datafu.org.apache.pig.piggybank.ExtremalTupleByNthField}}
or something similar.  This achieves the result we want.  The only downside is that we aren't
referencing the JAR, which we can look into later.  How does this sound?  This is basically
a new option #4 :)

> New macro - dedup
> -----------------
>
>                 Key: DATAFU-129
>                 URL: https://issues.apache.org/jira/browse/DATAFU-129
>             Project: DataFu
>          Issue Type: New Feature
>            Reporter: Eyal Allweil
>            Assignee: Eyal Allweil
>            Priority: Major
>              Labels: macro
>         Attachments: DATAFU-129-bad.patch, DATAFU-129.patch
>
>
> Macro used to dedup (de-duplicate) a table, based on a key or keys and an ordering (typically
> a date updated field).
> One thing to consider - the implementation relies on the ExtremalTupleByNthField UDF
> in PiggyBank. I've added it to the test dependencies in order for the test to run. While I
> feel that anyone using Pig typically has PiggyBank in the classpath, this might not be true
> - do we have an alternative? (maybe adding it to the jarjar?)
> The macro's definition looks as follows:
> DEFINE dedup(relation, row_key, order_field) returns out {
> relation - relation to dedup
> row_key - field(s) for group by
> order_field - the field for ordering (to find the most recent record)
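
The behaviour described in the quoted issue (for each key, keep the record with the greatest ordering field) can be sketched outside Pig. What follows is a minimal Python illustration of those semantics only, not the macro's implementation. The function body, the sample field names and the tie-breaking rule (first record seen wins) are all assumptions made for the example.

```python
def dedup(rows, row_key, order_field):
    """Keep, for each distinct key, the row with the greatest order_field.

    rows        - iterable of dicts (stands in for the relation)
    row_key     - one field name or a sequence of field names (the group-by key)
    order_field - field name used to pick the most recent record
    """
    keys = (row_key,) if isinstance(row_key, str) else tuple(row_key)
    latest = {}
    for row in rows:
        k = tuple(row[name] for name in keys)
        kept = latest.get(k)
        # Assumption: on ties, the first record seen is kept.
        if kept is None or row[order_field] > kept[order_field]:
            latest[k] = row
    return list(latest.values())


# Hypothetical sample data: two versions of id 1, one version of id 2.
rows = [
    {"id": 1, "name": "a", "updated": "2018-10-01"},
    {"id": 1, "name": "b", "updated": "2018-10-09"},
    {"id": 2, "name": "c", "updated": "2018-10-03"},
]
result = dedup(rows, "id", "updated")
```

Here `result` holds one row per `id`, with the later `updated` value kept for `id` 1. Passing a list such as `["id", "name"]` as `row_key` groups by a composite key instead.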



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
