datafu-dev mailing list archives

From "Matthew Hayes" <matthew.terence.ha...@gmail.com>
Subject Re: Review Request 25564: DATAFU-69: Create ChooseFieldByValue UDF - which, given a field whose value contains a field name, and *, returns the value of the field referenced by the field name
Date Mon, 29 Sep 2014 00:56:57 GMT

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/25564/#review54788
-----------------------------------------------------------



datafu-pig/src/main/java/datafu/pig/util/SelectFieldByName.java
<https://reviews.apache.org/r/25564/#comment95058>

    Hmm, something just occurred to me.  This UDF does not currently provide an output
    schema, which is one problem.  But how do we determine the output schema?  If the output
    value is chosen dynamically, then its type can vary.  One way to address this is to
    require that all the other values of the tuple are of the same type.  Then you just take
    the schema from the first value.  In your example they are all chararray.  But this does
    limit the uses of this UDF.
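
    To make the "all values share one type" rule concrete, here is a rough sketch in plain
    Java.  It deliberately does not use the Pig API: the class UniformSchemaSketch, the
    method commonType, and the map of field name to type name are all hypothetical
    stand-ins for whatever the UDF's schema hook would actually receive.

```java
import java.util.Map;
import java.util.Optional;

// Hypothetical model only: a "schema" here is just field name -> type name.
// This is NOT the Pig API; it illustrates the same-type rule and nothing more.
final class UniformSchemaSketch {

    /**
     * Returns the single type shared by every candidate field, or empty when
     * the fields disagree (or there are none), in which case no fixed output
     * type can be declared up front.
     */
    static Optional<String> commonType(Map<String, String> candidateFields) {
        String first = null;
        for (String type : candidateFields.values()) {
            if (first == null) {
                first = type;              // remember the first field's type
            } else if (!first.equals(type)) {
                return Optional.empty();   // mixed types: output type would vary
            }
        }
        return Optional.ofNullable(first);
    }
}
```

    With the fields from the example (all chararray) this would yield chararray; adding an
    int field would yield nothing, which is where the UDF would have to reject the input
    or fall back to some other strategy.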


- Matthew Hayes


On Sept. 29, 2014, 12:20 a.m., Russell Jurney wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/25564/
> -----------------------------------------------------------
> 
> (Updated Sept. 29, 2014, 12:20 a.m.)
> 
> 
> Review request for DataFu, Jonathan Coveney, Jakob Homan, Matthew Hayes, and Sam Shah.
> 
> 
> Repository: datafu
> 
> 
> Description
> -------
> 
> Example use:
> group_fields = LOAD '/e8/smalldata/group_fields.txt' AS (groupField:chararray); 
> with_group = CROSS group_fields, hour_rounded;
> with_group = FOREACH with_group GENERATE group_fields::groupField AS groupField, 
> hour_rounded::sourceNameOrIp AS sourceNameOrIp,
> hour_rounded::destinationNameOrIp AS destinationNameOrIp,
> ...;
> with_value_substitution = FOREACH with_group GENERATE ChooseFieldByValue(groupField, *) AS groupValue:tuple(value:chararray), *;
> with_value_substitution = FOREACH with_value_substitution GENERATE 
> FLATTEN(groupValue) AS groupValue:chararray,
> groupField,
> foo,
> bar,
> ...;
> all_success = FOREACH (GROUP with_value_substitution BY (groupField, groupValue, day)) GENERATE
> FLATTEN(group) AS (seriesType, groupValue, day),
> (int)COUNT_STAR(with_value_substitution) AS connections:int;
> 
> 
> Diffs
> -----
> 
>   datafu-pig/src/main/java/datafu/pig/util/SelectFieldByName.java PRE-CREATION 
>   datafu-pig/src/test/java/datafu/test/pig/util/SelectFieldByNameTest.java PRE-CREATION
> 
> Diff: https://reviews.apache.org/r/25564/diff/
> 
> 
> Testing
> -------
> 
> This UDF was used to replace a very inefficient Pig script in which macros that did many individual GROUP BYs took many minutes to plan.
> 
> Testing: unit tests and used on real data on a cluster.
> 
> 
> Thanks,
> 
> Russell Jurney
> 
>

