datafu-dev mailing list archives

From "Matthew Hayes" <matthew.terence.ha...@gmail.com>
Subject Re: Review Request 25564: DATAFU-69: Create SelectFieldByName UDF - which, given a field whose value contains a field name, and *, returns the value of the field referenced by the field name
Date Fri, 03 Oct 2014 07:30:14 GMT


> On Sept. 29, 2014, 12:56 a.m., Matthew Hayes wrote:
> > datafu-pig/src/main/java/datafu/pig/util/SelectFieldByName.java, line 49
> > <https://reviews.apache.org/r/25564/diff/2/?file=707974#file707974line49>
> >
> >     Hmm, something just occurred to me.  This does not currently provide the output
> >     schema.  So this is one problem.  But how do we determine the output schema?  If the
> >     output value is decided dynamically, then it can vary.  One way to address this is to
> >     require that all the other values of the tuple are of the same type.  Then you just
> >     take the schema from the first value.  In your example they are all chararray.  But
> >     this does limit the uses of this UDF.
> 
> Russell Jurney wrote:
>     In practice, this is not an issue. The UDF is used this way, and you can cast it
>     to what you want.
>     
>     with_value_substitution = FOREACH with_group GENERATE 
>         FLATTEN(ChooseFieldByValue(groupField, *)) AS groupValue:chararray,
>         *, 
>         (int)$period AS periodSeconds:int;
>     
>     However, I don't see why I can't detect the schema of the field selected and return
>     that?
> 
> Matthew Hayes wrote:
>     The schema can't be dynamic like that.  I'll have to think about this some more.
>     I don't like that we have to cast it like this.  One way we can make this better is to
>     have the UDF pick the schema that is the best fit for the types provided.  For example,
>     if all the fields are of the same type, like chararray, then the resulting type is
>     chararray.  Otherwise make the type bytearray and you can cast however you want.  I'd
>     like to hear what other people think about this.  How about emailing datafu dev?
> 
> Russell Jurney wrote:
>     I will bring it up on the list, but I don't think returning a tuple is weird at all.
>     It is highly convenient, and 'just works.'

I'm not saying that returning a tuple is weird.  What is weird to me is not defining the schema
of the tuple being returned by the UDF.
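
For illustration, the best-fit rule proposed earlier in this thread (use the common type when
all candidate fields share one, otherwise fall back to bytearray) could be sketched roughly as
below.  This is a standalone sketch and not the code under review: the class and method names
are hypothetical, and type names are plain strings standing in for Pig's own type constants.

```java
import java.util.List;

// Hypothetical helper, not part of DataFu or Pig.
// Type names are plain strings standing in for Pig's type constants.
final class BestFitSchema {
    static final String BYTEARRAY = "bytearray";

    // Returns the shared type if every candidate field has the same type;
    // otherwise returns bytearray so the caller can cast the result.
    static String bestFit(List<String> candidateTypes) {
        if (candidateTypes.isEmpty()) {
            // Nothing to infer from: fall back to the untyped default.
            return BYTEARRAY;
        }
        String first = candidateTypes.get(0);
        for (String type : candidateTypes) {
            if (!type.equals(first)) {
                return BYTEARRAY;
            }
        }
        return first;
    }
}
```

With this rule, a tuple whose candidate fields are all chararray would get a chararray result,
while a mix such as chararray and int would get bytearray and would still need an explicit cast
in the script.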


- Matthew


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/25564/#review54788
-----------------------------------------------------------


On Oct. 2, 2014, 4:19 p.m., Russell Jurney wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/25564/
> -----------------------------------------------------------
> 
> (Updated Oct. 2, 2014, 4:19 p.m.)
> 
> 
> Review request for DataFu, Jonathan Coveney, Jakob Homan, Matthew Hayes, and Sam Shah.
> 
> 
> Repository: datafu
> 
> 
> Description
> -------
> 
> Example use:
> group_fields = LOAD '/e8/smalldata/group_fields.txt' AS (groupField:chararray); 
> with_group = CROSS group_fields, hour_rounded;
> with_group = FOREACH with_group GENERATE group_fields::groupField AS groupField, 
> hour_rounded::sourceNameOrIp AS sourceNameOrIp,
> hour_rounded::destinationNameOrIp AS destinationNameOrIp,
> ...;
> with_value_substitution = FOREACH with_group GENERATE ChooseFieldByValue(groupField, *) AS groupValue:tuple(value:chararray), *;
> with_value_substitution = FOREACH with_value_substitution GENERATE 
> FLATTEN(groupValue) AS groupValue:chararray,
> groupField,
> foo,
> bar,
> ...;
> all_success = FOREACH (GROUP with_value_substitution BY (groupField, groupValue, day)) GENERATE
> FLATTEN(group) AS (seriesType, groupValue, day),
> (int)COUNT_STAR(with_value_substitution) AS connections:int;
> 
> 
> Diffs
> -----
> 
>   datafu-pig/src/main/java/datafu/pig/util/SelectFieldByName.java PRE-CREATION 
>   datafu-pig/src/test/java/datafu/test/pig/util/SelectFieldByNameTest.java PRE-CREATION

> 
> Diff: https://reviews.apache.org/r/25564/diff/
> 
> 
> Testing
> -------
> 
> This UDF was used to replace a very inefficient pig script where macros that did many
> individual GROUP BY's took many minutes to plan.
> 
> Testing: unit tests and used on real data on a cluster.
> 
> 
> Thanks,
> 
> Russell Jurney
> 
>

