pig-dev mailing list archives

From Alan Gates <ga...@yahoo-inc.com>
Subject Re: Semantic cleanup: How to adding two bytearray
Date Fri, 14 Jan 2011 22:00:15 GMT
I think the big win of static typing is that from examining the script  
alone you can know the output:

A = load 'bla' using BinStorage();
B = foreach A generate $0 + $1;

With static typing $0 and $1 will both be viewed as bytearrays and  
thus will be cast to doubles, regardless of how BinStorage actually  
instantiated them.  With dynamic types we cannot know the answers  
without knowing the data that is fed through.
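
To make the difference concrete, here is a rough Python sketch. It is not Pig code: add_static and add_dynamic are made-up names, Python's float stands in for Pig's double, and the two rows are the ones from Dmitriy's 1.txt example quoted below.

```python
# Toy model of the two proposals; not Pig's implementation.
# Python float stands in for Pig's double.

def add_static(x, y):
    """Static model: untyped (bytearray) operands are always cast to double."""
    return float(x) + float(y)


def add_dynamic(x, y):
    """Dynamic model: the result type follows whatever the loader produced."""
    return x + y


# One row of ints and one row of doubles, as a loader without a schema might feed.
rows = [(1, 2), (1.1, 2.2)]

# Collect the result types each model produces across the rows.
static_types = {type(add_static(x, y)).__name__ for x, y in rows}
dynamic_types = {type(add_dynamic(x, y)).__name__ for x, y in rows}

print(static_types)
print(dynamic_types)
```

In the static model the result type is the same for every row, so it can be read off the script. In the dynamic model it depends on the row.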

The downside of the static typing case is that we explicitly allow  
unknown types in maps:

A = load 'bla' using AvroStorage(); -- assume bla has a schema of m:map
                                    -- and that m has two keys, k1 and k2,
                                    -- both with integer values
B = foreach A generate m#'k1' + m#'k2';

Using static types, B.$0 will be a double, even though the underlying  
types are ints.  Users will not see that as intuitive even though the  
semantic is clear.  In the dynamic model proposed by Daniel, B.$0 will  
be an int.

We are mitigating this case by allowing typed maps (where the value
type of the map is declarable) in 0.9.  But maps with heterogeneous
value types will still suffer from this issue.
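
As a rough illustration of how a declared value type changes the static result, here is a small Python sketch. It is a toy model, not Pig code: add_map_values and its value_type parameter are hypothetical, and Python's float stands in for double.

```python
# Toy model of static typing over map values; not Pig's implementation.

def add_map_values(m, k1, k2, value_type=None):
    """Cast both values to the declared value type, or to double if none is declared."""
    cast = value_type if value_type is not None else float
    return cast(m[k1]) + cast(m[k2])


m = {"k1": 1, "k2": 2}

untyped = add_map_values(m, "k1", "k2")     # no declared value type: double
typed = add_map_values(m, "k1", "k2", int)  # value type declared as int: int

print(type(untyped).__name__, untyped)
print(type(typed).__name__, typed)
```

A map whose values are a mix of types has no single value type to declare, so it stays on the untyped path.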

I vote for static types for several reasons:

1) I like being able to know the output of the script by examining the  
script alone.  It provides a clear semantic that we can explain to  
users.
2) It's less of a maintenance cost, as the need to deal with dynamic
type discovery is confined to the cast operator.  If we go all out on
dynamic types, every expression operator has to be able to manage
dynamism for byte arrays.
3) In my experience almost all maps are string->string, so once we
allow typed maps I suspect people will start using them heavily.
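
Point 2 can be sketched in Python as follows. This is only an illustration, not Pig's operator code: all function names are invented, and the way resolve parses raw bytes is an assumption made for the example.

```python
# Toy comparison of where runtime type discovery lives; not Pig's operators.

def cast_to_double(value):
    """Static model: the cast is the only place that inspects the runtime type."""
    if isinstance(value, (bytes, bytearray)):
        return float(value.decode("utf-8"))
    return float(value)


def static_add(x, y):
    """Operands were already cast, so the operator needs no type checks."""
    return x + y


def static_multiply(x, y):
    return x * y


def resolve(value):
    """Dynamic model: discover the real type of a possibly-unknown value."""
    if isinstance(value, (bytes, bytearray)):
        text = value.decode("utf-8")
        try:
            return int(text)
        except ValueError:
            return float(text)
    return value


def dynamic_add(x, y):
    """Every operator has to repeat the discovery step on its operands."""
    return resolve(x) + resolve(y)


def dynamic_multiply(x, y):
    return resolve(x) * resolve(y)


static_result = static_add(cast_to_double(b"1"), cast_to_double(2))
dynamic_result = dynamic_add(b"1", 2)
print(static_result, dynamic_result)
```

In the static half only cast_to_double looks at the runtime type. In the dynamic half each operator has to call resolve itself.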

I'm not sure there's a performance gain either way, since in both  
cases we have to manage the case where we think something is a  
bytearray and it turns out to be something else.

Alan.


On Jan 14, 2011, at 1:34 PM, Dmitriy Ryaboy wrote:

> Agreed with what Scott said about procedurally building schemas, and what
> Olga said about static typing.
>
> Daniel, I am not sure what you mean about run-time typing on a row by row
> basis.  Certainly winding up with columns that are sometimes doubles,
> sometimes floats, and sometimes ints can only lead to unexpected bugs?
>
> I know Yahoo went through a lot of pain with the LoadStore rework in 0.7
> (heck I am still dealing with it), but seems like breaking compatibility in
> a minor way in order to clean up semantics is ok given that we had a
> "stable" version in between. I don't think conversion would be too onerous,
> especially if declaring schemas is simplified.
>
> We can just say that odd versions can break apis and even can't :).
>
> D
>
> On Fri, Jan 14, 2011 at 12:27 PM, Scott Carey <scott@richrelevance.com> wrote:
>
>>
>>
>> On 1/13/11 10:54 PM, "Dmitriy Ryaboy" <dvryaboy@gmail.com> wrote:
>>
>>> How is runtime detection done? I worry that if 1.txt contains:
>>> 1, 2
>>> 1.1, 2.2
>>>
>>> We get into a situation where addition of the fields in the first tuple
>>> produces integers, and adding the fields of the second tuple produces
>>> doubles.
>>>
>>> A more invasive but perhaps easier to reason about solution might be to be
>>> stricter about types, and require bytearrays to be cast to whatever type
>>> they are supposed to be if you want to add / delete / do non-byte-things
>>> to them.
>>>
>>> This is a problem if UDFs that output tuples or bags don't specify schemas
>>> (and specifying schemas of tuples and bags is fairly onerous right now in
>>> Java). I am not sure what the solution here is, other than finding a clean,
>>> less onerous way of declaring schemas, fixing up everything in builtin and
>>> piggybank to only use the new clean sparkly api and documenting the heck
>>> out of it.
>>
>> A longer term approach would likely strive to make schema specification of
>> inputs and outputs for UDFs declarative and restrict the scope of the
>> unknown.  Building schema data structures procedurally is NotFun(tm).
>> All languages could support a string based schema representation, and many
>> could use more type-safe declarations like Java annotations.  I think
>> there is a long-term opportunity to make Pig's type system easier to work
>> with and higher performance but it's no small project.  Pig certainly isn't
>> alone with these sorts of issues.
>>
>>>
>>> D
>>>
>>> On Thu, Jan 13, 2011 at 8:58 PM, Daniel Dai <jianyong@yahoo-inc.com> wrote:
>>>
>>>> One goal of the ongoing semantic cleanup work is to clarify the usage of
>>>> the unknown type.
>>>>
>>>> In the Pig schema system, the user can define an output schema for a
>>>> LoadFunc/EvalFunc. Pig will propagate that schema through the entire
>>>> script. Defining a schema for a LoadFunc/EvalFunc is optional. If the user
>>>> doesn't define a schema, Pig will mark the fields as bytearray. However,
>>>> at run time, the user can feed any data type in. Previously, Pig assumed
>>>> the runtime type for bytearray was DataByteArray, which gave rise to
>>>> several issues (PIG-1277, PIG-999, PIG-1016).
>>>>
>>>> In 0.9, Pig will treat bytearray as an unknown type. Pig will inspect the
>>>> object to figure out what the real type is at runtime. We've done that for
>>>> all shuffle keys (PIG-1277). However, there are other cases. One case is
>>>> adding two bytearrays. For example,
>>>>
>>>> a = load '1.txt' using SomeLoader() as (a0, a1);  -- Assume SomeLoader does
>>>>                                                   -- not define a schema, but
>>>>                                                   -- actually feeds Integers
>>>> b = foreach a generate a0+a1;
>>>>
>>>> In Pig 0.8, the schema system marks a0 and a1 as bytearray. In the case of
>>>> a0+a1, Pig casts both a0 and a1 to double (in TypeCheckingVisitor), and
>>>> marks the output schema for a0+a1 as double. Here is something interesting:
>>>> SomeLoader loads Integer, and we get Double after adding. We can change
>>>> that if we do the following:
>>>> 1. Don't cast bytearray to Double (in TypeCheckingVisitor)
>>>> 2. Change POAdd (and similarly all other ExpressionOperators: multiply,
>>>> divide, etc.) to handle bytearray. When the schema for POAdd is bytearray,
>>>> Pig will figure out the data type at runtime, and perform the addition
>>>> according to the real type
>>>>
>>>> Pro:
>>>> 1. Consistent with the goal for unknown type cleanup: treat all bytearrays
>>>> as unknown type. At runtime, inspect the object to find the real type
>>>>
>>>> Cons:
>>>> 1. Slows down processing, since we need to inspect the object type at
>>>> runtime
>>>> 2. Brings some indeterminism to the schema system. Previously a0+a1 was
>>>> double, so the downstream schema was clearer.
>>>>
>>>> Any comments?
>>>>
>>>> Daniel
>>>>
>>
>>

