pig-dev mailing list archives

From Daniel Dai <jiany...@yahoo-inc.com>
Subject Semantic cleanup: How to adding two bytearray
Date Fri, 14 Jan 2011 04:58:46 GMT
One goal of the ongoing semantic cleanup work is to clarify how the 
unknown type is used.

In Pig's schema system, a user can define the output schema for a 
LoadFunc/EvalFunc, and Pig propagates that schema through the entire 
script. Defining a schema for a LoadFunc/EvalFunc is optional. If the 
user doesn't define one, Pig marks the fields as bytearray. At runtime, 
however, the user can feed in any data type. Until now, Pig has assumed 
that the runtime type for bytearray is DataByteArray, which has caused 
several issues (PIG-1277, PIG-999, PIG-1016).

In 0.9, Pig will treat bytearray as an unknown type and inspect the 
object at runtime to figure out what its real type is. We've done that 
for all shuffle keys (PIG-1277). However, there are other cases. One 
case is adding two bytearrays. For example:

-- Assume SomeLoader does not define a schema, but actually feeds Integers
a = load '1.txt' using SomeLoader() as (a0, a1);
b = foreach a generate a0+a1;

In Pig 0.8, the schema system marks a0 and a1 as bytearray. For a0+a1, 
Pig casts both a0 and a1 to double (in TypeCheckingVisitor) and marks 
the output schema for a0+a1 as double. The interesting result is that 
SomeLoader loads Integers, yet we get a Double after adding. We can 
change this if we do the following:
1. Don't cast bytearray to double (in TypeCheckingVisitor).
2. Change POAdd (and similarly all other ExpressionOperators: multiply, 
divide, etc.) to handle bytearray. When the schema for POAdd is 
bytearray, Pig will figure out the data type at runtime and perform the 
addition according to the real type.
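
To make the difference concrete, here is a minimal Java sketch of the two 
behaviors. It is not Pig's actual POAdd or TypeCheckingVisitor code: the 
class and method names (RuntimeAddSketch, addAsDouble, addByRuntimeType) 
are hypothetical, and returning null for a null operand and rejecting 
mismatched operand types are assumptions made only for illustration.

```java
// Illustrative sketch only; not the actual POAdd or TypeCheckingVisitor code.
final class RuntimeAddSketch {

    // 0.8-style behavior: both operands are cast to double before adding,
    // so the result is always a Double regardless of the input type.
    static Object addAsDouble(Object left, Object right) {
        if (left == null || right == null) {
            return null; // assumption: a null operand yields a null result
        }
        double sum = ((Number) left).doubleValue() + ((Number) right).doubleValue();
        return Double.valueOf(sum);
    }

    // Proposed behavior: inspect the runtime type and add accordingly,
    // so Integer + Integer stays an Integer.
    static Object addByRuntimeType(Object left, Object right) {
        if (left == null || right == null) {
            return null; // assumption: a null operand yields a null result
        }
        if (left instanceof Integer && right instanceof Integer) {
            return Integer.valueOf((Integer) left + (Integer) right);
        }
        if (left instanceof Long && right instanceof Long) {
            return Long.valueOf((Long) left + (Long) right);
        }
        if (left instanceof Float && right instanceof Float) {
            return Float.valueOf((Float) left + (Float) right);
        }
        if (left instanceof Double && right instanceof Double) {
            return Double.valueOf((Double) left + (Double) right);
        }
        // assumption: mixed or non-numeric operands are rejected in this sketch
        throw new IllegalArgumentException(
            "Cannot add " + left.getClass().getName()
                + " and " + right.getClass().getName());
    }

    public static void main(String[] args) {
        Object a0 = Integer.valueOf(1);
        Object a1 = Integer.valueOf(2);
        System.out.println(addAsDouble(a0, a1));      // prints 3.0 (a Double)
        System.out.println(addByRuntimeType(a0, a1)); // prints 3 (an Integer)
    }
}
```

The chain of instanceof checks in addByRuntimeType runs for every record, 
which is the kind of per-record overhead a runtime type inspection adds.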

Pros:
1. Consistent with the goal of the unknown type cleanup: treat every 
bytearray as an unknown type and inspect the object at runtime to find 
the real type.

Cons:
1. Slows down processing, since we need to inspect the object type at 
runtime.
2. Brings some indeterminism to the schema system. Today a0+a1 is 
double, so the downstream schema is clearer.

Any comments?

