hadoop-pig-dev mailing list archives

From "pi song" <pi.so...@gmail.com>
Subject Re: Implicit casting on bag operators
Date Wed, 14 May 2008 14:20:38 GMT
Alan,

On second thought, a union of two incompatible data streams can leave
downstream operators in an undefined state, producing a mix of good output
and garbage. This seems to break the principle of least surprise. What do
you think?

Pi

On Wed, May 14, 2008 at 9:06 AM, pi song <pi.songs@gmail.com> wrote:

> Ok, will follow that.
>
>
> On 5/14/08, Alan Gates <gates@yahoo-inc.com> wrote:
>>
>> I agree that option 3 is the correct course.
>>
>> One note, you say:
>>
>> In case that schemas from all the input ports are not compatible, no
>> problem because we won't process it.
>>
>> How do you mean "won't process it"?  We still have to allow a union
>> operation between two non-compatible inputs (otherwise we can only use union
>> when we have schemas).  But the resulting union will not have a schema
>> (since the output no longer has a consistent schema).
>>
>> Alan.
>>
>>
>> pi song wrote:
>>
>>> Union is an example of bag (relational) operators that can have more than
>>> one input.
>>>
>>> In case that schemas from all the input ports are the same, no problem.
>>> In case that schemas from all the input ports are not compatible, no
>>> problem because we won't process it.
>>> In case that schemas from all the input ports are not the same, but
>>> compatible, here comes a problem.
>>>
>>> Example:
>>>
>>> C = UNION A,B ;
>>>
>>> Schema(A) = < Int, Chararray >
>>> Schema(B) = < Double, Chararray >
>>>
>>> The output schema will get resolved to < Double, Chararray >. Here is the
>>> problem. The Union operator at the moment doesn't support casting in any
>>> layer. In this case if we don't cast it, the binary data of Int will get
>>> picked up as Double by the downstream operator!! There are a few
>>> possible solutions:
>>>
>>> 1) Implement LOUnion and POUnion to support type casting internally
>>> 2) Add casting support in the LOUnion operator and let the
>>> LogicalToPhysical compiler generate an LOForEach for it.
>>> 3) Explicitly insert an LOForEach to do the necessary casting between
>>> the Union and the problematic input. This is analogous to the way we
>>> implement implicit casting for expression operators.
>>> 4) Don't support "not same but compatible" case at all.
>>>
>>> I will do (3) because it makes the most sense to me and has the least
>>> impact on other modules. Does anyone have a problem with it?
>>>
>>> Pi
>>>
>>>
>>>
>>
>
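
The behaviour discussed in this thread can be sketched in a few lines of
Python. This is only an illustration under stated assumptions, not Pig's
implementation: the widening order, the helper names (merge_types,
merge_schemas, cast_step, union, reinterpret_int_as_double) and the
list-of-type-names schema representation are all hypothetical. It shows
(a) why reading an integer's raw bytes as a double gives garbage,
(b) option 3, where a separate cast step is placed between the union and
any input whose schema differs from the merged one, and (c) Alan's point
that a union of incompatible inputs still runs but yields no schema.

```python
import struct

# Assumed numeric widening order; Pig's real compatibility rules may differ.
WIDENING = ["int", "long", "float", "double"]
# Assumed Python stand-ins for casting to each type name.
CASTS = {"int": int, "long": int, "float": float, "double": float,
         "chararray": str}


def reinterpret_int_as_double(n):
    """Read the 8 raw bytes of an integer as if they encoded a double."""
    return struct.unpack("<d", struct.pack("<q", n))[0]


def merge_types(a, b):
    """Return the common type of a and b, or None if they are incompatible."""
    if a == b:
        return a
    if a in WIDENING and b in WIDENING:
        return max(a, b, key=WIDENING.index)
    return None


def merge_schemas(left, right):
    """Merge two schemas column by column; None means 'no schema'."""
    if left is None or right is None or len(left) != len(right):
        return None
    merged = [merge_types(a, b) for a, b in zip(left, right)]
    return None if None in merged else merged


def cast_step(rows, schema, target):
    """Stand-in for the inserted foreach: cast only the columns that differ."""
    for row in rows:
        yield tuple(CASTS[t](v) if s != t else v
                    for v, s, t in zip(row, schema, target))


def union(inputs):
    """inputs is a list of (rows, schema) pairs; returns (rows, schema)."""
    target = inputs[0][1]
    for _, schema in inputs[1:]:
        target = merge_schemas(target, schema)
    out = []
    for rows, schema in inputs:
        if target is None or schema == target:
            out.extend(rows)  # pass through unchanged
        else:
            out.extend(cast_step(rows, schema, target))  # cast before union
    return out, target


A = ([(1, "a")], ["int", "chararray"])
B = ([(2.5, "b")], ["double", "chararray"])
rows, schema = union([A, B])
print(schema, rows)
print(reinterpret_int_as_double(42))  # a tiny denormal value, not 42.0
```

Keeping the cast in its own step means the union itself never has to know
about types, which matches the "least impact on other modules" argument
made for option 3.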
