hadoop-pig-dev mailing list archives

From Mridul Muralidharan <mrid...@yahoo-inc.com>
Subject Re: Pig and missing metadata
Date Thu, 02 Oct 2008 18:29:16 GMT
Alan Gates wrote:
> Mridul Muralidharan wrote:
>> Alan Gates wrote:
>>> Case 3 is not yet supported, and supporting it will require some 
>>> changes to pig's backend implementation.  Specifically it will need 
>>> to be able to handle the case where pig guessed that a datum was of 
>>> one type, but it turns out to be another.  To use the example above, 
>>> if MyLoader actually loaded $0 as a double, then pig needs to adapt 
>>> to this.
>> union is quite common actually - so some way to handle it would be 
>> quite useful.
> We certainly plan to support union fully.

Great! Thanks for clarifying ... this is something we use extensively 
(along with cross).

>>> 2) Pigs fly.  We want to choose performance over generality.  In the 
>>> example above, it is safer to always convert $0 to double, because as 
>>> long as $0 is some kind of number you can do the conversion.  If $0 
>>> really is a double and pig treats it as an int it will be truncating 
>>> it.  But treating it as an int is 10 times faster than treating it as 
>>> a double.  And the user can specify it as "$0 + 1.0" if they really 
>>> want the double.
>> I disagree - it is better to treat it as a double, and warn user about 
>> the performance implications - than to treat it as an int and generate 
>> incorrect results.
>> Correctness is more important than performance.
> This is not a correctness issue.  When we are guessing the type, we will 
> always be wrong sometimes.  If we say $0 + 1 implies an int, and $0 has 
> double data then we'll return 3 when the user wanted 3.14.  If we say $0 
> + 1 is a double and $0 has int data, then we'll return 42.0 when the 
> user wanted 42.  42.0 is closer to 42 than 3 is to 3.14, but if the user 
> has given us all int data and added an integer to it, and we output 
> double data, that's still not what the user wanted.
> Given that we will always be wrong sometimes, the question is when do we 
> want to be wrong.  In this case I advocate in favor of ints for 2 reasons:
> 1) Performance, as noted above.  Integer computations are about 10x 
> faster than double computations.
> 2) Frequency of use.  In my experience integral numbers are far more 
> common in databases than floating points (obviously this depends on the 
> data you're processing).
> So 90% of the time we'll produce what the user wants and run 10x faster 
> given this assumption, and the other 10% we'll produce a number that 
> isn't exactly what the user wanted.  If the user wants the double, he 
> can explicitly cast $0 or add 1.0 (instead of 1) to it.
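The trade-off Alan describes can be seen in a small standalone sketch (hypothetical variable names; this is not Pig's backend code): interpreting the same untyped field as an int truncates the fractional part, while interpreting it as a double preserves it.

```java
public class TypeGuess {
    // Evaluate "$0 + 1" under the two possible guesses for $0's type.
    // rawField stands in for a value Pig loaded with no declared schema.
    public static void main(String[] args) {
        double rawField = 2.14;           // the data really is a double

        int asInt = (int) rawField + 1;   // guess int: truncates to 2, then + 1
        double asDouble = rawField + 1;   // guess double: fraction preserved

        System.out.println(asInt);        // prints 3
        System.out.println(asDouble);     // prints 3.14
    }
}
```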

A simple snippet which handles output for both integers and doubles is 
given below. Double.toString() in code paths which fall under this 
category could be replaced with it.

-- start --
// java.text.NumberFormat renders integral doubles without a trailing ".0"
NumberFormat nf = NumberFormat.getInstance();

// inside the per-tuple loop:
String doubleString = nf.format(doubleValue);
-- end --

The performance characteristics are definitely worse than directly 
using Integer.toString()/Double.toString() - but the results will 
always be correct.

The snippet above is illustrative - you could hack up something faster 
& better (for example, something which delegates to 
Long.toString()/Double.toString() depending on the input).
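A concrete version of that delegation might look like the sketch below (hypothetical class and method names; this is one possible shape, not Pig's actual code): check whether the double is integral and take the fast Long.toString() path, otherwise fall back to Double.toString().

```java
public class NumToString {
    // Format a numeric value so that integral doubles print without ".0",
    // delegating to Long.toString() when the value is a whole number.
    static String format(double d) {
        long l = (long) d;
        if ((double) l == d) {            // integral and fits in a long
            return Long.toString(l);      // fast path: "3", not "3.0"
        }
        return Double.toString(d);        // fractional values keep full precision
    }

    public static void main(String[] args) {
        System.out.println(format(3.0));  // prints 3
        System.out.println(format(3.14)); // prints 3.14
    }
}
```

Values outside the long range (and oddities like -0.0) fall through to or are normalized by the respective toString() path, which is fine for an illustrative sketch.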

Implicit assumptions made about user input should always satisfy the 
principle of least astonishment - even at the cost of performance ... 
imho performance is always secondary to correctness and functionality.
Warning/error messages to indicate the loss of performance are 
definitely required though.

>>> 4) Pigs are friendly when treated nicely.  In the cases where the 
>>> user or udf didn't tell pig the type, it isn't an error if the type 
>>> of the datum doesn't match the operation.  Again, using the example 
>>> above, if $0 turns out (at least in some cases) to be a chararray 
>>> which cannot be cast to int, then a null datum plus a warning will be 
>>> emitted rather than an error.
>> This looks more like incompatibility of the input data with the 
>> script/udf, no ?
>> For example, if script declares column 1 is integer, and it turns in 
>> the file to be chararray, then either :
>> a) it is a schema error in the script - and it is useless to continue 
>> the pipeline.
>> b) it is an anomaly in input data.
>> c) space for rent.
>> Different usecases might want to handle (b) differently - (a) is 
>> universally something which should result in flagging the script as an 
>> error. Not really sure how you will make the distinction between (a) 
>> and (b) though ...
> In case 4 here I'm not talking about the situation where the user gave 
> us a schema and it turns out to be wrong.  That falls under case 1, 
> don't lie to the pig.  I'm thinking here of situations where the user 
> doesn't tell us what the data is or where the data is different row to 
> row because of a union or just inconsistent data, which pig does allow.

My assumption about (b) was not a result of incorrect data or an 
incorrect schema - but of things like the commingling of different 
tables through union/etc., udf output, etc. - which result in no schema 
being specified, and where inference is used.

Not something I normally hit - though if I do, I would prefer exceptions 
to silent creeping errors ... though it is quite logical to expect 
different behavior too.

My 2 cents, of course ymmv :-)


> Alan.
