pig-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Prasanth J <buckeye.prasa...@gmail.com>
Subject Re: Handle NULL values in Cube dimensions
Date Tue, 12 Jun 2012 05:53:16 GMT
Thanks Jon for your thoughts.

I have a patch which renames the null values in dimension values to "unknown" and use null
for rollups. For a sample input tuple

red, null, 12

a = cube inp by ($0, $1);
the above query will emit following combinations 

red, unknown, 12
 , unknown, 12
red, , 12
 , , 12

Please let me know if anyone have different opinion on this. 
If the above choice looks good I can go ahead and submit the patch in JIRA. Thanks.

Thanks
-- Prasanth

On Jun 8, 2012, at 11:06 PM, Jonathan Coveney wrote:

> you could always make the value pluggable, going with Unknown for now, and
> then down the line if we want, we could add an "ONNULL" value to the parser
> that sets it.
> 
> 2012/6/8 Prasanth J <buckeye.prasanth@gmail.com>
> 
>> Thanks Alan and Dmitriy for your thoughts.
>> 
>> I think we have two different approaches now.
>> 
>> In one approach, if we encounter a null in dimension values we can just
>> label it as "unknown" and use "NULL" string to represent rollups. Whereas,
>> in other approach, if we encounter a null in dimension values, use the null
>> value as such but use "*" or any other string for rollups.
>> 
>> Both approaches looks good to me. Please let me know which one should I go
>> ahead with.
>> 
>> Thanks
>> -- Prasanth
>> 
>> On Jun 8, 2012, at 12:22 PM, Alan Gates wrote:
>> 
>>> Option 1 (throwing an error) is bad.  It violates "Pigs eat anything"
>> (see http://pig.apache.org/philosophy.html).
>>> 
>>> Do we need to give users an ability to name this unknown column?  Why
>> not just label it "unknown" and be done?
>>> 
>>> Alan.
>>> 
>>> On Jun 6, 2012, at 2:24 PM, Prasanth J wrote:
>>> 
>>>> Hello everyone
>>>> 
>>>> I would like to bring up this discussion about the ways for handling
>> NULL values in dimensions specified for cubing. For example, if we have a
>> dimension color with following values
>>>> 
>>>> red
>>>> blue
>>>> null
>>>> green
>>>> 
>>>> how do we differentiate if the null value represent rollup of all
>> colors values or actual null value?
>>>> 
>>>> SQL way:
>>>> There are 2 ways in which SQL server analysis services handles null
>> values in dimensions
>>>> 1) Throw error when it encounters null values in dimension values
>>>> 2) Ignore error by adding the null values to UnknownMembers. By default
>> UnknownMembers will be named as "Unknown". The name for UnknownMembers can
>> also be specified by the user.
>>>> 
>>>> Do we need to handle both ways in Pig? I think the first way (throwing
>> error) is pretty straightforward.
>>>> For the second way (ignoring error), what is the best way to provide
>> support for user specified name for UnknownMembers?
>>>> 
>>>> Please share your thoughts about how we can handle this scenario for
>> different datatypes in Pig.
>>>> 
>>>> Thanks
>>>> -- Prasanth
>>>> 
>>> 
>> 
>> 


Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message