hadoop-pig-dev mailing list archives

From "Olga Natkovich" <ol...@yahoo-inc.com>
Subject RE: Does LOAD ... AS constitute a projection?
Date Mon, 02 Jun 2008 17:48:45 GMT
In general I agree with Pradeep that it is cleaner for the user to
declare all fields rather than mix names and positions.
However, I can also see a case where the data has a very wide schema
(say 25 columns) and the script uses the first 4 fields and then field 25.
Forcing the user to declare all 25 fields seems excessive. I wonder if
we should allow the schema to optionally include column positions, i.e. a
sparse schema.
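
To make the sparse-schema idea concrete, here is a minimal sketch in plain Python rather than Pig Latin, since no such Pig syntax exists yet. Everything in it is an assumption for illustration: the helper name `apply_sparse_schema`, the `(position, name, caster)` triple format, the 0-based positions, and the choice to return None for a declared field that is missing from a short row.

```python
# Hypothetical model of a "sparse schema": each entry gives a column
# position, a field name, and a type converter. This is not Pig syntax.
def apply_sparse_schema(row, schema):
    """Return a dict of the declared fields, cast to their declared types.

    row    -- list of raw string values for one tuple
    schema -- list of (position, name, caster) triples, 0-based positions
    """
    out = {}
    for pos, name, caster in schema:
        # Assumption: a declared field missing from a short row becomes None.
        out[name] = caster(row[pos]) if pos < len(row) else None
    return out


# A 25-column row where the script only uses the first 4 fields and field 25.
row = [str(i) for i in range(25)]
schema = [
    (0, "a", int),
    (1, "b", int),
    (2, "c", int),
    (3, "d", int),
    (24, "y", float),
]
fields = apply_sparse_schema(row, schema)
```

The user declares 5 entries instead of 25; the 20 undeclared columns simply never get a name or a type.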


> -----Original Message-----
> From: Pradeep Kamath [mailto:pradeepk@yahoo-inc.com] 
> Sent: Monday, June 02, 2008 10:19 AM
> To: pig-dev@incubator.apache.org
> Subject: RE: Does LOAD ... AS constitute a projection?
> I think the latter option of dropping fields more closely 
> matches user intent. Since the user gave a schema in the 
> load, it seems fair to assume that he is interested only in 
> the fields declared and hence expects to see only those 
> fields in output IMHO.
> -Pradeep
> -----Original Message-----
> From: Alan Gates [mailto:gates@yahoo-inc.com]
> Sent: Monday, June 02, 2008 9:59 AM
> To: pig-dev@incubator.apache.org
> Subject: Does LOAD ... AS constitute a projection?
> This mail applies only to changes in the types branch.
> The central question is whether the declaration of fields in 
> LOAD ... AS
> constitutes a projection or not.
> With the changes in the types branch, we are allowing users to 
> declare types for fields in the load like this:
> A = LOAD 'myfile' AS (a: int, b:float);
> We would like to implement this as:
> A = LOAD 'myfile';
> A' = FOREACH A generate (int)$0, (float)$1;
> and then let the optimizer push that conversion as far down 
> as possible,
> or completely remove it in cases where a declared field is 
> never used.
> But consider a pig latin script such as:
> A = LOAD 'myfile' AS (a: int, b: float);
> B = FILTER A BY a > 0;
> C = SORT B by a;
> STORE C;
> What if a given tuple has 3 fields instead of 2?  Is that 
> field anonymously carried along and stored as part of C?  Or 
> does the AS in LOAD constitute a projection, so that it's 
> legal to lop off any fields past the second (b)?
> In favor of carrying it along is the argument that we 
> shouldn't force the user to declare all data in a file; maybe 
> he only wants to declare the few fields he needs to work with 
> but still wants to store all the rest.
> In favor of lopping it is that, since the user told us about his 
> data, we're justified in assuming that he described it 
> completely.  It is also easier to implement this way, as it 
> allows us to make a set of optimization assumptions.
> Thoughts?
> Alan.
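
The two readings of LOAD ... AS in the quoted message can be compared with a small plain-Python model. This is only a sketch of the semantics being debated, not how Pig implements either one; the function `load_as` and its `project` flag are hypothetical names introduced here.

```python
def load_as(rows, casters, project):
    """Model LOAD ... AS over rows of raw string values.

    casters -- one type converter per declared field, in order
    project -- True: AS is a projection, undeclared trailing fields are dropped
               False: undeclared trailing fields are carried along uncast
    """
    n = len(casters)
    result = []
    for row in rows:
        # Cast the declared leading fields, like (int)$0, (float)$1.
        declared = [cast(value) for cast, value in zip(casters, row)]
        extra = [] if project else list(row[n:])
        result.append(tuple(declared + extra))
    return result


# The second tuple has 3 fields although only 2 were declared.
rows = [["1", "2.5"], ["3", "4.5", "surplus"]]
casters = [int, float]

projected = load_as(rows, casters, project=True)   # third field is lopped off
carried = load_as(rows, casters, project=False)    # third field survives
```

With `project=True` every output tuple has exactly the declared width, which is the property that makes the optimization assumptions mentioned above possible.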
