hadoop-pig-dev mailing list archives

From Alan Gates <ga...@yahoo-inc.com>
Subject Re: Getting query information while loading data
Date Fri, 18 Jan 2008 19:45:27 GMT
We're definitely interested.

Our thinking on how to provide field metadata (names and eventually 
types) for Pig queries was to allow several options:
    1) AS in the LOAD, as you can currently do for names.
    2) Use an outside metadata service: we would give it the file name 
and it would return the metadata.
    3) Support self-describing data formats such as JSON.

Your suggestion for a very simple schema provided in the first line of 
the file falls under category 3.  The trick here is that we need to be 
able to read that metadata about the fields at parse time (because we'd 
like to be able to do type checking and such).  So in addition to the 
load function itself needing to examine the tuples, we need a way for 
the load function to read just enough of the file to tell the front end 
(on the client box, not on the map-reduce backend) the schema.  Maybe 
the best way to implement this is to have an interface that the load 
function would implement that lets the parser know that the load 
function can discover the metadata for it, and then the parser could 
call that load function before proceeding to type checking.
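
A minimal sketch of that idea, under assumptions: the names 
SchemaDiscoverer, HeaderLineLoader, and schemaFor are hypothetical and 
not part of the Pig code base, and the schema is reduced to a list of 
field names read from a comma-separated header line.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical interface: a load function implements it to signal that it
// can discover field metadata on the client, before type checking.
interface SchemaDiscoverer {
    // Reads only as much of the input as needed and returns the field names.
    List<String> discoverFieldNames(BufferedReader input);
}

// Hypothetical loader for comma-separated data whose first line names the fields.
class HeaderLineLoader implements SchemaDiscoverer {
    @Override
    public List<String> discoverFieldNames(BufferedReader input) {
        String header;
        try {
            header = input.readLine(); // one line is enough for this format
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        List<String> names = new ArrayList<>();
        if (header == null || header.isEmpty()) {
            return names; // empty input: no schema to report
        }
        for (String name : header.split(",")) {
            names.add(name.trim());
        }
        return names;
    }
}

class SchemaDiscoveryDemo {
    // Stand-in for the parser step: ask the loader only if it offers discovery.
    static List<String> schemaFor(Object loader, BufferedReader input) {
        if (loader instanceof SchemaDiscoverer) {
            return ((SchemaDiscoverer) loader).discoverFieldNames(input);
        }
        return null; // fall back to AS or an outside metadata service
    }

    public static void main(String[] args) {
        BufferedReader in = new BufferedReader(
                new StringReader("user, url, time\nalice,/a,10\n"));
        System.out.println(schemaFor(new HeaderLineLoader(), in));
    }
}
```

The instanceof check keeps existing load functions working unchanged: 
only loaders that opt in are asked for metadata.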

We're also interested in being able to tell the load function which 
fields the query needs.  Even if you don't have field-per-file storage 
(aka columnar storage), it's useful to be able to immediately project 
out fields you know the query won't care about, since you avoid 
translation costs and memory storage.

It's not clear to me that we need another interface to implement this.  
We could just add a method "void neededColumns(Schema s)" to PigLoader.  
As a post-parsing step the parser would then visit the plan, as you 
suggest, and submit a schema to the PigLoader function.  It would be up 
to the specific loader implementation to decide whether to make use of 
the provided schema or not.
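
To illustrate, here is a rough sketch of a loader that honors such a 
hint.  It is an assumption-laden stand-in: ProjectingLoader is a 
hypothetical class, the real PigLoader and Schema types are not 
reproduced, and the schema is simplified to a list of field names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical loader; stands in for a PigLoader implementation.
class ProjectingLoader {
    private final List<String> allFields; // field names as stored in the file
    private int[] keep;                    // positions to retain, in query order

    ProjectingLoader(List<String> allFields) {
        this.allFields = allFields;
        this.keep = null; // null means no hint was given: load everything
    }

    // Analogue of the proposed "void neededColumns(Schema s)",
    // with the schema reduced to field names (an assumption).
    void neededColumns(List<String> needed) {
        int[] positions = new int[needed.size()];
        for (int i = 0; i < needed.size(); i++) {
            int pos = allFields.indexOf(needed.get(i));
            if (pos < 0) {
                throw new IllegalArgumentException("unknown field: " + needed.get(i));
            }
            positions[i] = pos;
        }
        keep = positions;
    }

    // Splits one comma-separated record and keeps only the requested fields.
    List<String> parse(String line) {
        String[] raw = line.split(",", -1);
        if (keep == null) {
            return new ArrayList<>(Arrays.asList(raw));
        }
        List<String> tuple = new ArrayList<>(keep.length);
        for (int pos : keep) {
            tuple.add(pos < raw.length ? raw[pos] : null); // short rows yield nulls
        }
        return tuple;
    }
}

class NeededColumnsDemo {
    public static void main(String[] args) {
        ProjectingLoader loader =
                new ProjectingLoader(Arrays.asList("user", "url", "time"));
        loader.neededColumns(Arrays.asList("time", "user"));
        System.out.println(loader.parse("alice,/a,10"));
    }
}
```

This sketch still splits the whole line, so it only saves tuple memory; 
a loader with typed or field-per-file storage could also skip 
converting or even reading the unneeded fields.  A loader that ignores 
the hint simply never has neededColumns called to any effect and 
returns every field.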



Charlie Groves wrote:
> I'd like to expose the running query to my loading code for a few 
> reasons:
> - To allow the schema of the loaded data to be specified by its usage 
> in the query, rather than by an explicit AS.  I know the names of the 
> fields in my data, so it seems backwards to me to require it to be 
> named in the query.  I'd rather use the data access in the query to 
> figure out the names of the fields and pass that to my loader to put 
> the data in the right place in a tuple.  This also seems like it could 
> be nice for CSV data since it generally has the names as the first line.
> - Following up on using the query to determine the schema, I'd like to 
> use the query-determined schema to decide what to load.  My storage is 
> broken out into files by field, so if I know which fields are used in 
> a query, I can read only those fields and save a huge amount of busywork.
> - To optimize filter operations using indexes.  For some of my fields, 
> I have metadata that tells me the range of values in that file.  If I 
> could find all the filter operations on that field, I could reject 
> entire files if their values fell outside the filter range.
> Are you interested in some patches to do this sort of thing?  If so, 
> what's the best way to expose this information to user code?  My very 
> basic, initial thinking for the first two use cases is to write a 
> LOVisitor and an EvalSpecVisitor to spider through the built query and 
> build a schema to pass to an interested load func.  A load func 
> indicates its interest by implementing a new interface that takes the 
> schema, and it takes responsibility for making a tuple that conforms 
> to the schema.  If a load func isn't interested, it just implements 
> the current interface and loads all the data in its input stream.
> The final use case seems like it would require exposing EvalFuncs and 
> the LogicalPlan to user code, so I'm fine with just going after the 
> first two for now and figuring that out later.  However, if there's a 
> way that's exposed already in the code that I've missed, or if there's 
> a better way to do it, I'd like to check it out since it'd be hugely 
> beneficial for what I'm doing.
> Thanks,
> Charlie
