hadoop-pig-dev mailing list archives

From Charlie Groves <char...@threerings.net>
Subject Getting query information while loading data
Date Fri, 18 Jan 2008 01:16:10 GMT
I'd like to expose the running query to my loading code for a few  
reasons:

- To allow the schema of the loaded data to be specified by its usage
in the query, rather than by an explicit AS.  I know the names of the
fields in my data, so it seems backwards to me to require them to be
named again in the query.  I'd rather use the field accesses in the
query to figure out the names of the fields and pass those to my
loader, which can then put the data in the right place in a tuple.
This also seems like it could be nice for CSV data, since it
generally has the names as the first line.

- Following up on using the query to determine the schema, I'd like  
to use the query-determined schema to decide what to load.  My  
storage is broken out into files by field, so if I know which fields  
are used in a query, I can read only those fields and save a huge  
amount of busywork.

- To optimize filter operations using indexes.  For some of my
fields, I have metadata that tells me the range of values in each
file.  If I could find all the filter operations on such a field, I
could reject entire files whose values fall outside the filter range.
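
To make the third use case concrete, here is a minimal sketch of the file-rejection check, assuming each file's metadata is an inclusive [min, max] pair for one numeric field and the filter is an inclusive range.  FieldFileRange and RangePruner are hypothetical names made up for illustration; nothing here is existing Pig code.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical per-file metadata: min and max value of one field in one file. */
final class FieldFileRange {
    final String file;
    final long min;
    final long max;

    FieldFileRange(String file, long min, long max) {
        this.file = file;
        this.min = min;
        this.max = max;
    }
}

/** Hypothetical helper, not part of Pig. */
final class RangePruner {
    /** Keeps only the files whose [min, max] overlaps the inclusive filter range [lo, hi]. */
    static List<String> filesToRead(List<FieldFileRange> files, long lo, long hi) {
        List<String> keep = new ArrayList<>();
        for (FieldFileRange f : files) {
            // A file can contain a match only if its value range intersects the filter range.
            boolean overlaps = f.max >= lo && f.min <= hi;
            if (overlaps) {
                keep.add(f.file);
            }
        }
        return keep;
    }
}
```

A file whose range misses the filter range entirely is never opened; files that overlap still need the filter applied row by row.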

Are you interested in some patches to do this sort of thing?  If so,  
what's the best way to expose this information to user code?  My very  
basic, initial thinking for the first two use cases is to write a  
LOVisitor and an EvalSpecVisitor to spider through the built query  
and build a schema to pass to an interested load func.  A load func  
indicates its interest by implementing a new interface that takes the  
schema, and it takes responsibility for making a tuple that conforms  
to the schema.  If a load func isn't interested, it just implements  
the current interface and loads all the data in its input stream.
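
A rough sketch of the opt-in idea, to show the shape I have in mind.  Every name here is hypothetical: SimpleLoader is only a stand-in for the current load interface (it is not Pig's actual load func signature), QuerySchemaAware is the proposed new interface, and a record is modeled as a plain Map so the example is self-contained.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Stand-in for the existing load interface; NOT Pig's real API. */
interface SimpleLoader {
    List<Object> nextTuple(Map<String, Object> record);
}

/** Hypothetical opt-in interface: receives the field names the query uses. */
interface QuerySchemaAware {
    void setQuerySchema(List<String> fieldNames);
}

/** A loader that opts in and builds tuples conforming to the query's schema. */
final class ProjectingLoader implements SimpleLoader, QuerySchemaAware {
    private List<String> fields = new ArrayList<>();

    @Override
    public void setQuerySchema(List<String> fieldNames) {
        fields = new ArrayList<>(fieldNames);
    }

    @Override
    public List<Object> nextTuple(Map<String, Object> record) {
        // Only the requested fields are read, in the order the schema lists them.
        List<Object> tuple = new ArrayList<>();
        for (String name : fields) {
            tuple.add(record.get(name));
        }
        return tuple;
    }
}

/** Hypothetical setup step run once the query has been analyzed. */
final class LoaderSetup {
    /** Passes the schema only to loaders that opted in; returns whether it did. */
    static boolean offerSchema(SimpleLoader loader, List<String> fieldNames) {
        if (loader instanceof QuerySchemaAware) {
            ((QuerySchemaAware) loader).setQuerySchema(fieldNames);
            return true;
        }
        return false;
    }
}
```

Loaders that implement only the existing interface are untouched by offerSchema, so current load funcs would keep working unchanged.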

The final use case seems like it would require exposing EvalFuncs and  
the LogicalPlan to user code, so I'm fine with just going after the  
first two for now and figuring that out later.  However, if there's a  
way that's exposed already in the code that I've missed, or if  
there's a better way to do it, I'd like to check it out since it'd be  
hugely beneficial for what I'm doing.

Thanks,
Charlie
