hive-user mailing list archives

From Edward Capriolo <>
Subject Re: View Partition Pruning not Occurring during transform
Date Thu, 11 Oct 2012 13:32:39 GMT
Have you considered rewriting the query using nested FROM clauses?
Generally, if Hive is not 'pushing down' as you would expect, nesting FROMs
makes the query execute in a specific way.
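
A rough sketch of what a nested FROM could look like for the query quoted
below. The table and column names come from the original example; the alias
`src` is made up for illustration. It assumes the filter can be written in
the inner query, and it has not been run against the original data.

```sql
-- Sketch only: source_table, day and ip are from the quoted example;
-- the alias src is hypothetical.
-- The inner FROM filters on the partition column before any rows reach
-- the transform script, so the predicate never has to pass through TRANSFORM.
FROM (
  SELECT day, ip
  FROM source_table
  WHERE day = '2012-10-08'   -- filter on the partition column
) src
SELECT TRANSFORM (src.day, src.ip)
  USING 'cat'
  AS (day, ip);
```

Note that this puts the partition filter in the query itself rather than in
a view, so it does not by itself hide the transform from end users.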

On Wednesday, October 10, 2012, John Omernik <> wrote:
> Agreed. That's the conclusion we came to as well. So it's less of a bug
and more of a feature request. I think one of the main advantages of Hive
is the flexibility in allowing non-technical users to run basic queries
without having to think about the transform stuff (i.e., we in the IT shop
can set up the transform). I like the annotation idea, that somehow the
partition specs can be pushed through (identified in some other way, etc.).
I am new to the Apache/JIRA world; what would you recommend for getting
this into a feature request for consideration? I am not a Java programmer,
so my idea may need to be paired with a champion to help implement it :)
> On Wed, Oct 10, 2012 at 3:24 PM, shrikanth shankar <>
>> I assume the reason for this is that the Hive compiler has no way of
determining that the 'day' that is input into the transform script is the
same 'day' that is output from the transform script. Even if it did, it's
unclear if pushing down would be legal without knowing the semantics of the
transformation. Any optimization to be done here will likely need an
annotation somewhere to say that certain columns in the output of a
transform refer to specific columns in the input of a transform for
predicate push down purposes (and that such pushdown is legal for this
>> thanks,
>> Shrikanth
>> On Oct 10, 2012, at 12:04 PM, John Omernik wrote:
>> > Greetings all, I am trying to incorporate a TRANSFORM into a view (so
we can abstract the transform script away from the user)
>> >
>> >
>> >
>> > As a test, I have a table partitioned on day (in YYYY-MM-DD format)
with lots of partitions
>> >
>> > and I tried this
>> >
>> > CREATE VIEW view_transform as
>> > Select TRANSFORM (day, ip) using 'cat' as (day, ip) from source_table;
>> >
>> > The reason I used 'cat' in my test is that, if this works, I will
distribute my transform scripts to each node manually. I know each node has
cat, so this works as a test.
>> >
>> > When I run
>> >
>> > SELECT * from view_transform where day = '2012-10-08';
>> >
>> > 10,432 map tasks get spun up.
>> >
>> > If I rewrite the view to be
>> >
>> > CREATE VIEW view_transform as
>> > Select TRANSFORM (day, ip) using 'cat' as (day, ip) from source_table
where day = '2012-10-08';
>> >
>> > Then only 16 map tasks get spun up (the desired behavior, but the
pruning is happening in the view not in the query)
>> >
>> > Thus I wanted input on whether this should be considered a bug, i.e.,
should we be able to define a partition spec in a view that uses a
transform that allows normal pruning to occur even though the partition
spec will be passed to the transform script?  I think we should, and it's
likely doable somehow. This would be awesome for a number of situations
where you may want to expose "transformed" data to analysts without the
mess of having them format their script for transform.
>> >
>> >
