Return-Path: X-Original-To: apmail-hive-user-archive@www.apache.org Delivered-To: apmail-hive-user-archive@www.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id 22861926F for ; Thu, 11 Oct 2012 13:33:13 +0000 (UTC) Received: (qmail 24181 invoked by uid 500); 11 Oct 2012 13:33:11 -0000 Delivered-To: apmail-hive-user-archive@hive.apache.org Received: (qmail 23810 invoked by uid 500); 11 Oct 2012 13:33:07 -0000 Mailing-List: contact user-help@hive.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: user@hive.apache.org Delivered-To: mailing list user@hive.apache.org Received: (qmail 23782 invoked by uid 99); 11 Oct 2012 13:33:06 -0000 Received: from nike.apache.org (HELO nike.apache.org) (192.87.106.230) by apache.org (qpsmtpd/0.29) with ESMTP; Thu, 11 Oct 2012 13:33:06 +0000 X-ASF-Spam-Status: No, hits=1.5 required=5.0 tests=HTML_MESSAGE,RCVD_IN_DNSWL_LOW,SPF_PASS X-Spam-Check-By: apache.org Received-SPF: pass (nike.apache.org: domain of edlinuxguru@gmail.com designates 209.85.210.176 as permitted sender) Received: from [209.85.210.176] (HELO mail-ia0-f176.google.com) (209.85.210.176) by apache.org (qpsmtpd/0.29) with ESMTP; Thu, 11 Oct 2012 13:33:00 +0000 Received: by mail-ia0-f176.google.com with SMTP id h11so1406850iae.35 for ; Thu, 11 Oct 2012 06:32:39 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:in-reply-to:references:date:message-id:subject:from:to :content-type; bh=XJrtuxHRCiSDR87+3khtOgf0UevqQJMSobhTb5swrvk=; b=SD3F75YDzxnvvDNelB44BnrEQT2aMnHPHFkWdOWgQJeGjwxKmraf1d1zPdEwlsYaDb TWAkivouB/8RekJ8IkY+T422xakp2sZb/5s7Xg4K0fJi6XXKmJKw+SBC0jxt3rpBFD+X 1AKQmlJ3d5x9QcA8pNivTOvw7Soh6X+aavrPs/gydiIAnPlnMulCaa+1/XONYXYQd37x COpZQ6YuiW+ufp/8aKXSDrxit72q+01i080vmKGZyR2QsksP7WKeeuki7eSgKTzejvQj lFTLv7qQMw9SJSOukwScJeU4dtFzeO0yTsCF77YCYcqiWMLrXxcT3S0kMVLDHg9qeSRz 4foQ== MIME-Version: 1.0 Received: by 10.50.219.230 with SMTP id pr6mr819204igc.29.1349962359172; Thu, 11 Oct 2012 06:32:39 -0700 (PDT) Received: by 10.64.97.106 with HTTP; Thu, 11 Oct 2012 06:32:39 -0700 (PDT) In-Reply-To: References: <53DF7C03-2807-4953-A35E-EBF8E3698529@qubole.com> Date: Thu, 11 Oct 2012 09:32:39 -0400 Message-ID: Subject: Re: View Partition Pruning not Occurring during transform From: Edward Capriolo To: "user@hive.apache.org" Content-Type: multipart/alternative; boundary=14dae93411c5e0ef3604cbc89a22 --14dae93411c5e0ef3604cbc89a22 Content-Type: text/plain; charset=ISO-8859-1 Have you considered rewriting the query using nested from clauses. Generally if hive is not 'pushing down' as you would assume nesting froms make the query happen in a specific way. On Wednesday, October 10, 2012, John Omernik wrote: > Agreed. That's the conclusion we came to as well. So it's less of a bug and more of a feature request. I think one of the main advantages of hive is the flexibility in allowing non-technical users to run basic queries without having to think about the transform stuff. (i.e. we in the IT shop can setup the transform) I like the annotation idea that some how the partition specs can be pushed through (identified in some other way etc). I am new to the Apache/JIRA world, what would you recommend for getting this into a feature request for consideration? I am not a Java programmer, so my idea may need to be paired with a champion to help implement it :) > > > On Wed, Oct 10, 2012 at 3:24 PM, shrikanth shankar wrote: >> >> I assume the reason for this is that the Hive compiler has no way of determining that the 'day' that is input into the transform script is the same 'day' that is output from the transform script. Even if it did, its unclear if pushing down would be legal without knowing the semantics of the transformation. Any optimization to be done here will likely need an annotation somewhere to say that certain columns in the output of a transform refer to specific columns in the input of a transform for predicate push down purposes (and that such pushdown is legal for this transformation) >> >> thanks, >> Shrikanth >> On Oct 10, 2012, at 12:04 PM, John Omernik wrote: >> >> > Greetings all, I am trying to incorporate a TRANSFORM into a view (so we can abstract the transform script away from the user) >> > >> > >> > >> > As a Test, I have a table partitioned on day (in YYYY-MM-DD formated) with lots of partitions >> > >> > and I tried this >> > >> > CREATE VIEW view_transform as >> > Select TRANSFORM (day, ip) using 'cat' as (day, ip) from source_table; >> > >> > The reason I used 'cat' in my test is if this works, I will distribute my transform scripts to each node manually, I know each node has cat, so this works as a test. >> > >> > When run >> > >> > SELECT * from view_transform where day = '2012-10-08' 10,432 map tasks get spun up. >> > >> > If I rewrite the view to be >> > >> > CREATE VIEW view_transform as >> > Select TRANSFORM (day, ip) using 'cat' as (day, ip) from source_table where day = '2012-10-08'; >> > >> > Then only 16 map tasks get spun up (the desired behavior, but the pruning is happening in the view not in the query) >> > >> > Thus I wanted input on whether this should be considered a bug. I.e. Should we be able to define a partition spec in a view that uses a transform that allows normal pruning to occur even though the partition spec will be passed to the transfrom script? I think we should, and it's likely doable some how. This would be awesome for a number of situations where you may want to expose "transformed" data to analysis without the mess of having them format their script for transform. >> > >> > >> > > --14dae93411c5e0ef3604cbc89a22 Content-Type: text/html; charset=ISO-8859-1 Content-Transfer-Encoding: quoted-printable Have you considered rewriting the query using nested from clauses. Generall= y if hive is not 'pushing down' as you would assume nesting froms m= ake the query happen in a specific way.

On Wednesday, October 10, 20= 12, John Omernik <john@omernik.com> wrote:
> Agreed. That's the conclusion we came to as well. So it's less= of a bug and more of a feature request. I think one of the main advantages= of hive is the flexibility in allowing non-technical users to run basic qu= eries without having to think about the transform stuff. (i.e. we in the IT= shop can setup the transform) =A0I like the annotation idea that some how = the partition specs can be pushed through (identified in some other way etc= ). =A0I am new to the Apache/JIRA world, what would you recommend for getti= ng this into a feature request for consideration? I am not a Java programme= r, so my idea may need to be paired with a champion to help implement it :)=
>
>
> On Wed, Oct 10, 2012 at 3:24 PM, shrikanth shankar <= ;
sshankar@qubole.com> wrote:<= br>>>
>> I assume the reason for this is that the Hive compi= ler has no way of determining that the 'day' that is input into the= transform script is the same 'day' that is output from the transfo= rm script. Even if it did, its unclear if pushing down would be legal witho= ut knowing the semantics of the transformation. Any optimization to be done= here will likely need an annotation somewhere to say that certain columns = in the output of a transform refer to specific columns in the input of a tr= ansform for predicate push down purposes (and that such pushdown is legal f= or this transformation)
>>
>> thanks,
>> Shrikanth
>> On Oct 10, 2= 012, at 12:04 PM, John Omernik wrote:
>>
>> > Greeting= s all, I am trying to incorporate a TRANSFORM into a view (so we can abstra= ct the transform script away from the user)
>> >
>> >
>> >
>> > As a Test,= I have a table partitioned on day (in YYYY-MM-DD formated) with lots of pa= rtitions
>> >
>> > and I tried this
>> >= ;
>> > CREATE VIEW view_transform as
>> > Select TRANSFO= RM (day, ip) using 'cat' as (day, ip) from source_table;
>>= ; >
>> > The reason I used 'cat' in my test is if th= is works, I will distribute my transform scripts to each node manually, I k= now each node has cat, so this works as a test.
>> >
>> > When run
>> >
>> > S= ELECT * from view_transform where day =3D '2012-10-08' =A010,432 ma= p tasks get spun up.
>> >
>> > If I rewrite the vie= w to be
>> >
>> > CREATE VIEW view_transform as
>> &g= t; Select TRANSFORM (day, ip) using 'cat' as (day, ip) from source_= table where day =3D '2012-10-08';
>> >
>> >= Then only 16 map tasks get spun up (the desired behavior, but the pruning = is happening in the view not in the query)
>> >
>> > Thus I wanted input on whether this should b= e considered a bug. =A0I.e. Should we be able to define a partition spec in= a view that uses a transform that allows normal pruning to occur even thou= gh the partition spec will be passed to the transfrom script? =A0I think we= should, and it's likely doable some how. This would be awesome for a n= umber of situations where you may want to expose "transformed" da= ta to analysis without the mess of having them format their script for tran= sform.
>> >
>> >
>>
>
> --14dae93411c5e0ef3604cbc89a22--