From: "Owen O'Malley"
To: Peter Marron
Cc: "user@hive.apache.org"
Date: Sun, 19 May 2013 21:36:21 -0700
Subject: Re: Filtering
Mailing-List: user@hive.apache.org

On Sun, May 19, 2013 at 3:11 PM, Peter Marron <Peter.Marron@trilliumsoftware.com> wrote:

> Hi Owen,
>
> Firstly I want to say a huge thank you. You have really helped me enormously.


You're welcome.

> OK. I think that I get it now. In my custom InputFormat I can read the config settings
>
> JobConf .get(“"hive.io.filter.text"”);
> JobConf .get(“"hive.io.filter.expr.serialized"”);


Well, you don't need the double quotes, but yes.

> And so I can then find the predicate that I need to do the filtering.
> In particular I can set the input splits so that it just reads the right records.


Right. You want the serialized one, because there is an API to convert it back to a data structure.
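As a minimal sketch of the lookup described above: the two property names come from this thread, but everything else is an assumption made so the example runs on its own. In particular, a plain Map stands in for Hadoop's JobConf (which also exposes get(String)), and the class and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical helper. The Map is a stand-in for Hadoop's JobConf so that
// the sketch compiles and runs without Hadoop on the classpath.
class FilterConfigSketch {
    // Property names as quoted in the thread.
    static final String FILTER_TEXT = "hive.io.filter.text";
    static final String FILTER_EXPR = "hive.io.filter.expr.serialized";

    /** Returns the serialized predicate, or empty when no filter was pushed down. */
    static Optional<String> serializedFilter(Map<String, String> conf) {
        String value = conf.get(FILTER_EXPR);
        return (value == null || value.isEmpty()) ? Optional.empty() : Optional.of(value);
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        // No filter has been set yet, so nothing is present.
        System.out.println(serializedFilter(conf).isPresent());

        // Placeholder values, not real Hive output.
        conf.put(FILTER_TEXT, "(age > 30)");
        conf.put(FILTER_EXPR, "<serialized-expression/>");
        System.out.println(serializedFilter(conf).orElse("none"));
    }
}
```

Treating a missing property as "no filter" keeps the InputFormat usable for queries without a WHERE clause.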

> 1) I didn’t know about HIVE-2925 and I would never have thought that suppressing the
> Map/Reduce would be controlled by something called “hive.fetch.task.conversion”.
> So maybe I’m missing a trick. How should I have found out about HIVE-2925?

There isn't a "trick" other than being willing to ask on the user lists and use your favorite search engine. As Hive developers, we absolutely need to make more things happen automatically and reduce the need to know specific magic incantations. Or at least document the magic incantations. *smile*

> 2) I would like to parse the filter.expr.serialized XML and I assume that there’s some
> SAX, DOM or even XSLT already in HIVE. Could you give me a pointer to which classes
> are used (JAXP, Xerces, Xalan?) or where they are being used? Not important,
> I’m just being lazy.

If you look at pushFilters, it is using Utilities.serializeExpression, so Utilities.deserializeExpression will reverse it.
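To illustrate why hand-parsing the XML is unnecessary, here is a small round-trip sketch. It assumes, without the thread confirming it, that the serialized form is produced by the JDK's java.beans.XMLEncoder, which would explain why the property holds XML. The class name and the sample list are hypothetical; in a real InputFormat you would call Utilities.deserializeExpression instead.

```java
import java.beans.XMLDecoder;
import java.beans.XMLEncoder;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical demo of a JDK XML round trip; this is not Hive's own code.
class XmlRoundTripSketch {
    /** Encodes an object graph to an XML string with java.beans.XMLEncoder. */
    static String serialize(Object value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (XMLEncoder encoder = new XMLEncoder(out)) {
            encoder.writeObject(value);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    /** Rebuilds the object graph from the XML string with java.beans.XMLDecoder. */
    static Object deserialize(String xml) {
        byte[] bytes = xml.getBytes(StandardCharsets.UTF_8);
        try (XMLDecoder decoder = new XMLDecoder(new ByteArrayInputStream(bytes))) {
            return decoder.readObject();
        }
    }

    public static void main(String[] args) {
        // A stand-in for an expression tree: column, operator, constant.
        List<String> expr = new ArrayList<>(Arrays.asList("age", "GREATER_THAN", "30"));
        String xml = serialize(expr);
        System.out.println(xml);
        System.out.println(expr.equals(deserialize(xml)));
    }
}
```

The point of the sketch is that the decoder hands back a ready-made object graph, so no SAX, DOM, or XSLT code is needed on the reading side.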

> 3) I really want to do my filtering in the getSplits of my custom InputFormat. However
> I have found that my getSplits is not being called. (And I asked about this on the list
> before.) I have found that if I do this
> set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat
> then my method is invoked. It seems to be something to do with avoiding
> the use of the org.apache.hadoop.hive.ql.io.CombineHiveInputFormat class.
> However I don’t know whether there are any other bad things that will happen
> if I make this change as I don’t really know what I’m doing.
> Is this a safe thing to do?

Yes, that is a fine thing to do. It does mean that you'll need to ensure you don't have too many maps, but other than that you should be ok. The primary purpose of CombineHiveInputFormat is to allow Mappers to read from multiple files.
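A rough sketch of the kind of pruning that getSplits could do once the predicate is known. Everything here is hypothetical: the Split class, the per-split min/max statistics (which a custom format would have to maintain itself, for example in an index), and the assumption that the predicate reduces to a simple key range.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of split pruning; it does not use Hadoop's
// InputSplit so that it runs on its own.
class SplitPruningSketch {
    /** Hypothetical split descriptor: a path plus min/max of the filtered column. */
    static final class Split {
        final String path;
        final long minKey;
        final long maxKey;

        Split(String path, long minKey, long maxKey) {
            this.path = path;
            this.minKey = minKey;
            this.maxKey = maxKey;
        }
    }

    /** Keeps only the splits whose [minKey, maxKey] range overlaps [lo, hi]. */
    static List<Split> prune(List<Split> splits, long lo, long hi) {
        List<Split> kept = new ArrayList<>();
        for (Split s : splits) {
            if (s.maxKey >= lo && s.minKey <= hi) {
                kept.add(s);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Split> all = new ArrayList<>();
        all.add(new Split("part-0", 0, 99));
        all.add(new Split("part-1", 100, 199));
        all.add(new Split("part-2", 200, 299));

        // Predicate assumed to be: key BETWEEN 50 AND 120.
        for (Split s : prune(all, 50, 120)) {
            System.out.println(s.path);
        }
    }
}
```

Dropping splits this way also reduces the number of map tasks, which helps with the "too many maps" concern when CombineHiveInputFormat is no longer merging small inputs.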

> However I would like to say thanks again. If we ever meet in the real world
> I’ll stand you a beer (or equivalent).


Sounds good, although I'll take the equivalent, since I don't enjoy alcohol.

> Congratulations on version 0.11.0.


Thanks!

-- Owen