Return-Path: Delivered-To: apmail-hive-user-archive@www.apache.org Received: (qmail 21343 invoked from network); 8 Nov 2010 06:35:39 -0000 Received: from unknown (HELO mail.apache.org) (140.211.11.3) by 140.211.11.9 with SMTP; 8 Nov 2010 06:35:39 -0000 Received: (qmail 38269 invoked by uid 500); 8 Nov 2010 06:36:10 -0000 Delivered-To: apmail-hive-user-archive@hive.apache.org Received: (qmail 38024 invoked by uid 500); 8 Nov 2010 06:36:08 -0000 Mailing-List: contact user-help@hive.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: user@hive.apache.org Delivered-To: mailing list user@hive.apache.org Received: (qmail 38016 invoked by uid 99); 8 Nov 2010 06:36:07 -0000 Received: from nike.apache.org (HELO nike.apache.org) (192.87.106.230) by apache.org (qpsmtpd/0.29) with ESMTP; Mon, 08 Nov 2010 06:36:07 +0000 X-ASF-Spam-Status: No, hits=4.7 required=10.0 tests=FREEMAIL_ENVFROM_END_DIGIT,FREEMAIL_FROM,FREEMAIL_REPLY,RCVD_IN_DNSWL_NONE,SPF_PASS,T_TO_NO_BRKTS_FREEMAIL X-Spam-Check-By: apache.org Received-SPF: pass (nike.apache.org: domain of timrobertson100@gmail.com designates 209.85.215.48 as permitted sender) Received: from [209.85.215.48] (HELO mail-ew0-f48.google.com) (209.85.215.48) by apache.org (qpsmtpd/0.29) with ESMTP; Mon, 08 Nov 2010 06:36:01 +0000 Received: by ewy3 with SMTP id 3so2655094ewy.35 for ; Sun, 07 Nov 2010 22:35:41 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=domainkey-signature:mime-version:received:received:in-reply-to :references:date:message-id:subject:from:to:content-type :content-transfer-encoding; bh=gQUn85zyXC5wQDz/Km06pvvnLEmvLQKiRuz4BIZLIeQ=; b=je6d+5Y8YdDOfyXYRVGSDFhFlxQfK6oT0dEJj7g+kcX94pu0QWT5rncfRA2owed9/5 acNNLTVE6ETaNOiFiZhkGRj3MMnsY6WBk6Qrjjlz26XkkCxztvwliwXxXWrmvN6M55Jy CgYMG6bkqalet4OPgxS49JKD0NwR2PsNFkiPQ= DomainKey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=gamma; h=mime-version:in-reply-to:references:date:message-id:subject:from:to :content-type:content-transfer-encoding; b=SiA8F4OOwzYXPJJPAzGdhoD/cZ/L8V1GEIV9raADVJceqYjIz9IyN/ko//vijJrOyF 7j0q8WBEKM4ZnXdSmGbU3KatdoJPZGRNwRND6RFEYP2RidYgfLcgfDHNec+w6Otqi8Gs IPEqm1yM/RtOO0x52JgrWFQnNEYOcY9sSyQ8I= MIME-Version: 1.0 Received: by 10.14.48.10 with SMTP id u10mr3114536eeb.39.1289198139872; Sun, 07 Nov 2010 22:35:39 -0800 (PST) Received: by 10.14.125.201 with HTTP; Sun, 7 Nov 2010 22:35:39 -0800 (PST) In-Reply-To: References: Date: Mon, 8 Nov 2010 07:35:39 +0100 Message-ID: Subject: Re: Unions causing many scans of input - workaround? From: Tim Robertson To: user@hive.apache.org Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: quoted-printable X-Virus-Checked: Checked by ClamAV on apache.org Thank you both, A quick glance looks like that is what I am looking for. When I get it working, I'll post the solution. Cheers, Tim On Mon, Nov 8, 2010 at 6:55 AM, Namit Jain wrote: > Other option would be to create a wrapper script (not use either UDF or > UDTF) > That script, in any language, can emit any number of output rows per inpu= t > row. > > Look at: > http://wiki.apache.org/hadoop/Hive/LanguageManual/Transform > for details > > ________________________________ > From: Sonal Goyal [sonalgoyal4@gmail.com] > Sent: Sunday, November 07, 2010 8:40 PM > To: user@hive.apache.org > Subject: Re: Unions causing many scans of input - workaround? > > Hey Tim, > > You have an interesting problem. Have you tried creating a UDTF for your > case, so that you can possibly emit more than one record for each row of > your input? > > http://wiki.apache.org/hadoop/Hive/DeveloperGuide/UDTF > > Thanks and Regards, > Sonal > > Sonal Goyal | Founder and CEO | Nube Technologies LLP > http://www.nubetech.co | http://in.linkedin.com/in/sonalgoyal > > > > > > On Mon, Nov 8, 2010 at 2:31 AM, Tim Robertson > wrote: >> >> Hi all, >> >> I am porting custom MR code to Hive and have written working UDFs >> where I need them. =A0Is there a work around to having to do this in >> Hive: >> >> select * from >> ( >> =A0 =A0select name_id, toTileX(longitude,0) as x, toTileY(latitude,0) as >> y, 0 as zoom, funct2(lontgitude, 0) as f2_x, funct2(latitude,0) as >> f2_y, count (1) as count >> =A0 =A0from table >> =A0 =A0group by name_id, x, y, f2_x, f2_y >> >> =A0 =A0UNION ALL >> >> =A0 =A0select name_id, toTileX(longitude,1) as x, toTileY(latitude,1) as >> y, 1 as zoom, funct2(lontgitude, 1) as f2_x, funct2(latitude,1) as >> f2_y, count (1) as count >> =A0 =A0from table >> =A0 =A0group by name_id, x, y, f2_x, f2_y >> >> =A0 --- etc etc increasing in zoom >> ) >> >> The issue being that this does many passes over the table, whereas >> previously in my Map() I would just emit many times from the same >> input record and then let it all group in the shuffle and sort. >> I actually emit 184 times for an input record (23 zoom levels of >> google maps, and 8 ways to derive the name_id) for a single record >> which means 184 union statements - Is it possible in hive to force it >> to emit many times from the source record in the stage-1 map? >> >> (ahem) Does anyone know if Pig can do this if not in Hive? >> >> I hope I have explained this well enough to make sense. >> >> Thanks in advance, >> Tim > >