hive-user mailing list archives

From Tim Robertson <timrobertson...@gmail.com>
Subject Re: Unions causing many scans of input - workaround?
Date Mon, 08 Nov 2010 06:35:39 GMT
Thank you both,

At a quick glance, that looks like what I am looking for.  When I get
it working, I'll post the solution.

Cheers,
Tim

On Mon, Nov 8, 2010 at 6:55 AM, Namit Jain <njain@facebook.com> wrote:
> Another option would be to create a wrapper script (using neither a UDF
> nor a UDTF).
> That script, in any language, can emit any number of output rows per input
> row.
>
> Look at:
> http://wiki.apache.org/hadoop/Hive/LanguageManual/Transform
> for details
>
> ________________________________
> From: Sonal Goyal [sonalgoyal4@gmail.com]
> Sent: Sunday, November 07, 2010 8:40 PM
> To: user@hive.apache.org
> Subject: Re: Unions causing many scans of input - workaround?
>
> Hey Tim,
>
> You have an interesting problem. Have you tried creating a UDTF for your
> case, so that you can possibly emit more than one record for each row of
> your input?
>
> http://wiki.apache.org/hadoop/Hive/DeveloperGuide/UDTF
>
> Thanks and Regards,
> Sonal
>
> Sonal Goyal | Founder and CEO | Nube Technologies LLP
> http://www.nubetech.co | http://in.linkedin.com/in/sonalgoyal
>
>
>
>
>
> On Mon, Nov 8, 2010 at 2:31 AM, Tim Robertson <timrobertson100@gmail.com>
> wrote:
>>
>> Hi all,
>>
>> I am porting custom MR code to Hive and have written working UDFs
>> where I need them.  Is there a workaround to having to do this in
>> Hive:
>>
>> select * from
>> (
>>    select name_id, toTileX(longitude,0) as x, toTileY(latitude,0) as y,
>>      0 as zoom, funct2(longitude, 0) as f2_x, funct2(latitude,0) as f2_y,
>>      count (1) as count
>>    from table
>>    group by name_id, x, y, f2_x, f2_y
>>
>>    UNION ALL
>>
>>    select name_id, toTileX(longitude,1) as x, toTileY(latitude,1) as y,
>>      1 as zoom, funct2(longitude, 1) as f2_x, funct2(latitude,1) as f2_y,
>>      count (1) as count
>>    from table
>>    group by name_id, x, y, f2_x, f2_y
>>
>>   --- etc etc increasing in zoom
>> )
>>
>> The issue being that this does many passes over the table, whereas
>> previously in my Map() I would just emit many times from the same
>> input record and then let it all group in the shuffle and sort.
>> I actually emit 184 times per input record (23 zoom levels of
>> Google Maps times 8 ways to derive the name_id), which means 184
>> union statements. Is it possible in Hive to force it to emit many
>> times from the source record in the stage-1 map?
>>
>> (ahem) Does anyone know if Pig can do this if not in Hive?
>>
>> I hope I have explained this well enough to make sense.
>>
>> Thanks in advance,
>> Tim
>
>
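For illustration only, the wrapper-script route Namit describes might look roughly like the sketch below. It is not the solution from this thread. It assumes the script receives tab-separated rows of (name_id, latitude, longitude) on stdin, which is how Hive's TRANSFORM passes rows by default. The file name expand_tiles.py and the functions to_tile_x and to_tile_y are hypothetical: the latter two use the standard Web Mercator tile formula as a stand-in for the toTileX and toTileY UDFs, whose real logic is not shown here. funct2 and the 8 name_id derivations are left out because the thread does not define them.

```python
import math
import sys

ZOOM_LEVELS = 23  # zoom 0..22, matching the "23 zoom levels" mentioned in the thread
MAX_LAT = 85.05112878  # Web Mercator latitude limit


def to_tile_x(longitude, zoom):
    """Hypothetical stand-in for toTileX: Web Mercator tile column."""
    n = 2 ** zoom
    x = int((longitude + 180.0) / 360.0 * n)
    return min(max(x, 0), n - 1)


def to_tile_y(latitude, zoom):
    """Hypothetical stand-in for toTileY: Web Mercator tile row."""
    n = 2 ** zoom
    lat = math.radians(max(min(latitude, MAX_LAT), -MAX_LAT))
    y = int((1.0 - math.asinh(math.tan(lat)) / math.pi) / 2.0 * n)
    return min(max(y, 0), n - 1)


def expand(line):
    """Yield one tab-separated output row per zoom level for one input row."""
    # Assumed input column order: name_id, latitude, longitude.
    name_id, latitude, longitude = line.rstrip("\n").split("\t")
    lat, lon = float(latitude), float(longitude)
    for zoom in range(ZOOM_LEVELS):
        x, y = to_tile_x(lon, zoom), to_tile_y(lat, zoom)
        yield "\t".join([name_id, str(x), str(y), str(zoom)])


def main(stdin=sys.stdin, stdout=sys.stdout):
    # Each input record is read once and emitted many times,
    # like emitting repeatedly from a single Map() call.
    for line in stdin:
        if line.strip():
            for row in expand(line):
                stdout.write(row + "\n")


# Only stream when input is piped in, as it is when Hive runs the script.
if __name__ == "__main__" and not sys.stdin.isatty():
    main()
```

Under those assumptions the script would be registered with ADD FILE expand_tiles.py; and called from an inner query such as SELECT TRANSFORM (name_id, latitude, longitude) USING 'python expand_tiles.py' AS (name_id, x, y, zoom) FROM table, with the count and GROUP BY applied in an outer query. The table is then scanned once instead of once per UNION ALL branch.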
