hive-user mailing list archives

From "Adam J. O'Donnell" <a...@immunet.com>
Subject Re: Working UDF for GeoIP lookup?
Date Mon, 15 Feb 2010 16:27:01 GMT
Edward:

I don't have access to the individual data nodes, so I can't install
the pure-Perl module. I tried distributing it via the add file
command, but that mangles the file name, so Perl fails to load the
module because the file name and package name don't match. Kind of
frustrating, but it really comes down to working around an issue
on Amazon's Elastic MapReduce. I love the service in general, but
some of its issues are frustrating.

Sent from my iPhone

On Feb 15, 2010, at 6:05, Edward Capriolo <edlinuxguru@gmail.com> wrote:

> On Mon, Feb 15, 2010 at 1:29 AM, Adam O'Donnell <adam@immunet.com> wrote:
>>> Hope this helps.
>>>
>>> Carl
>>
>> How about this... can I run a standard hadoop streaming job against a
>> hive table that is stored as a sequence file?  The idea would be I
>> would break my hive query into two separate tasks and do a hadoop
>> streaming job in between, then pick up the hive job afterwards.
>> Thoughts?
>>
>> Adam
>>
>
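> I actually did this with a streaming job. The UDF was tied up with
> the Apache/GPL licensing issues.
>
> Here is how I did it. First, install geo-ip-perl on all datanodes.
> Then run this query, which pipes each remote_ip through the Perl
> script that follows it: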
>
>  ret = qp.run(
>    " FROM ( "+
>    " FROM raw_web_data_hour "+
>    " SELECT transform( remote_ip ) "+
>    " USING 'perl geo_state.pl' "+
>    " AS ip, country_code3, region "+
>    " WHERE log_date_part='"+theDate+"' and log_hour_part='"+theHour+"' " +
>    " ) a " +
>    " INSERT OVERWRITE TABLE raw_web_data_hour_geo PARTITION " +
>    " (log_date_part='"+theDate+"',log_hour_part='"+theHour+"') "+
>    " SELECT a.country_code3, a.region,a.ip,count(1) as theCount " +
>    " GROUP BY a.country_code3,a.region,a.ip "
>    );
>
>
> #!/usr/bin/perl
> use Geo::IP;
> use strict;
> # Open the GeoIP city database installed on each datanode.
> my $gi = Geo::IP->open("/usr/local/share/GeoIP/GeoIPCity.dat",
>                        GEOIP_STANDARD);
> # Hive streams one remote_ip per line on STDIN.
> while (<STDIN>){
>  #my $record = $gi->record_by_name("209.191.139.200");
>  chomp($_);
>  my $record = $gi->record_by_name($_);
>  print STDERR "was sent $_ \n" ;
>  if (defined $record) {
>    # Emit ip, country_code3, region as tab-separated columns.
>    print $_ . "\t" . $record->country_code3 . "\t" . $record->region . "\n"  ;
>    print STDERR "return " . $record->region . "\n" ;
>  } else {
>    # No match: emit a single placeholder column.
>    print "??\n";
>    print STDERR "return was undefined \n";
>  }
>
> }
>
> Good luck.
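
The quoted query is built from many concatenated string fragments, which makes the final HiveQL hard to read. As a rough sketch only, the same statement can be assembled in a small standalone helper so the generated text can be printed and inspected before it is handed to Hive. The class name GeoQueryBuilder, the method name build, and the sample date and hour values are hypothetical and not from the thread; the table, column, and script names are copied from the quoted message.

```java
// Hypothetical helper, not part of the original thread.
final class GeoQueryBuilder {
    private GeoQueryBuilder() {}

    // Builds the HiveQL statement for a single date/hour partition.
    // Assumption: theDate and theHour come from trusted code; they are
    // concatenated unescaped, exactly as in the quoted message.
    static String build(String theDate, String theHour) {
        String partition = "log_date_part='" + theDate
                + "',log_hour_part='" + theHour + "'";
        return String.join("\n",
            "FROM (",
            "  FROM raw_web_data_hour",
            "  SELECT transform( remote_ip )",
            "  USING 'perl geo_state.pl'",
            "  AS ip, country_code3, region",
            "  WHERE log_date_part='" + theDate
                + "' and log_hour_part='" + theHour + "'",
            ") a",
            "INSERT OVERWRITE TABLE raw_web_data_hour_geo PARTITION ("
                + partition + ")",
            "SELECT a.country_code3, a.region, a.ip, count(1) as theCount",
            "GROUP BY a.country_code3, a.region, a.ip");
    }

    public static void main(String[] args) {
        // Placeholder partition values, chosen only for illustration.
        System.out.println(build("2010-02-15", "16"));
    }
}
```

The string returned by build could then be passed to whatever call runs the statement (qp.run in the quoted message).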
