hive-user mailing list archives

From Eric Arenas <eare...@rocketmail.com>
Subject Re: Working UDF for GeoIP lookup?
Date Tue, 16 Feb 2010 19:54:35 GMT
Hi Ed,

I created a similar UDF some time ago, and if I am not mistaken you have to assume that your
file is going to be in the same directory, as in:

path_of_dat_file = "./name_of_file";

That worked for me.

Let me know if this solves your issue; if not, I will look into my old code and see how
I did it.
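A minimal sketch of that idea, assuming the .dat file was shipped with Hive's "add file" (the class name and the fallback-directory parameter here are hypothetical, not from my old code): files in the distributed cache typically land in the task's current working directory, so the UDF can open the file by its bare name, as in "./name_of_file".

```java
import java.io.File;
import java.io.FileNotFoundException;

public class DatFileLocator {
    // Files shipped with Hive's "add file" typically appear in the task's
    // current working directory, so opening by bare name usually works.
    // The explicit fallback directory is an assumption, for local testing.
    public static File locate(String name, String fallbackDir)
            throws FileNotFoundException {
        File local = new File("./" + name);
        if (local.exists()) {
            return local;
        }
        File fallback = new File(fallbackDir, name);
        if (fallback.exists()) {
            return fallback;
        }
        throw new FileNotFoundException(name + " not found in working dir or " + fallbackDir);
    }
}
```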

regards
Eric Arenas



----- Original Message ----
From: Edward Capriolo <edlinuxguru@gmail.com>
To: hive-user@hadoop.apache.org
Sent: Tue, February 16, 2010 7:47:30 AM
Subject: Re: Working UDF for GeoIP lookup?

On Mon, Feb 15, 2010 at 12:02 PM, Edward Capriolo <edlinuxguru@gmail.com> wrote:
> On Mon, Feb 15, 2010 at 11:27 AM, Adam J. O'Donnell <adam@immunet.com> wrote:
>> Edward:
>>
>> I don't have access to the individual data nodes, so I can't install the
>> pure Perl module. I tried distributing it via the add file command, but that
>> is mangling the file name, which causes Perl not to load the module, as the
>> file name and package name don't match.  Kinda frustrating, but it is really
>> all about trying to work around an issue on Amazon's Elastic MapReduce.  I
>> love the service in general, but some issues are frustrating.
>>
>> Sent from my iPhone
>>
>> On Feb 15, 2010, at 6:05, Edward Capriolo <edlinuxguru@gmail.com> wrote:
>>
>>> On Mon, Feb 15, 2010 at 1:29 AM, Adam O'Donnell <adam@immunet.com> wrote:
>>>>>
>>>>> Hope this helps.
>>>>>
>>>>> Carl
>>>>
>>>> How about this... can I run a standard hadoop streaming job against a
>>>> hive table that is stored as a sequence file?  The idea would be I
>>>> would break my hive query into two separate tasks and do a hadoop
>>>> streaming job in between, then pick up the hive job afterwards.
>>>> Thoughts?
>>>>
>>>> Adam
>>>>
>>>
>>> I actually did do this with a streaming job. The UDF was tied up with
>>> the Apache/GPL licensing issues.
>>>
>>> Here is how I did this: first, install geo-ip-perl on all datanodes.
>>>
>>>  ret = qp.run(
>>>   " FROM ( "+
>>>   " FROM raw_web_data_hour "+
>>>   " SELECT transform( remote_ip ) "+
>>>   " USING 'perl geo_state.pl' "+
>>>   " AS ip, country_code3, region "+
>>>   " WHERE log_date_part='"+theDate+"' and log_hour_part='"+theHour+"' " +
>>>   " ) a " +
>>>   " INSERT OVERWRITE TABLE raw_web_data_hour_geo PARTITION
>>> (log_date_part='"+theDate+"',log_hour_part='"+theHour+"') "+
>>>   " SELECT a.country_code3, a.region,a.ip,count(1) as theCount " +
>>>   " GROUP BY a.country_code3,a.region,a.ip "
>>>   );
>>>
>>>
>>> #!/usr/bin/perl
>>> use Geo::IP;
>>> use strict;
>>> my $gi = Geo::IP->open("/usr/local/share/GeoIP/GeoIPCity.dat", GEOIP_STANDARD);
>>> while (<STDIN>){
>>>  #my $record = $gi->record_by_name("209.191.139.200");
>>>  chomp($_);
>>>  my $record = $gi->record_by_name($_);
>>>  print STDERR "was sent $_ \n" ;
>>>  if (defined $record) {
>>>   print $_ . "\t" . $record->country_code3 . "\t" . $record->region . "\n";
>>>   print STDERR "return " . $record->region . "\n" ;
>>>  } else {
>>>   print "??\n";
>>>   print STDERR "return was undefined \n";
>>>  }
>>>
>>> }
>>>
>>> Good luck.
>>
>
> Sorry to hear that you're having problems. It is a fairly simple UDF
> for those familiar with writing UDFs/GenericUDFs. You probably could embed the
> lookup data file in the jar as well. I meant to build/host this on my
> site, but I have not gotten around to it. If you want to tag-team it, I
> am interested.
>
So I started working on this:
I packaged geo-ip into a jar:
http://www.jointhegrid.com/svn/geo-ip-java/
And I am building a Hive UDF:
http://www.jointhegrid.com/svn/hive-udf-geo-ip-jtg/

I am running into a problem: I am trying to have the UDF work with two
signatures:

geoip('209.191.139.200', 'STATE_NAME');
geoip('209.191.139.200', 'STATE_NAME', 'path/to/datafile' );

For the first invocation I have bundled the data into the JAR file. I
have verified that I can access it:
http://www.jointhegrid.com/svn/geo-ip-java/trunk/src/LoadInternalData.java

I am trying to do the same thing inside my UDF, but I get FileNotFound
exceptions. I have also tried adding the file to the distributed
cache.
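One possible cause (an assumption on my part, not verified against the UDF code): the GeoIP API wants a real filesystem path, while a file bundled inside a jar is only reachable as a classpath stream, so constructing a `File` for it fails with FileNotFound. A hedged sketch of copying the jar resource to a temp file first; in the real UDF the resource would be "/GeoIP.dat", and the class name here is made up:

```java
import java.io.*;

public class JarResourceFile {
    // Copy a classpath resource (e.g. a .dat file bundled in the UDF jar)
    // to a temporary file, so libraries that require a real path can open it.
    public static File extractToTemp(String resourcePath) throws IOException {
        InputStream in = JarResourceFile.class.getResourceAsStream(resourcePath);
        if (in == null) {
            throw new FileNotFoundException(resourcePath + " not on classpath");
        }
        File tmp = File.createTempFile("geoip", ".dat");
        tmp.deleteOnExit();
        try (InputStream src = in; OutputStream out = new FileOutputStream(tmp)) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = src.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
        }
        return tmp;
    }
}
```

The UDF would then hand `extractToTemp("/GeoIP.dat").getPath()` to the GeoIP lookup instead of a bare file name.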

add file /home/ecapriolo/encrypted-mount-ec/NetBeansProjects/geo-ip-java/src/GeoIP.dat;
add jar /home/ecapriolo/encrypted-mount-ec/NetBeansProjects/geo-ip-java/dist/geo-ip-java.jar;
add jar /home/ecapriolo/encrypted-mount-ec/NetBeansProjects/hive-udf-geo-ip-jtg/dist/hive-udf-geo-ip-jtg.jar;
create temporary function geoip as 'com.jointhegrid.hive.udf.GenericUDFGeoIP';
select geoip(first,'COUNTRY_NAME', 'GeoIP.dat' ) from a;


Any hints? I did notice a JIRA issue about UDFs reading from the distributed
cache, so that may be part of the problem. I still wonder, though, why I
cannot pull the file out of the jar.

-ed

