hive-user mailing list archives

From Tim Robertson <timrobertson...@gmail.com>
Subject Re: Java UDF
Date Wed, 28 Apr 2010 08:11:47 GMT
Ok, so it turns out I overlooked some things in my current MR job (the
configure() part), and a UDF isn't enough.

I do want to use the Hive Map keyword and call my own MR map().

Currently my map() looks like the following, which works on a tab-delimited
input file:

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, Text> collector, Reporter reporter)
      throws IOException {
    Pattern tab = Pattern.compile("\t");
    String[] atoms = tab.split(value.toString());
    String parsed = myParseFunction(atoms);
    // emit the parsed name as the key and the original first column as the value
    collector.collect(new Text(parsed), new Text(atoms[0]));
  }
What would I need to implement to make this usable with the MAP keyword in
Hive, please, so I can run it with input from table 1 to populate table 2?
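
If I understand correctly, the MAP keyword just streams the selected columns to
the script's stdin as tab-separated lines and splits whatever it writes to
stdout on tabs again. So my guess (untested, and the class name is just a
placeholder) is that I'd wrap the same logic in a small standalone program,
something like:

  import java.io.BufferedReader;
  import java.io.InputStreamReader;
  import java.util.regex.Pattern;

  // Reads tab-separated rows from stdin, writes "parsedName<TAB>firstColumn" to stdout.
  public class ParseNameMapper {

    public static void main(String[] args) throws Exception {
      Pattern tab = Pattern.compile("\t");
      BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
      String line;
      while ((line = in.readLine()) != null) {
        String[] atoms = tab.split(line);
        System.out.println(myParseFunction(atoms) + "\t" + atoms[0]);
      }
    }

    // Stand-in for my existing parsing logic.
    private static String myParseFunction(String[] atoms) {
      return atoms[0];
    }
  }

Is that roughly the right idea, or does the MAP keyword expect something else?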

Sorry for this confusion, but it is not really clear to me - all help is
very gratefully received.

Cheers
Tim




On Tue, Apr 27, 2010 at 8:06 PM, Avram Aelony <aavram@mac.com> wrote:

>
> Hi -
>
> If you would like to "simply take an input String (Text) run some Java and return a
> new (Text) by calling a function" then you may wish to consider using the "map" and "reduce"
> keywords directly from Hive and using a scripting language like Perl that contains your mapper
> and reducer code.
>
> for example:
>
> create external table some_input_table ( field_1 string ) row format (etc...);
> create table your_next_table ( output_field_1 string, output_field_2 string, output_field_3 string );
>
>
> from (
>    from some_input_table i
>      map i.field_1 using 'some_custom_mapper_code.pl' ) mapper_output
>    insert overwrite table your_next_table
>      reduce mapper_output.* using 'some_custom_reducer_code.pl'
>        as output_field_1, output_field_2, output_field_3
> ;
>
> --test it
> select * from your_next_table ;
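>
> If you don't actually need the reduce step, I believe you can also map straight
> into the target table (syntax from memory, so please double-check it):
>
> from some_input_table i
>   insert overwrite table your_next_table
>     map i.field_1 using 'some_custom_mapper_code.pl'
>       as output_field_1, output_field_2, output_field_3
> ;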
>
> Hope that helps.
>
> cheers,
> Avram
>
>
>
>
>
> On Tuesday, April 27, 2010, at 10:55AM, "Tim Robertson" <timrobertson100@gmail.com> wrote:
> >
>
>  Thanks Edward,
>
>  I get where you are coming from now with that explanation.
>
>  Cheers,
> Tim
>
>
> On Tue, Apr 27, 2010 at 7:53 PM, Edward Capriolo <edlinuxguru@gmail.com> wrote:
>
>>
>>
>> On Tue, Apr 27, 2010 at 1:48 PM, Tim Robertson <timrobertson100@gmail.com> wrote:
>>
>>> Hmmm... I am not trying to serialize or deserialize custom content, but
>>> simply take an input String (Text) run some Java  and return a new (Text) by
>>> calling a function
>>>
>>> Looking at "public class UDFYear extends UDF {", the annotation at the top
>>> suggests that extending UDF and adding the annotation might be enough.
>>>
>>>  I'll try it anyways...
>>> Tim
>>>
>>> On Tue, Apr 27, 2010 at 7:37 PM, Adam O'Donnell <adam@immunet.com> wrote:
>>>
>>>> It sounds like what you want is a custom SerDe.  I have tried to write
>>>> one but ran into some difficulty.
>>>>
>>>> On Tue, Apr 27, 2010 at 10:13 AM, Tim Robertson
>>>>  <timrobertson100@gmail.com> wrote:
>>>> > Thanks Edward,
>>>> > You are indeed correct - I am confused!
>>>> > So I checked out the source, and poked around.  If I were to extend UDF and
>>>> > implement public Text evaluate(Text source) { would I be heading along the
>>>> > correct lines to use what you say above?
>>>> > Thanks,
>>>> > Tim
>>>> >
>>>> >
>>>> > On Tue, Apr 27, 2010 at 5:11 PM, Edward Capriolo <edlinuxguru@gmail.com> wrote:
>>>> >>
>>>> >>
>>>> >> On Tue, Apr 27, 2010 at 10:22 AM, Tim Robertson
>>>> >> <timrobertson100@gmail.com> wrote:
>>>> >>>
>>>> >>> Hi,
>>>> >>> I currently run a MapReduce job to rewrite a tab delimited file, and then
>>>> >>> I use Hive for everything after that stage.
>>>> >>> Am I correct in thinking that I can create a Jar with my own method which
>>>> >>> can then be called in SQL?
>>>> >>> Would the syntax be:
>>>> >>>   hive> ADD JAR /tmp/parse.jar;
>>>> >>>   hive> INSERT OVERWRITE TABLE target SELECT s.id,
>>>> >>>     s.canonical, parsedName FROM source s MAP s.canonical using 'parse' as
>>>> >>>     parsedName;
>>>> >>> and parse be a MR job?  If so what are the input and output formats
>>>> >>> please for the parse?  Or is it a class implementing an interface perhaps
>>>> >>> and Hive take care of the rest?
>>>> >>> Thanks for any pointers,
>>>> >>> Tim
>>>> >>>
>>>> >>
>>>> >> Tim,
>>>> >>
>>>> >> A UDF is an SQL function like toString() or max().
>>>> >> An InputFormat teaches Hive to read data from key/value files.
>>>> >> A SerDe tells Hive how to parse input data into columns.
>>>> >> Finally, the map(), reduce(), transform() keywords you described are a way to
>>>> >> pipe data to an external process and read the results back in. Almost like a
>>>> >> non-native-to-Hive UDF.
>>>> >>
>>>> >> So you have munged up 4 concepts together :) Do not feel bad however, I
>>>> >> struggled through an input format for the last month.
>>>> >>
>>>> >> It sounds most like you want a UDF that takes a string and returns a
>>>> >> canonical representation.
>>>> >>
>>>> >>
>>>> >>   hive> ADD JAR /tmp/parse.jar;
>>>> >> create temporary function canonical as 'my.package.canonical';
>>>> >> select canonical(my_column) from source;
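>>>> >>
>>>> >> Roughly, the class behind it would look something like this (untested
>>>> >> sketch - the class name and the trim() are just placeholders, and whatever
>>>> >> fully-qualified name you compile it under goes in the create temporary
>>>> >> function statement):
>>>> >>
>>>> >>   import org.apache.hadoop.hive.ql.exec.UDF;
>>>> >>   import org.apache.hadoop.io.Text;
>>>> >>
>>>> >>   public final class Canonical extends UDF {
>>>> >>     // Hive calls evaluate() once per row, resolving it by reflection.
>>>> >>     public Text evaluate(Text source) {
>>>> >>       if (source == null) {
>>>> >>         return null;
>>>> >>       }
>>>> >>       // placeholder for the real canonicalization logic
>>>> >>       return new Text(source.toString().trim());
>>>> >>     }
>>>> >>   }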
>>>> >>
>>>> >> Regards,
>>>> >>
>>>> >>
>>>> >>
>>>> >
>>>> >
>>>>
>>>>
>>>>
>>>>  --
>>>> Adam J. O'Donnell, Ph.D.
>>>> Immunet Corporation
>>>> Cell: +1 (267) 251-0070
>>>>
>>>
>>>
>>  Tim,
>>
>> I think you are on the right track with the UDF approach.
>>
>> You could accomplish something similar with a SerDe, except from the
>> client perspective it would be more "transparent".
>>
>> A UDF is a bit more reusable than a SerDe. You can only choose a SerDe once,
>> when the table is created, but your UDF is applied on the result set.
>>
>> Edward
>>
>
>
