pig-dev mailing list archives

From "Jonathan Coveney (Updated) (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (PIG-2317) Ruby/Jruby UDFs
Date Thu, 20 Oct 2011 21:07:10 GMT

     [ https://issues.apache.org/jira/browse/PIG-2317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Coveney updated PIG-2317:
----------------------------------

    Attachment: jruby_scripting_6.patch
                pigudf.rb
                pigjruby.rb

So! Made some new changes. There is now an accumulator interface.

{code}
class SUM2 < AccumulatorPigUDF
  output_schema "val:long"

  def exec items
    @sum||=0
    @sum+=items.flatten.inject(:+)
  end

  def get
    @sum
  end
end
{code}

One interesting thing about the accumulator interface is that all of the state is handled
inside of the Ruby class, so if you want intermediate objects, it's all there. The cleanup
step is just throwing away the instance, and a new one will be instantiated if the interface
is invoked again.
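
A rough plain-Ruby sketch of that lifecycle, runnable without Pig. The AccumulatorPigUDF here is only an empty stand-in for the class in pigudf.rb, and run_accumulator is a hypothetical driver made up for illustration (the real one lives on the Java side of the patch). It uses inject(0, :+) so an empty batch adds 0 instead of nil.

```ruby
# Stand-in for the base class in pigudf.rb (assumption: the real one does more).
class AccumulatorPigUDF
  def self.output_schema(schema)
    @schema = schema
  end
end

class SUM2 < AccumulatorPigUDF
  output_schema "val:long"

  # Called once per batch of tuples; the running total lives in the instance.
  def exec(items)
    @sum ||= 0
    @sum += items.flatten.inject(0, :+)
  end

  def get
    @sum
  end
end

# Hypothetical driver: a fresh instance per group, fed in batches,
# then simply dropped so no state leaks into the next group.
def run_accumulator(udf_class, batches)
  udf = udf_class.new
  batches.each { |batch| udf.exec(batch) }
  udf.get
end

puts run_accumulator(SUM2, [[[1], [2]], [[3]]])  # prints 6
puts run_accumulator(SUM2, [[[10]]])             # prints 10
```

The second call starts from zero because it gets its own instance, which is all the "cleanup" amounts to in this sketch.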

Algebraic UDFs are easier than ever.

{code}
class SUM < AlgebraicPigUDF
  output_schema "val:long"
  
  def initial item
    item
  end
  
  def intermed items
    items.flatten.inject(:+)
  end
  
  def final items
    intermed items
  end
end

class WORDCOUNT < AlgebraicPigUDF
  output_schema "val:long"

  def initial item
    item ? item.split.length : 0
  end

  def intermed items
    items.flatten.inject(:+)
  end
  
  def final items
    intermed items
  end
end
{code}
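
To see how the three methods fit together, here is a plain-Ruby sketch that runs without Pig. AlgebraicPigUDF is an empty stand-in for the class in pigudf.rb, and run_algebraic is a hypothetical driver that imitates the map, combine and reduce stages; the partitioning is invented for the example. It uses inject(0, :+) so that empty input yields 0 rather than nil.

```ruby
# Stand-in for the base class in pigudf.rb (assumption: the real one does more).
class AlgebraicPigUDF
  def self.output_schema(schema)
    @schema = schema
  end
end

class WORDCOUNT < AlgebraicPigUDF
  output_schema "val:long"

  # Per-record step: count the words in one line (nil counts as 0).
  def initial(item)
    item ? item.split.length : 0
  end

  # Combine step: add up partial counts.
  def intermed(items)
    items.flatten.inject(0, :+)
  end

  # Final step: same arithmetic as the combine step.
  def final(items)
    intermed(items)
  end
end

# Hypothetical driver: initial on every record, intermed per partition,
# final over the per-partition results.
def run_algebraic(udf_class, partitions)
  udf = udf_class.new
  partials = partitions.map do |rows|
    udf.intermed(rows.map { |row| udf.initial(row) })
  end
  udf.final(partials)
end

puts run_algebraic(WORDCOUNT, [["a b c", nil], ["d e"]])  # prints 5
```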

One of the more exciting changes (to me...) is that I have added DataBags as a native Ruby
object, so it's super easy to use them. If you require the pigudf package, you can call DataBag.new.
Examples of how to use it follow:
{code}
jruby -J-Xmx1024m -S irb
{code}
This ensures that you have enough heap space.

{code}
require 'pigudf'
db=DataBag.new
{code}
db is now a DataBag! To test that it spills properly, we do...
{code}
(0..10000000).each {|x| db.add(x)}
{code}

On my computer, with the heap size we specified, it spilled once. But it spills! Also, a note:
arrays still convert to tuples, and a bag can either accept ONE argument, or an array of arguments.
The one argument thing is a convenience function. I will probably make it a varargs for conciseness.
But that means you can do

{code}
db.add(1)
{code}

or

{code}
db.add([1])
{code}
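
The two forms can be pictured with a small plain-Ruby stand-in. SketchBag is hypothetical and is not the RubyDataBag from the patch (no Pig tuples, no spilling); it also assumes that an array passed to add becomes a single tuple, which is one reading of "arrays still convert to tuples".

```ruby
# Hypothetical stand-in for illustration only; nested arrays play the role of tuples.
class SketchBag
  def initialize
    @tuples = []
  end

  # A single non-array argument is wrapped into a one-field tuple;
  # an array is taken as the fields of one tuple (assumption).
  def add(elem)
    @tuples << (elem.is_a?(Array) ? elem : [elem])
    self
  end

  def size
    @tuples.size
  end

  def to_a
    @tuples.dup
  end
end

bag = SketchBag.new
bag.add(1)       # stored as [1]
bag.add([1])     # stored as [1] as well
bag.add([2, 3])  # stored as one two-field tuple
p bag.to_a       # prints [[1], [1], [2, 3]]
```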

After running the each above, you get:

{code}
ree-1.8.7-2010.02 :009 > db.size()
 => 10000001
{code}

Nice! I need to look into how to get JRuby to generate better docs, but if you look at RubyDataBag.java
in the patch you can see the API (anything marked with @JRubyMethod). I'll summarize here.

{code}
DataBag.new, DataBag.new db
{code}
DataBag has two initializers: the default initializer just creates an empty databag, and the
second takes a databag and copies it over. There is also

{code}
db.add_all db2, db.copy db2
{code}
which pulls all of the data out of the given DataBag or RubyDataBag.

{code}
db.to_s,db.to_string,db.inspect
{code}
Return a string view. If you call db.to_s(true), you'll also see the contents (useful for debugging).

{code}
db.size,db.length
{code}
number of elements in the bag

{code}
db.add(elem) or db.add([e1,e2,e3])
{code}
Add the elements to the bag

{code}
db.distinct?, db.is_distinct?
{code}
Returns whether the bag is distinct

{code}
db.sorted?, db.is_sorted?
{code}
Returns whether the bag is sorted

{code}
db.clear
{code}
clears the databag

{code}
db.empty?
{code}
Returns whether the bag is empty

{code}
db.each
{code}
One thing that I did with the DataBag implementation is that I had it include Enumerable,
and implement each. This means that all of the fun commands you like to use in Ruby, like map
and so on, should work. Also, for convenience, I implemented a flatten command

{code}
db.flatten or db.flat_each
 => #<Enumerable::Enumerator:0x8939ec3 @__args__=[], @__object__=[DataBag: size: 10000001], @__method__=:flat_each>
{code}
What this does is create an object that accepts .each {block}, but will flatten the value
out of the Tuple before passing it to the block. This allows you to efficiently do things
like db.flatten.inject(:+), because it is pulling the element out of the tuple on each block
invocation instead of doing the naive thing, which would be to create an array of the output.
One thing to keep in mind though is that this only pulls out the first element of each tuple.
I guess I could change that. Am undecided.
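
The pattern can be sketched in plain Ruby. EnumBag below is a hypothetical stand-in (nested arrays instead of Pig tuples, nothing from the patch); it only shows how including Enumerable plus a lazy flat_each gives map, inject and friends without building an intermediate array.

```ruby
# Hypothetical stand-in for illustration; not the RubyDataBag from the patch.
class EnumBag
  include Enumerable

  def initialize(tuples = [])
    @tuples = tuples
  end

  # Enumerable only needs each; map, inject, select etc. come for free.
  def each(&block)
    return enum_for(:each) unless block
    @tuples.each(&block)
    self
  end

  # Yields only the first field of each tuple, one at a time.
  # Without a block it returns an enumerator instead of an array.
  def flat_each
    return enum_for(:flat_each) unless block_given?
    @tuples.each { |tuple| yield tuple.first }
    self
  end
  alias_method :flatten, :flat_each
end

bag = EnumBag.new([[1, "a"], [2, "b"], [3, "c"]])
puts bag.flatten.inject(:+)       # prints 6
p bag.map { |tuple| tuple.last }  # prints ["a", "b", "c"]
```

Only the first field comes through flatten here, matching the caveat above; the second field is still reachable through the plain each.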

And lastly, there is...

{code}
db.iterator
{code}
Returns a BagIterator. This is basically a simplified access point that is very similar to
bag, except with less power.

{code}
db.get, db.getNext, db.get_next
{code}

{code}
db.has_next?, db.hasNext, db.has_next, db.next?
{code}

and it supports the exact same map semantics as bag does.
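
For a feel of how such an iterator behaves, here is a plain-Ruby sketch. SketchIterator is hypothetical and only mirrors the has_next? and get_next names listed above; raising StopIteration when exhausted is an assumption, not something taken from the patch.

```ruby
# Hypothetical stand-in for illustration; not the BagIterator from the patch.
class SketchIterator
  include Enumerable

  def initialize(tuples)
    @tuples = tuples
    @pos = 0
  end

  def has_next?
    @pos < @tuples.length
  end

  # Returns the next tuple and advances; raising on exhaustion is an assumption.
  def get_next
    raise StopIteration, "iterator exhausted" unless has_next?
    tuple = @tuples[@pos]
    @pos += 1
    tuple
  end

  # each drains whatever is left, so map and friends see only the remaining tuples.
  def each
    return enum_for(:each) unless block_given?
    yield get_next while has_next?
    self
  end
end

it = SketchIterator.new([[1], [2], [3]])
p it.get_next                       # prints [1]
p it.map { |t| t.first * 10 }       # prints [20, 30]
p it.has_next?                      # prints false
```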

Phew! Ok. Definitely would love feedback. I'm going to work on making UDFs in-line, and need
to write tests....
                
> Ruby/Jruby UDFs
> ---------------
>
>                 Key: PIG-2317
>                 URL: https://issues.apache.org/jira/browse/PIG-2317
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Jacob Perkins
>            Assignee: Jacob Perkins
>            Priority: Minor
>             Fix For: 0.9.2
>
>         Attachments: PigUdf.rb, PigUdf.rb, jruby_scripting.patch, jruby_scripting_2_real.patch, jruby_scripting_3.patch, jruby_scripting_4.patch, jruby_scripting_5.patch, jruby_scripting_6.patch, pigjruby.rb, pigjruby.rb, pigjruby.rb, pigudf.rb
>
>
> It should be possible to write UDFs in Ruby. These UDFs will be registered in the same way as python and javascript UDFs.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
