pig-user mailing list archives

From Jonathan Coveney <jcove...@gmail.com>
Subject the nuts and bolts on how algebraic UDFs are invoked? // would this UDF be efficient?
Date Thu, 03 Feb 2011 20:30:02 GMT
Howdy, I am curious about how algebraic UDFs are invoked. I know how to
write them, but in the case where your initial step is to create a costly
data structure, how can you ensure that it is created as few times as
possible?

What I have in mind is this: I have a huge set of data, and a big one. I
essentially want to join the two. I want to create a balanced binary search
tree of the big set of data, and then do the join against that... I imagine
that you could construct the search tree in every initial algebraic call,
and on the intermediate and final you simply will be passed the constructed
search trees. HOWEVER, if the initial call will be on EVERY row of data, it
might be faster to make it an accumulator? Is there a difference in this
case?
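
To make the idea concrete, here is a rough sketch of the tree-and-probe
part in plain Java. It uses no Pig classes at all, so it says nothing about
how or when Pig would actually call the UDF. The class name TreeJoinSketch,
the methods buildIndex and probe, and the long key / String value types are
all made up for illustration. java.util.TreeMap is a red-black tree, so it
stands in for the balanced binary search tree.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Hypothetical sketch only: plain Java, not a Pig UDF.
public class TreeJoinSketch {

    // Build the search tree once from the "big" data set.
    // Assumption: keys are unique; a duplicate key overwrites the earlier value.
    static TreeMap<Long, String> buildIndex(long[] keys, String[] values) {
        TreeMap<Long, String> index = new TreeMap<>();
        for (int i = 0; i < keys.length; i++) {
            index.put(keys[i], values[i]);
        }
        return index;
    }

    // Probe the tree with one key from the "huge" data set and return every
    // value whose key lies in [key - band, key + band]. With band = 0 this
    // is an ordinary equality lookup.
    static List<String> probe(TreeMap<Long, String> index, long key, long band) {
        return new ArrayList<>(
                index.subMap(key - band, true, key + band, true).values());
    }

    public static void main(String[] args) {
        TreeMap<Long, String> index = buildIndex(
                new long[] {10, 20, 30}, new String[] {"a", "b", "c"});
        System.out.println(probe(index, 21, 1)); // keys in [20, 22]
        System.out.println(probe(index, 25, 5)); // keys in [20, 30]
    }
}
```

The expensive part is buildIndex, which is the step I would like to run as
few times as possible; probe is the cheap per-row work.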

Would this sort of strategy for creating a "join" via UDF be efficient? Is
there a more efficient way? I want to avoid implementing something on the
lower Pig level if possible (in large part because I have little experience
with parsers etc.), but am not against that if I have to (in a way, this
could be a trial run for possibly trying to implement some sort of band
join).

I'd appreciate your thoughts.
Jon
