Date: Thu, 3 Feb 2011 15:30:02 -0500
Subject: the nuts and bolts on how algebraic UDFs are invoked? // would this UDF be efficient?
From: Jonathan Coveney
To: user@pig.apache.org

Howdy,

I am curious about how algebraic UDFs are invoked. I know how to write them, but in the case where your initial step is to create a costly data structure, how can you ensure that it is created as few times as possible?

What I have in mind is this: I have a huge set of data, and a big one. I essentially want to join the two. I want to create a balanced binary search tree of the big set of data, and then do the join against that...

I imagine that you could construct the search tree in every initial algebraic call, and in the intermediate and final calls you would simply be passed the constructed search trees. HOWEVER, if the initial call will be on EVERY row of data, it might be faster to make it an accumulator? Is there a difference in this case?

Would this sort of strategy for creating a "join" via UDF be efficient? Is there a more efficient way?

I want to avoid implementing something at the lower Pig level if possible (in large part because I have little experience with parsers etc.), but am not against that if I have to (in a way, this could be a trial run for possibly trying to implement some sort of band join).

I'd appreciate your thoughts.

Jon
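To make the idea concrete, here is a minimal sketch in plain Java of "build the tree once, then probe it per row". It does not use any Pig API, and it does not answer how often Pig would invoke the build step, which is the open question above. The class name `TreeProbeJoin`, its methods, and the sample keys are all hypothetical; `java.util.TreeMap` (a red-black tree) stands in for the balanced binary search tree, and the band probe is only an assumption about what a band join would need.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.TreeMap;

// Hypothetical illustration only: plain Java, no Pig classes involved.
public class TreeProbeJoin {

    // Build the index a single time. TreeMap is a red-black tree,
    // i.e. a balanced binary search tree keyed on the join key.
    public static TreeMap<Long, List<String>> buildIndex(long[] keys, String[] values) {
        TreeMap<Long, List<String>> index = new TreeMap<>();
        for (int i = 0; i < keys.length; i++) {
            index.computeIfAbsent(keys[i], k -> new ArrayList<>()).add(values[i]);
        }
        return index;
    }

    // Equi-join probe: all indexed values whose key equals the row's key.
    public static List<String> probe(TreeMap<Long, List<String>> index, long key) {
        List<String> hit = index.get(key);
        return hit == null ? Collections.<String>emptyList() : hit;
    }

    // Band-join probe: all indexed values whose key lies in
    // [key - width, key + width], both ends inclusive.
    public static List<String> probeBand(TreeMap<Long, List<String>> index, long key, long width) {
        List<String> out = new ArrayList<>();
        for (List<String> vs : index.subMap(key - width, true, key + width, true).values()) {
            out.addAll(vs);
        }
        return out;
    }

    public static void main(String[] args) {
        // Made-up sample data for the indexed side of the join.
        TreeMap<Long, List<String>> index = buildIndex(
                new long[] {10, 20, 20, 35},
                new String[] {"a", "b", "c", "d"});

        // Each "row" of the other data set probes the already-built tree.
        System.out.println(probe(index, 20));        // prints [b, c]
        System.out.println(probe(index, 99));        // prints []
        System.out.println(probeBand(index, 22, 3)); // prints [b, c]
    }
}
```

The point of the sketch is the cost split: building is O(n log n) and should happen as rarely as possible, while each probe is O(log n) plus the size of the match. Whether an algebraic or an accumulator UDF gives you that split in Pig depends on how often the build step actually runs.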