From: Apache Wiki
To: pig-commits@incubator.apache.org
Date: Sat, 07 Feb 2009 04:23:04 -0000
Message-ID: <20090207042304.839.36463@eos.apache.org>
Subject: [Pig Wiki] Update of "PigMultiQueryPerformanceSpecification" by GuntherHagleitner

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification.

The following page has been changed by GuntherHagleitner:
http://wiki.apache.org/pig/PigMultiQueryPerformanceSpecification

New page:

[[Anchor(Multi-query_Performance)]]
= Multi-query Performance =

Currently, scripts with multiple store commands can result in a lot of duplicated work. The idea for avoiding the duplication is described here: https://issues.apache.org/jira/browse/PIG-627

[[Anchor(External)]]
== External ==

[[Anchor(Use_cases:)]]
=== Use cases: ===

[[Anchor(Explicit/implicit_split:)]]
==== Explicit/implicit split: ====

There might be cases in which you want to do different processing on separate parts of the same data stream. Like so:

{{{
A = load ...
...
split A' into B if ..., C if ...
...
store B' ...
store C' ...
}}}

or

{{{
A=load ...
...
B=filter A' ...
C=filter A' ...
...
store B' ...
store C' ...
}}}

In the current system the first example will dump A' to disk and then start jobs for B' and C'. In the second example Pig will execute all the dependencies of B' and store it, and then execute all the dependencies of C' and store it.

The two scripts are equivalent, but their performance will differ.

Here's what we plan to do to increase the performance:

 * In the second case we will add an implicit split to transform the query into the first case. That will eliminate processing A' multiple times.
 * Make the split non-blocking and allow processing to continue. This will help reduce the amount of data that has to be stored right at the split.
 * Allow multiple outputs from a job. This way we can store some results as a side-effect. This is also necessary to make the previous item work.
 * Allow multiple split branches to be carried on to the combiner/reducer. This will again reduce the amount of IO in the case where multiple branches of the split can benefit from a combiner run.

[[Anchor(Storing_intermediate_results)]]
==== Storing intermediate results ====

Sometimes people will store intermediate results.

{{{
A=load ...
...
store A' ...
store A''
}}}

If the script doesn't re-load A' for the processing of A'', the steps above A' will be duplicated. This is basically a special case of the second example above, so the same steps are recommended. With the proposed changes the script will basically process A'' and dump A' as a side-effect, which is what the user probably wanted to begin with.

[[Anchor(Why?)]]
=== Why? ===

Pig's philosophy is: Optimize it yourself, why don't you.

However:

 * Implicit splits: It's probably what you expect when you use the same handle in different stores.
 * Store/Load vs Split: When optimizing, it's a reasonable assumption that splits are faster than load/store combinations.
 * Side-effects: There is no way right now to make use of this.

[[Anchor(Changes)]]
=== Changes ===

[[Anchor(Execution_in_batch_mode)]]
==== Execution in batch mode ====

Batch mode is entered when Pig is given a script to execute. Interactive mode is on the grunt shell ("grunt:>"). Right now there isn't much difference between them. In order for us to optimize the multi-query case, we'll need to distinguish the two more.

Right now, whenever the parser sees a store (or dump, explain, illustrate or describe) it will kick off the execution of that part of the script. Part of this proposal is that in batch mode we parse the entire script first and see if we can combine things to reduce the overall amount of work that needs to be done. Only after that will the execution start.

The following changes are proposed (in batch mode):

 * Store will not trigger an immediate execution. The entire script is considered before the execution starts.
 * Splits will be inserted implicitly wherever a handle has multiple children. If the user wants to force re-computation of common ancestors, she has to provide multiple scripts.
 * Multiple split branches/stores in the script will be combined into the same job, if possible. Again, using multiple scripts is the way to avoid this (if that is desired).

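The split-insertion step proposed above can be illustrated with a small sketch. This is not Pig's planner code: the dictionary plan representation, the function name insert_implicit_splits and the "split(...)" node naming are all assumptions made for illustration. It only shows the idea of considering the whole script first and then placing one split under every handle that has more than one child.

```python
from collections import defaultdict


def insert_implicit_splits(plan):
    """Return a copy of `plan` with a split node under every shared handle.

    `plan` is a hypothetical representation: it maps each handle (or store)
    to the list of handles it reads from. This is not Pig's internal format.
    """
    # Collect, for every handle, the nodes that consume it.
    consumers = defaultdict(list)
    for handle, inputs in plan.items():
        for src in inputs:
            consumers[src].append(handle)

    # Work on a copy so the caller's plan is left untouched.
    rewritten = {handle: list(inputs) for handle, inputs in plan.items()}
    for src, users in consumers.items():
        if len(users) > 1:
            split = f"split({src})"  # hypothetical name for the new node
            rewritten[split] = [src]
            # Every consumer now reads from the split instead of the handle.
            for user in users:
                rewritten[user] = [split if i == src else i for i in rewritten[user]]
    return rewritten


# The second use case: two filters over the same handle, each stored.
plan = {
    "A": [],
    "A'": ["A"],
    "B": ["A'"],
    "C": ["A'"],
    "store B": ["B"],
    "store C": ["C"],
}
rewritten = insert_implicit_splits(plan)
```

After the rewrite both B and C read from a single split(A') node, so A' only has to be computed once. A real planner would also have to decide which branches can share a job, which this sketch does not attempt.
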
Diagnostic operators pose some problems for this whole-script approach:

 * They work on handles, which only gives you a slice of the entire script execution at a time. What's more, at the point where they occur in a script they might not give you an accurate picture of the situation, since the execution plans might change once the entire script is handled.
 * They change the logical tree. This means that we need to clone the tree before we run them, something that we want to avoid in batch execution.

The proposal therefore is:

 * Have Pig in batch mode ignore explain, dump, illustrate and describe.
 * Add a load command to the shell to execute a script in interactive mode.
 * Add scripts as a target (in addition to handles) to some diagnostic operators.
 * Add dot as an output type to explain (a graphical rendering of the plan will make multi-query explains more understandable).

That means that while someone is developing a Pig script they can put any diagnostic operator into the script and then go to the grunt shell and load the script. The statement will be executed and give them some information about that part of the script. When a script is loaded, the user will also be able to refer to any handles defined in the script on the shell.

Finally, when the script is ready the user can run the same script in batch mode and all the diagnostic operators are ignored.

[[Anchor(Load)]]
==== Load ====

(See https://issues.apache.org/jira/browse/PIG-574 - this is basically the same as requested there)

The new command has the format:

{{{
load