Return-Path: X-Original-To: apmail-pig-dev-archive@www.apache.org Delivered-To: apmail-pig-dev-archive@www.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id 3FB0410C3D for ; Thu, 2 Jan 2014 14:04:39 +0000 (UTC) Received: (qmail 55545 invoked by uid 500); 2 Jan 2014 14:04:33 -0000 Delivered-To: apmail-pig-dev-archive@pig.apache.org Received: (qmail 55488 invoked by uid 500); 2 Jan 2014 14:04:27 -0000 Mailing-List: contact dev-help@pig.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: dev@pig.apache.org Delivered-To: mailing list dev@pig.apache.org Received: (qmail 55470 invoked by uid 99); 2 Jan 2014 14:04:24 -0000 Received: from reviews-vm.apache.org (HELO reviews.apache.org) (140.211.11.40) by apache.org (qpsmtpd/0.29) with ESMTP; Thu, 02 Jan 2014 14:04:24 +0000 Received: from reviews.apache.org (localhost [127.0.0.1]) by reviews.apache.org (Postfix) with ESMTP id 4D4B61D3C63; Thu, 2 Jan 2014 14:04:22 +0000 (UTC) Content-Type: multipart/alternative; boundary="===============6922415694591234808==" MIME-Version: 1.0 Subject: Re: Review Request 16507: PIG-3642 Direct HDFS access for small jobs (fetch) From: "Lorand Bendig" To: "Lorand Bendig" , "pig" , "Cheolsoo Park" Date: Thu, 02 Jan 2014 14:04:22 -0000 Message-ID: <20140102140422.1720.78661@reviews.apache.org> X-ReviewBoard-URL: https://reviews.apache.org Auto-Submitted: auto-generated Sender: "Lorand Bendig" X-ReviewGroup: pig X-ReviewRequest-URL: https://reviews.apache.org/r/16507/ X-Sender: "Lorand Bendig" References: <20131230215031.21784.3136@reviews.apache.org> In-Reply-To: <20131230215031.21784.3136@reviews.apache.org> Reply-To: "Lorand Bendig" X-ReviewRequest-Repository: pig --===============6922415694591234808== Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit > On Dec. 30, 2013, 9:50 p.m., Cheolsoo Park wrote: > > This a great work. Thank you very much! > > > > I have few minor comments below mostly about tests. Cheolsoo, thanks for taking your time to review it! I fixed/commented the issues. > On Dec. 30, 2013, 9:50 p.m., Cheolsoo Park wrote: > > /trunk/src/org/apache/pig/backend/hadoop/executionengine/fetch/FetchOptimizer.java, line 74 > > > > > > Can you move this to PigConfiguration? PigConfiguration seems to be a better place to put OPT_FETCH > On Dec. 30, 2013, 9:50 p.m., Cheolsoo Park wrote: > > /trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POStream.java, line 23 > > > > > > Do you mind removing unused imports? > > - import java.util.LinkedList; > > - import org.apache.pig.impl.util.IdentityHashSet; > > - import org.apache.pig.pen.util.LineageTracer; Sure. Intially didn't want to remove these leftovers from PIG-1712 > On Dec. 30, 2013, 9:50 p.m., Cheolsoo Park wrote: > > /trunk/test/org/apache/pig/test/TestAssert.java, lines 91-94 > > > > > > Does this else block ever get executed given the we're running the test with opt.fetch on? > > > > I think you can do either- > > > > 1) explicitly set opt.fetch to true or false in setup(), > > > > or > > > > 2) change the test to run the query twice with opt.fetch on and off to ensure we're not breaking anything when opt.fetch is off. Not really. I chose the second option > On Dec. 30, 2013, 9:50 p.m., Cheolsoo Park wrote: > > /trunk/test/org/apache/pig/test/TestPigRunner.java, line 174 > > > > > > Why is this changed? I think the default value for opt.multiquery is true. I accidentally changed it - Lorand ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/16507/#review30936 ----------------------------------------------------------- On Dec. 29, 2013, 11:19 p.m., Lorand Bendig wrote: > > ----------------------------------------------------------- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/16507/ > ----------------------------------------------------------- > > (Updated Dec. 29, 2013, 11:19 p.m.) > > > Review request for pig. > > > Bugs: PIG-3642 > https://issues.apache.org/jira/browse/PIG-3642 > > > Repository: pig > > > Description > ------- > > With this patch I'd like to add the possibility to directly read data from HDFS instead of launching MR jobs in case of simple (map-only) tasks. Hive already has this feature (fetch). This patch shares some similarities with the local mode of Pig 0.6. Here, fetching kicks off when the following holds for a script: > > it contains only LIMIT, FILTER, UNION (if no split is generated), STREAM, (nested) FOREACH with expression operators, custom UDFs..etc > no scalar aliases > no SampleLoader > single leaf job > DUMP (no STORE) > > The feature is enabled by default and can be toggled with: > > -N or -no_fetch > set opt.fetch true/false; > > There's no STORE support because I wanted to make it explicit that this "optimization" is for launching small/simple scripts during development, rather than querying and filtering large number of rows on the client machine. However, a threshold could be given on the input size (an estimation) to determine whether to prefer fetch over MR jobs, similar to what Hive's 'hive.fetch.task.conversion.threshold' does. (through Pig's LoadMetadata#getStatistic ?) > > > Diffs > ----- > > /trunk/contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/CSVExcelStorage.java 1553596 > /trunk/contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/FixedWidthLoader.java 1553596 > /trunk/src/org/apache/pig/Main.java 1553596 > /trunk/src/org/apache/pig/PigServer.java 1553596 > /trunk/src/org/apache/pig/backend/hadoop/executionengine/HExecutionEngine.java 1553596 > /trunk/src/org/apache/pig/backend/hadoop/executionengine/fetch/FetchLauncher.java PRE-CREATION > /trunk/src/org/apache/pig/backend/hadoop/executionengine/fetch/FetchOptimizer.java PRE-CREATION > /trunk/src/org/apache/pig/backend/hadoop/executionengine/fetch/FetchPOStoreImpl.java PRE-CREATION > /trunk/src/org/apache/pig/backend/hadoop/executionengine/fetch/FetchProgressableReporter.java PRE-CREATION > /trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/JobControlCompiler.java 1553596 > /trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/expressionOperators/POUserFunc.java 1553596 > /trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POStream.java 1553596 > /trunk/src/org/apache/pig/backend/hadoop/executionengine/util/MapRedUtil.java 1553596 > /trunk/src/org/apache/pig/impl/util/PropertiesUtil.java 1553596 > /trunk/src/org/apache/pig/newplan/logical/expression/ExpToPhyTranslationVisitor.java 1553596 > /trunk/src/org/apache/pig/tools/pigstats/SimpleFetchPigStats.java PRE-CREATION > /trunk/test/org/apache/pig/test/TestAssert.java 1553596 > /trunk/test/org/apache/pig/test/TestEvalPipeline2.java 1553596 > /trunk/test/org/apache/pig/test/TestFetch.java PRE-CREATION > /trunk/test/org/apache/pig/test/TestPigRunner.java 1553596 > > Diff: https://reviews.apache.org/r/16507/diff/ > > > Testing > ------- > > - new testcase added: TestFetch > - the patch was checked against test-commit and test-core > - Because opt.fetch is set by default, the testcases were using fetch instead of MR jobs wherever it was possible > > > Thanks, > > Lorand Bendig > > --===============6922415694591234808==--