Date: Sat, 3 Mar 2012 00:11:58 +0000 (UTC)
From: "Zhenxiao Luo (Commented) (JIRA)"
To: mapreduce-issues@hadoop.apache.org
Reply-To: mapreduce-issues@hadoop.apache.org
Message-ID: <45073056.15945.1330733518095.JavaMail.tomcat@hel.zones.apache.org>
In-Reply-To: <822161199.6144.1330578726736.JavaMail.tomcat@hel.zones.apache.org>
Subject: [jira] [Commented] (MAPREDUCE-3952) In MR2, when Total input paths to process == 1, CombinefileInputFormat.getSplits() returns 0 split.

[ https://issues.apache.org/jira/browse/MAPREDUCE-3952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13221369#comment-13221369 ]

Zhenxiao Luo commented on MAPREDUCE-3952:
-----------------------------------------

@Bhallamudi

Yes. Judging from the execution log, the input file seems to be empty:

2012-02-28 15:56:37,219 INFO exec.ExecDriver (ExecDriver.java:addInputPath(829)) - Changed input file to file:/tmp/cloudera/hive_2012-02-28_15-56-37_188_1216173472421796708/-mr-10000/1
2012-02-28 15:56:37,226 INFO util.NativeCodeLoader (NativeCodeLoader.java:(50)) - Loaded the native-hadoop library
2012-02-28 15:56:37,610 INFO jvm.JvmMetrics (JvmMetrics.java:init(76)) - Initializing JVM Metrics with processName=JobTracker, sessionId=
2012-02-28 15:56:37,626 INFO exec.ExecDriver (ExecDriver.java:createTmpDirs(234)) - Making Temp Directory: file:/tmp/cloudera/hive_2012-02-28_15-56-26_431_554636048819260524/-mr-10003
2012-02-28 15:56:37,657 INFO jvm.JvmMetrics (JvmMetrics.java:init(71)) - Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
2012-02-28 15:56:37,684 WARN mapreduce.JobSubmitter (JobSubmitter.java:copyAndConfigureFiles(139)) - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
2012-02-28 15:56:37,960 WARN snappy.LoadSnappy (LoadSnappy.java:(36)) - Snappy native library is available
2012-02-28 15:56:37,961 INFO snappy.LoadSnappy (LoadSnappy.java:(44)) - Snappy native library loaded
2012-02-28 15:56:37,969 INFO io.CombineHiveInputFormat (CombineHiveInputFormat.java:getSplits(370)) - CombineHiveInputSplit creating pool for file:/tmp/cloudera/hive_2012-02-28_15-56-37_188_1216173472421796708/-mr-10000/1; using filter path file:/tmp/cloudera/hive_2012-02-28_15-56-37_188_1216173472421796708/-mr-10000/1
2012-02-28 15:56:37,970 WARN conf.Configuration (Configuration.java:handleDeprecation(326)) - mapred.min.split.size is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize
2012-02-28 15:56:37,970 WARN conf.Configuration (Configuration.java:handleDeprecation(326)) - mapred.min.split.size.per.node is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize.per.node
2012-02-28 15:56:37,971 WARN conf.Configuration (Configuration.java:handleDeprecation(326)) - mapred.min.split.size.per.rack is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize.per.rack
2012-02-28 15:56:37,971 WARN conf.Configuration (Configuration.java:handleDeprecation(326)) - mapred.max.split.size is deprecated. Instead, use mapreduce.input.fileinputformat.split.maxsize
2012-02-28 15:56:37,977 INFO input.FileInputFormat (FileInputFormat.java:listStatus(245)) - Total input paths to process : 1
2012-02-28 15:56:37,982 INFO io.CombineHiveInputFormat (CombineHiveInputFormat.java:getSplits(388)) - Arrays.asList iss
2012-02-28 15:56:37,982 INFO io.CombineHiveInputFormat (CombineHiveInputFormat.java:getSplits(410)) - iss size: 0
2012-02-28 15:56:37,983 INFO io.CombineHiveInputFormat (CombineHiveInputFormat.java:getSplits(417)) - number of splits 0

And, in MR1, the log looks like:

2012-02-28 14:09:54,554 INFO exec.ExecDriver (ExecDriver.java:addInputPath(829)) - Changed input file to file:/tmp/cloudera/hive_2012-02-28_14-09-54_515_1377575814725676804/-mr-10000/1
2012-02-28 14:09:54,855 INFO jvm.JvmMetrics (JvmMetrics.java:init(71)) - Initializing JVM Metrics with processName=JobTracker, sessionId=
2012-02-28 14:09:54,871 INFO exec.ExecDriver (ExecDriver.java:createTmpDirs(234)) - Making Temp Directory: file:/tmp/cloudera/hive_2012-02-28_14-09-44_700_3241431154033268523/-mr-10003
2012-02-28 14:09:54,881 WARN mapred.JobClient (JobClient.java:configureCommandLineOptions(539)) - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
2012-02-28 14:09:55,037 INFO io.CombineHiveInputFormat (CombineHiveInputFormat.java:getSplits(370)) - CombineHiveInputSplit creating pool for file:/tmp/cloudera/hive_2012-02-28_14-09-54_515_1377575814725676804/-mr-10000/1; using filter path file:/tmp/cloudera/hive_2012-02-28_14-09-54_515_1377575814725676804/-mr-10000/1
2012-02-28 14:09:55,042 INFO mapred.FileInputFormat (FileInputFormat.java:listStatus(192)) - Total input paths to process : 1
2012-02-28 14:09:55,056 INFO io.CombineHiveInputFormat (CombineHiveInputFormat.java:getSplits(406)) - iss size: 1
2012-02-28 14:09:55,057 INFO io.CombineHiveInputFormat (CombineHiveInputFormat.java:getSplits(409)) - adding inputSplitShim into result: Paths:/tmp/cloudera/hive_2012-02-28_14-09-54_515_1377575814725676804/-mr-10000/1/emptyFile:0+0 Locations:/default-rack:; InputFormatClass: org.apache.hadoop.mapred.TextInputFormat
2012-02-28 14:09:55,057 INFO io.CombineHiveInputFormat (CombineHiveInputFormat.java:getSplits(413)) - number of splits 1

So, in MR1, submitting a job whose input is an empty file yields splits.length == 1, while in MR2 the same job yields splits.length == 0.

The case happens in Hive (https://issues.apache.org/jira/browse/HIVE-2783) when running the following query:

select * from (
  select key, value, ds from t1_new
  union all
  select key, value, t1_old.ds from t1_old join t1_mapping
  on t1_old.keymap = t1_mapping.keymap and t1_old.ds = t1_mapping.ds
) subq
where ds = '2011-10-13';

The second MR job executes:

select key, value, ds from t1_new

and the job submitted for it has an empty input file.

My understanding might be wrong. Please correct me if I have gotten anything wrong.

Thanks,
Zhenxiao

> In MR2, when Total input paths to process == 1, CombinefileInputFormat.getSplits() returns 0 split.
> ---------------------------------------------------------------------------------------------------
>
> Key: MAPREDUCE-3952
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-3952
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Components: mrv2
> Reporter: Zhenxiao Luo
>
> Hive gets unexpected results when using MR2 (with MR1, it always gets the expected result).
> In MR2, when Total input paths to process == 1, CombineFileInputFormat.getSplits() returns 0 splits.
> The calling code in Hive, in Hadoop23Shims.java:
> InputSplit[] splits = super.getSplits(job, numSplits);
> This yields splits.length == 0.
> In MR1, everything works fine. The calling code in Hive, in Hadoop20Shims.java:
> CombineFileSplit[] splits = (CombineFileSplit[]) super.getSplits(job, numSplits);
> This yields splits.length == 1.
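
To make the difference described above concrete, here is a minimal toy sketch in plain Java. It does not use or reproduce any Hadoop or Hive code: `Split`, `splitsSkippingEmptyFiles`, and `withEmptyPlaceholder` are hypothetical names, and the path is made up. It only models the two outcomes seen in the logs (zero splits versus a single 0+0 split for an empty file) plus one possible caller-side guard. It makes no claim about where the actual behaviour change lives in MR2.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model only: none of these types are Hadoop or Hive classes. */
public class EmptyInputSplitSketch {

    /** Hypothetical stand-in for an input split: a path plus a byte range. */
    public static final class Split {
        public final String path;
        public final long start;
        public final long length;

        public Split(String path, long start, long length) {
            this.path = path;
            this.start = start;
            this.length = length;
        }

        @Override
        public String toString() {
            // Mirrors the "path:start+length" shape seen in the MR1 log line.
            return path + ":" + start + "+" + length;
        }
    }

    /** Models the MR2 observation: a zero-length file contributes no split. */
    public static List<Split> splitsSkippingEmptyFiles(String[] paths, long[] lengths) {
        List<Split> out = new ArrayList<>();
        for (int i = 0; i < paths.length; i++) {
            if (lengths[i] > 0) {
                out.add(new Split(paths[i], 0, lengths[i]));
            }
        }
        return out;
    }

    /**
     * Assumed caller-side guard: if there were input paths but no splits came
     * back, return a single 0+0 placeholder split (the shape MR1 reported).
     */
    public static List<Split> withEmptyPlaceholder(List<Split> splits, String[] paths) {
        if (splits.isEmpty() && paths.length > 0) {
            List<Split> out = new ArrayList<>();
            out.add(new Split(paths[0], 0, 0));
            return out;
        }
        return splits;
    }

    public static void main(String[] args) {
        String[] paths = {"/tmp/example/emptyFile"}; // made-up path
        long[] lengths = {0L};                        // the file is empty

        List<Split> raw = splitsSkippingEmptyFiles(paths, lengths);
        System.out.println("number of splits without guard: " + raw.size());

        List<Split> guarded = withEmptyPlaceholder(raw, paths);
        System.out.println("number of splits with guard: " + guarded.size());
    }
}
```

A caller that assumes at least one split per input path (as the Hive query above effectively does) would see the first result as "no work to do", which is one way an empty branch of a union could silently drop out of the plan.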