hadoop-mapreduce-issues mailing list archives

From "Zhenxiao Luo (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAPREDUCE-3952) In MR2, when Total input paths to process == 1, CombinefileInputFormat.getSplits() returns 0 split.
Date Sat, 03 Mar 2012 00:11:58 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-3952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13221369#comment-13221369 ]

Zhenxiao Luo commented on MAPREDUCE-3952:
-----------------------------------------

@Bhallamudi

Yes. Judging from the execution log, the input file appears to be empty. In MR2, the log looks like:


2012-02-28 15:56:37,219 INFO  exec.ExecDriver (ExecDriver.java:addInputPath(829)) - Changed
input file to file:/tmp/cloudera/hive_2012-02-28_15-56-37_188_1216173472421796708/-mr-10000/1
2012-02-28 15:56:37,226 INFO  util.NativeCodeLoader (NativeCodeLoader.java:<clinit>(50))
- Loaded the native-hadoop library
2012-02-28 15:56:37,610 INFO  jvm.JvmMetrics (JvmMetrics.java:init(76)) - Initializing JVM
Metrics with processName=JobTracker, sessionId=
2012-02-28 15:56:37,626 INFO  exec.ExecDriver (ExecDriver.java:createTmpDirs(234)) - Making
Temp Directory: file:/tmp/cloudera/hive_2012-02-28_15-56-26_431_554636048819260524/-mr-10003
2012-02-28 15:56:37,657 INFO  jvm.JvmMetrics (JvmMetrics.java:init(71)) - Cannot initialize
JVM Metrics with processName=JobTracker, sessionId= - already initialized
2012-02-28 15:56:37,684 WARN  mapreduce.JobSubmitter (JobSubmitter.java:copyAndConfigureFiles(139))
- Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for
the same.
2012-02-28 15:56:37,960 WARN  snappy.LoadSnappy (LoadSnappy.java:<clinit>(36)) - Snappy
native library is available
2012-02-28 15:56:37,961 INFO  snappy.LoadSnappy (LoadSnappy.java:<clinit>(44)) - Snappy
native library loaded
2012-02-28 15:56:37,969 INFO  io.CombineHiveInputFormat (CombineHiveInputFormat.java:getSplits(370))
- CombineHiveInputSplit creating pool for file:/tmp/cloudera/hive_2012-02-28_15-56-37_188_1216173472421796708/-mr-10000/1;
using filter path file:/tmp/cloudera/hive_2012-02-28_15-56-37_188_1216173472421796708/-mr-10000/1
2012-02-28 15:56:37,970 WARN  conf.Configuration (Configuration.java:handleDeprecation(326))
- mapred.min.split.size is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize
2012-02-28 15:56:37,970 WARN  conf.Configuration (Configuration.java:handleDeprecation(326))
- mapred.min.split.size.per.node is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize.per.node
2012-02-28 15:56:37,971 WARN  conf.Configuration (Configuration.java:handleDeprecation(326))
- mapred.min.split.size.per.rack is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize.per.rack
2012-02-28 15:56:37,971 WARN  conf.Configuration (Configuration.java:handleDeprecation(326))
- mapred.max.split.size is deprecated. Instead, use mapreduce.input.fileinputformat.split.maxsize
2012-02-28 15:56:37,977 INFO  input.FileInputFormat (FileInputFormat.java:listStatus(245))
- Total input paths to process : 1
2012-02-28 15:56:37,982 INFO  io.CombineHiveInputFormat (CombineHiveInputFormat.java:getSplits(388))
- Arrays.asList iss
2012-02-28 15:56:37,982 INFO  io.CombineHiveInputFormat (CombineHiveInputFormat.java:getSplits(410))
- iss size: 0
2012-02-28 15:56:37,983 INFO  io.CombineHiveInputFormat (CombineHiveInputFormat.java:getSplits(417))
- number of splits 0

And, in MR1, the log looks like:

2012-02-28 14:09:54,554 INFO  exec.ExecDriver (ExecDriver.java:addInputPath(829)) - Changed
input file to file:/tmp/cloudera/hive_2012-02-28_14-09-54_515_1377575814725676804/-mr-10000/1
2012-02-28 14:09:54,855 INFO  jvm.JvmMetrics (JvmMetrics.java:init(71)) - Initializing JVM
Metrics with processName=JobTracker, sessionId=
2012-02-28 14:09:54,871 INFO  exec.ExecDriver (ExecDriver.java:createTmpDirs(234)) - Making
Temp Directory: file:/tmp/cloudera/hive_2012-02-28_14-09-44_700_3241431154033268523/-mr-10003
2012-02-28 14:09:54,881 WARN  mapred.JobClient (JobClient.java:configureCommandLineOptions(539))
- Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for
the same.
2012-02-28 14:09:55,037 INFO  io.CombineHiveInputFormat (CombineHiveInputFormat.java:getSplits(370))
- CombineHiveInputSplit creating pool for file:/tmp/cloudera/hive_2012-02-28_14-09-54_515_1377575814725676804/-mr-10000/1;
using filter path file:/tmp/cloudera/hive_2012-02-28_14-09-54_515_1377575814725676804/-mr-10000/1
2012-02-28 14:09:55,042 INFO  mapred.FileInputFormat (FileInputFormat.java:listStatus(192))
- Total input paths to process : 1
2012-02-28 14:09:55,056 INFO  io.CombineHiveInputFormat (CombineHiveInputFormat.java:getSplits(406))
- iss size: 1
2012-02-28 14:09:55,057 INFO  io.CombineHiveInputFormat (CombineHiveInputFormat.java:getSplits(409))
- adding inputSplitShim into result: Paths:/tmp/cloudera/hive_2012-02-28_14-09-54_515_1377575814725676804/-mr-10000/1/emptyFile:0+0
Locations:/default-rack:; InputFormatClass: org.apache.hadoop.mapred.TextInputFormat

2012-02-28 14:09:55,057 INFO  io.CombineHiveInputFormat (CombineHiveInputFormat.java:getSplits(413))
- number of splits 1

So in MR1, submitting a job whose input is an empty file yields split length == 1, while in MR2,
submitting the same job yields split length == 0.
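To make the difference concrete, here is a minimal sketch of how calling code could guard against an empty split list. It is only an illustration under stated assumptions: `EmptySplitGuard`, `Split` and `ensureNonEmpty` are hypothetical names, no Hadoop or Hive classes are used, and this is not the fix adopted in either project. It just models the MR1-style outcome (one zero-length split for an empty file) when the input format returns nothing.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class EmptySplitGuard {

    /** Hypothetical stand-in for a file split: path, start offset, length. */
    static final class Split {
        final String path;
        final long start;
        final long length;

        Split(String path, long start, long length) {
            this.path = path;
            this.start = start;
            this.length = length;
        }

        @Override
        public String toString() {
            // Mirrors the "path:start+length" form seen in the MR1 log above.
            return path + ":" + start + "+" + length;
        }
    }

    /**
     * Returns the splits unchanged if there are any; otherwise builds one
     * zero-length split per input path, which is what the MR1 log shows
     * for an empty file.
     */
    static List<Split> ensureNonEmpty(List<Split> splits, List<String> inputPaths) {
        if (!splits.isEmpty()) {
            return splits;
        }
        List<Split> fallback = new ArrayList<>();
        for (String path : inputPaths) {
            fallback.add(new Split(path, 0L, 0L));
        }
        return fallback;
    }

    public static void main(String[] args) {
        // Assumed example path; the real path in the log is a Hive temp directory.
        List<String> paths = Collections.singletonList("/tmp/example/emptyFile");

        // MR2-style result for an empty file: no splits at all.
        List<Split> mr2Style = new ArrayList<>();

        List<Split> guarded = ensureNonEmpty(mr2Style, paths);
        System.out.println("number of splits " + guarded.size());
        System.out.println(guarded.get(0));
    }
}
```

Whether a guard like this belongs in the caller or in the input format itself is exactly the open question in this issue; the sketch only shows what "split length == 1 for an empty file" means in practice.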

This case arises in Hive (https://issues.apache.org/jira/browse/HIVE-2783) when running the
following query:

select * from
(
select key, value, ds from t1_new
union all
select key, value, t1_old.ds from t1_old join t1_mapping
on t1_old.keymap = t1_mapping.keymap and
   t1_old.ds = t1_mapping.ds
) subq
where ds = '2011-10-13';

The second MR job then tries to execute:

select key, value, ds from t1_new

which has an empty input file in the submitted job.

My understanding might be wrong. Please correct me if anything here is mistaken.

Thanks,
Zhenxiao


                
> In MR2, when Total input paths to process == 1, CombinefileInputFormat.getSplits() returns 0 split.
> ---------------------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3952
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3952
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>            Reporter: Zhenxiao Luo
>
> Hive gets unexpected results when using MR2 (with MR1, it always gets the expected results).
> In MR2, when Total input paths to process == 1, CombineFileInputFormat.getSplits() returns 0 splits.
> The calling code in Hive, in Hadoop23Shims.java:
> InputSplit[] splits = super.getSplits(job, numSplits);
> This yields splits.length == 0.
> In MR1, everything works; the calling code in Hive, in Hadoop20Shims.java:
> CombineFileSplit[] splits = (CombineFileSplit[]) super.getSplits(job, numSplits);
> This yields splits.length == 1.


        
