hadoop-hive-dev mailing list archives

From "He Yongqiang (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HIVE-1610) Using CombinedHiveInputFormat causes partToPartitionInfo IOException
Date Wed, 08 Sep 2010 02:04:33 GMT

    [ https://issues.apache.org/jira/browse/HIVE-1610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12907058#action_12907058 ]

He Yongqiang commented on HIVE-1610:
------------------------------------

Sammy, there are mainly two problems:
1) going over the whole map is not efficient, and 2) using startsWith to do the prefix match
is a bug (it came in with the HIVE-1510 change).
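
To see why the prefix match misbehaves, a minimal illustration (not Hive's actual code; the paths here are made up):

    // A bare String prefix check ignores URI structure entirely.
    String dir = "hdfs://host/tmp/hive/-mr-10002/000000_0";
    String key = "hdfs://host:8020/tmp/hive/-mr-10002";
    boolean a = dir.startsWith(key);                    // false: the ":8020" breaks the match
    boolean b = "/tmp/foo-bar".startsWith("/tmp/foo");  // true: crosses a path-segment boundary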

Sammy, can you change the logic as follows:

Right now, Hive generates another pathToPartitionInfo map by removing the scheme information
from each path, and puts it in a cache map.
We can keep the same logic, but change the new pathToPartitionInfo map's value to be an array
of PartitionDesc.
Then we can remove the scheme check, and once we get a match, we go through the array of
PartitionDesc to find the best one.

This also solves another problem: if there are two PartitionDescs whose path parts are the
same but whose schemes differ, today only one of them ends up in the new pathToPartitionInfo
map. The construction could look like the sketch below.
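
A rough sketch (PartitionDesc and pathToPartitionInfo are Hive's names; everything else here is made up for illustration):

    import java.net.URI;
    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    import org.apache.hadoop.hive.ql.plan.PartitionDesc;

    public class PathToPartitionCache {

      // Keep only the path component, dropping scheme and authority
      // (e.g. "hdfs://host:8020/a/b" -> "/a/b").
      static String stripSchemeAndAuthority(String path) {
        return URI.create(path).getPath();
      }

      // Same keys as before (scheme removed), but each key now maps to ALL
      // PartitionDescs that share that path, so two partitions differing
      // only in scheme/authority are no longer collapsed into one entry.
      static Map<String, List<PartitionDesc>> buildCacheMap(
          Map<String, PartitionDesc> pathToPartitionInfo) {
        Map<String, List<PartitionDesc>> cache =
            new HashMap<String, List<PartitionDesc>>();
        for (Map.Entry<String, PartitionDesc> e : pathToPartitionInfo.entrySet()) {
          String key = stripSchemeAndAuthority(e.getKey());
          List<PartitionDesc> descs = cache.get(key);
          if (descs == null) {
            descs = new ArrayList<PartitionDesc>();
            cache.put(key, descs);
          }
          descs.add(e.getValue());
        }
        return cache;
      }
    }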

About how to go through the array of PartitionDesc to find the best one (a sketch follows
the rules below):
if the array contains only one element, return array.get(0);
1) if the original input does not carry any scheme information: if the array contains more
than one element, report an error.
2) if the original input carries scheme information: a) if the array contains an exact match
(same scheme, host, and port as the input), return the exact match; b) otherwise ignore the
port but keep the scheme and host, and go through the array again.
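
In code, roughly (again only a sketch: the Candidate pairing of each PartitionDesc with the full URI it was registered under is a made-up helper, not part of Hive):

    import java.io.IOException;
    import java.net.URI;
    import java.util.List;

    import org.apache.hadoop.hive.ql.plan.PartitionDesc;

    class BestMatch {
      static class Candidate {
        final URI fullPath;          // original map key, scheme/authority intact
        final PartitionDesc desc;
        Candidate(URI fullPath, PartitionDesc desc) {
          this.fullPath = fullPath;
          this.desc = desc;
        }
      }

      static boolean eq(Object a, Object b) {
        return a == null ? b == null : a.equals(b);
      }

      static PartitionDesc findBest(URI input, List<Candidate> matches)
          throws IOException {
        if (matches.size() == 1) {
          return matches.get(0).desc;        // single element: return it directly
        }
        if (input.getScheme() == null) {
          // 1) no scheme on the input and more than one match: ambiguous
          throw new IOException("ambiguous path " + input + ": "
              + matches.size() + " partitions match");
        }
        // 2a) exact match on scheme, host and port
        for (Candidate c : matches) {
          URI u = c.fullPath;
          if (input.getScheme().equals(u.getScheme())
              && eq(input.getHost(), u.getHost())
              && input.getPort() == u.getPort()) {
            return c.desc;
          }
        }
        // 2b) ignore the port, but keep scheme and host
        for (Candidate c : matches) {
          URI u = c.fullPath;
          if (input.getScheme().equals(u.getScheme())
              && eq(input.getHost(), u.getHost())) {
            return c.desc;
          }
        }
        throw new IOException("cannot find a partition for " + input);
      }
    }

With the stack trace below, the -mr-10002 input carries no port, so it would fail rule 2a against the :8020 entry but still match under rule 2b.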


What do you think?

> Using CombinedHiveInputFormat causes partToPartitionInfo IOException  
> ----------------------------------------------------------------------
>
>                 Key: HIVE-1610
>                 URL: https://issues.apache.org/jira/browse/HIVE-1610
>             Project: Hadoop Hive
>          Issue Type: Bug
>         Environment: Hadoop 0.20.2
>            Reporter: Sammy Yu
>         Attachments: 0002-HIVE-1610.-Added-additional-schema-check-to-doGetPar.patch,
>                      0003-HIVE-1610.patch, 0004-hive.patch
>
>
> I have a relatively complicated hive query using CombinedHiveInputFormat:
> set hive.exec.dynamic.partition.mode=nonstrict;
> set hive.exec.dynamic.partition=true; 
> set hive.exec.max.dynamic.partitions=1000;
> set hive.exec.max.dynamic.partitions.pernode=300;
> set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
> INSERT OVERWRITE TABLE keyword_serp_results_no_dups PARTITION(week)
> select distinct keywords.keyword, keywords.domain, keywords.url, keywords.rank,
>     keywords.universal_rank, keywords.serp_type, keywords.date_indexed,
>     keywords.search_engine_type, keywords.week
> from keyword_serp_results keywords
> JOIN (
>     select domain, keyword, search_engine_type, week, max_date_indexed, min(rank) as best_rank
>     from (
>         select keywords1.domain, keywords1.keyword, keywords1.search_engine_type,
>             keywords1.week, keywords1.rank, dupkeywords1.max_date_indexed
>         from keyword_serp_results keywords1
>         JOIN (select domain, keyword, search_engine_type, week,
>               max(date_indexed) as max_date_indexed
>               from keyword_serp_results
>               group by domain, keyword, search_engine_type, week) dupkeywords1
>         on keywords1.keyword = dupkeywords1.keyword AND keywords1.domain = dupkeywords1.domain
>             AND keywords1.search_engine_type = dupkeywords1.search_engine_type
>             AND keywords1.week = dupkeywords1.week
>             AND keywords1.date_indexed = dupkeywords1.max_date_indexed
>     ) dupkeywords2
>     group by domain, keyword, search_engine_type, week, max_date_indexed
> ) dupkeywords3
> on keywords.keyword = dupkeywords3.keyword AND keywords.domain = dupkeywords3.domain
>     AND keywords.search_engine_type = dupkeywords3.search_engine_type
>     AND keywords.week = dupkeywords3.week
>     AND keywords.date_indexed = dupkeywords3.max_date_indexed
>     AND keywords.rank = dupkeywords3.best_rank;
>  
> This query used to work fine until I updated to r991183 on trunk and started getting this error:
> java.io.IOException: cannot find dir = hdfs://ec2-75-101-174-245.compute-1.amazonaws.com/tmp/hive-root/hive_2010-09-01_10-57-41_396_1409145025949924904/-mr-10002/000000_0 in partToPartitionInfo: [
>     hdfs://ec2-75-101-174-245.compute-1.amazonaws.com:8020/tmp/hive-root/hive_2010-09-01_10-57-41_396_1409145025949924904/-mr-10002,
>     hdfs://ec2-75-101-174-245.compute-1.amazonaws.com/user/root/domain_keywords/account=417/week=201035/day=20100829,
>     hdfs://ec2-75-101-174-245.compute-1.amazonaws.com/user/root/domain_keywords/account=418/week=201035/day=20100829,
>     hdfs://ec2-75-101-174-245.compute-1.amazonaws.com/user/root/domain_keywords/account=419/week=201035/day=20100829,
>     hdfs://ec2-75-101-174-245.compute-1.amazonaws.com/user/root/domain_keywords/account=422/week=201035/day=20100829,
>     hdfs://ec2-75-101-174-245.compute-1.amazonaws.com/user/root/domain_keywords/account=422/week=201035/day=20100831]
> at org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getPartitionDescFromPathRecursively(HiveFileFormatUtils.java:277)
> at org.apache.hadoop.hive.ql.io.CombineHiveInputFormat$CombineHiveInputSplit.<init>(CombineHiveInputFormat.java:100)
> at org.apache.hadoop.hive.ql.io.CombineHiveInputFormat.getSplits(CombineHiveInputFormat.java:312)
> at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
> at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
> at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
> at org.apache.hadoop.hive.ql.exec.ExecDriver.execute(ExecDriver.java:610)
> at org.apache.hadoop.hive.ql.exec.MapRedTask.execute(MapRedTask.java:120)
> at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:108)
> This query works if I don't change hive.input.format, i.e. if I leave out this line:
> set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
> I've narrowed this issue down to the commit for HIVE-1510. If I take out the changeset
> from r987746, everything works as before.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

