hive-dev mailing list archives

From Rajesh Balamohan <>
Subject Re: Review Request 50888: Reduce number of partition check calls in add_partitions
Date Wed, 10 Aug 2016 10:50:57 GMT

(Updated Aug. 10, 2016, 10:50 a.m.)

Review request for hive and Ashutosh Chauhan.


In some corner cases, a partition can contain multiple nested directories
(e.g. table/ii=1/jj=15/q=10/r=20/s=30/000000_0 and table/ii=1/jj=15/q=11/r=22/s=33/000000_0, where
ii and jj are the only partition columns).
{{HiveMetastoreChecker.getPartitionName}} ends up resolving the partition names as "ii=1/jj=15/q=11/r=22/s=33"
and "ii=1/jj=15/q=10/r=20/s=30".
When msck is run, this results in a duplicate-partition exception for ii=1, jj=15
in the metastore (MS). msck then falls back to {{msckAddPartitionsOneByOne}}, which
tries to repair one partition at a time and ignores any exceptions. The job essentially completes,
but it makes many calls to MS and can be very slow. I will attach the latest patch
to RB.
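
To illustrate the corner case above, here is a minimal sketch of how the two nested
paths collapse to a single partition once only the partition columns are kept. This is
not the code from the patch or from HiveMetastoreChecker; the class and method names
(PartitionNameSketch, partitionName, distinctPartitions) are hypothetical, and paths are
assumed to be given relative to the table directory.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; not Hive's actual implementation.
public class PartitionNameSketch {

    // Keep only the leading key=value directories, one per partition column,
    // and drop any deeper nested directories and the file name.
    static String partitionName(String relativePath, int partitionColumnCount) {
        List<String> kept = new ArrayList<>();
        for (String part : relativePath.split("/")) {
            if (kept.size() == partitionColumnCount) {
                break;
            }
            if (part.contains("=")) {
                kept.add(part);
            }
        }
        return String.join("/", kept);
    }

    // Collect distinct partition names so each partition is reported once.
    static Set<String> distinctPartitions(List<String> relativePaths, int partitionColumnCount) {
        Set<String> names = new LinkedHashSet<>();
        for (String path : relativePaths) {
            names.add(partitionName(path, partitionColumnCount));
        }
        return names;
    }

    public static void main(String[] args) {
        List<String> files = Arrays.asList(
            "ii=1/jj=15/q=10/r=20/s=30/000000_0",
            "ii=1/jj=15/q=11/r=22/s=33/000000_0");
        // ii and jj are the only partition columns, so two columns are kept.
        System.out.println(distinctPartitions(files, 2)); // prints [ii=1/jj=15]
    }
}
```

With the full nested paths as names, the same two files yield two distinct names that
both map to ii=1, jj=15, which is what triggers the duplicate-partition exception.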

Without Patch:
msck runtime for 10000 partitions on a small cluster: *370 seconds*

With Patch:
msck runtime for 10000 partitions on a small cluster: *62 seconds*

Bugs: HIVE-14462

Repository: hive-git


The metastore already does all the validations. Many MS calls are made just before add_partitions
to double-check whether the partitions exist.  This hurts performance when a large number of partitions
are present.
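
As a rough illustration of why the extra checks are costly, the sketch below counts
round trips against an in-memory stand-in for the metastore. Everything here is an
assumption made for illustration: FakeMetastore, partitionExists, checkThenAdd and
addDirectly are hypothetical names, not Hive or metastore APIs, and the real patch may
be structured differently.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; not Hive's actual implementation.
public class MetastoreCallCountSketch {

    // In-memory stand-in for the metastore that counts every call made to it.
    static class FakeMetastore {
        final Set<String> partitions = new HashSet<>();
        int calls = 0;

        boolean partitionExists(String name) {
            calls++;
            return partitions.contains(name);
        }

        // The batched add validates existing partitions itself, so a
        // separate existence check beforehand is redundant.
        void addPartitions(List<String> names) {
            calls++;
            for (String name : names) {
                if (partitions.contains(name)) {
                    throw new IllegalStateException("Partition already exists: " + name);
                }
            }
            partitions.addAll(names);
        }
    }

    // One existence check per partition, followed by the batched add.
    static int checkThenAdd(FakeMetastore ms, List<String> names) {
        for (String name : names) {
            if (ms.partitionExists(name)) {
                throw new IllegalStateException("Partition already exists: " + name);
            }
        }
        ms.addPartitions(names);
        return ms.calls;
    }

    // Rely on the validation inside the batched add.
    static int addDirectly(FakeMetastore ms, List<String> names) {
        ms.addPartitions(names);
        return ms.calls;
    }

    public static void main(String[] args) {
        List<String> names = new ArrayList<>();
        for (int i = 0; i < 10000; i++) {
            names.add("ii=1/jj=" + i);
        }
        System.out.println(checkThenAdd(new FakeMetastore(), names)); // prints 10001
        System.out.println(addDirectly(new FakeMetastore(), names));  // prints 1
    }
}
```

The call count of the check-then-add path grows with the number of partitions, while
the direct path stays at one call per batch.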

Diffs (updated)

  metastore/src/java/org/apache/hadoop/hive/metastore/ 38c0eed 
  ql/src/java/org/apache/hadoop/hive/ql/exec/ a59b781 
  ql/src/java/org/apache/hadoop/hive/ql/metadata/ ec9deeb 
  ql/src/java/org/apache/hadoop/hive/ql/metadata/ a164b12 
  ql/src/test/org/apache/hadoop/hive/ql/metadata/ 5b8ec60 




Rajesh Balamohan
