hive-dev mailing list archives

From "Vihang Karajgaonkar (JIRA)" <>
Subject [jira] [Created] (HIVE-16014) HiveMetastoreChecker should use hive.metastore.fshandler.threads instead of HIVE_MOVE_FILES_THREAD_COUNT for pool size
Date Wed, 22 Feb 2017 22:53:44 GMT
Vihang Karajgaonkar created HIVE-16014:

             Summary: HiveMetastoreChecker should use hive.metastore.fshandler.threads instead of HIVE_MOVE_FILES_THREAD_COUNT for pool size
                 Key: HIVE-16014
             Project: Hive
          Issue Type: Improvement
            Reporter: Vihang Karajgaonkar
            Assignee: Vihang Karajgaonkar

HiveMetastoreChecker uses the {{HIVE_MOVE_FILES_THREAD_COUNT}} configuration value to determine
its pool size, as shown below:

private void checkPartitionDirs(Path basePath, Set<Path> allDirs, int maxDepth)
    throws IOException, HiveException {
    ConcurrentLinkedQueue<Path> basePaths = new ConcurrentLinkedQueue<>();
    Set<Path> dirSet = Collections.newSetFromMap(new ConcurrentHashMap<Path, Boolean>());
    // Here we just reuse the THREAD_COUNT configuration for HIVE_MOVE_FILES_THREAD_COUNT
    int poolSize = conf.getInt(ConfVars.HIVE_MOVE_FILES_THREAD_COUNT.varname, 15);

    // Check if too low config is provided for move files. 2x CPU is reasonable max count.
    poolSize = poolSize == 0 ? poolSize : Math.max(poolSize,
        Runtime.getRuntime().availableProcessors() * 2);
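
To make the sizing rule in the excerpt concrete, here is a minimal standalone sketch of the same arithmetic. The class name PoolSizeSketch and the method effectivePoolSize are hypothetical and not part of Hive; the CPU count is passed in as a parameter only so the rule can be tried without a Hive installation. Note that because the expression uses Math.max, twice the CPU count acts as a lower bound for any non-zero setting, even though the comment in the excerpt calls it a max count.

```java
// Hypothetical helper for illustration only; not part of Hive.
final class PoolSizeSketch {

    // Mirrors the expression in the excerpt: a configured value of 0 is
    // passed through unchanged, any other value is raised to at least
    // twice the number of CPUs.
    static int effectivePoolSize(int configured, int cpus) {
        return configured == 0 ? configured : Math.max(configured, cpus * 2);
    }

    public static void main(String[] args) {
        int cpus = Runtime.getRuntime().availableProcessors();
        for (int configured : new int[] {0, 15, 100}) {
            System.out.println("configured=" + configured
                + " cpus=" + cpus
                + " effective=" + effectivePoolSize(configured, cpus));
        }
    }
}
```

On a machine with 8 CPUs, for example, a configured value of 15 would be raised to 16, while 100 would be used as is.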

msck is commonly used to add partitions that exist on the filesystem but are missing from the
table's metadata. In that case, different pool sizes for HMSHandler and HiveMetastoreChecker can
hurt performance. For example, if {{hive.metastore.fshandler.threads}} is set to a lower value
such as 15 and the value read via {{HIVE_MOVE_FILES_THREAD_COUNT}} is much higher, such as 100
(or vice versa), the smaller pool becomes the bottleneck. It would be better to use
{{hive.metastore.fshandler.threads}} to size the pool for HiveMetastoreChecker, since the number
of missing partitions found and the number of partitions to be added will most likely be the
same. In that case the query should perform best when both pools are the same size.

Since both configs can be tuned individually, it is very likely that they will differ. But
because there is a strong correlation between the amount of work done by HiveMetastoreChecker
and by the HiveMetastore.add_partitions call, it might be a good idea to use
{{hive.metastore.fshandler.threads}} for the pool size instead of {{HIVE_MOVE_FILES_THREAD_COUNT}}.
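
As a rough sketch of the proposal, the checker could look up {{hive.metastore.fshandler.threads}} when sizing its pool. The example below uses java.util.Properties as a stand-in for the Hive configuration object so that it runs on its own; the class name CheckerPoolConfigSketch, the method checkerPoolSize, and the fallback default of 15 are assumptions made for illustration, not Hive's actual API or defaults. It keeps the existing floor of twice the CPU count unchanged.

```java
import java.util.Properties;

// Hypothetical sketch of the proposed lookup; not Hive's actual code.
final class CheckerPoolConfigSketch {

    // Property name taken from the issue text.
    static final String FS_HANDLER_THREADS = "hive.metastore.fshandler.threads";

    // Assumed fallback used only for this illustration.
    static final int ASSUMED_DEFAULT = 15;

    // Reads the metastore FS handler thread count and applies the same
    // floor as the existing checker code (0 is passed through unchanged).
    static int checkerPoolSize(Properties conf, int cpus) {
        String raw = conf.getProperty(FS_HANDLER_THREADS, String.valueOf(ASSUMED_DEFAULT));
        int poolSize = Integer.parseInt(raw.trim());
        return poolSize == 0 ? poolSize : Math.max(poolSize, cpus * 2);
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        conf.setProperty(FS_HANDLER_THREADS, "100");
        int cpus = Runtime.getRuntime().availableProcessors();
        System.out.println("checker pool size = " + checkerPoolSize(conf, cpus));
    }
}
```

With this lookup, tuning {{hive.metastore.fshandler.threads}} would move both pools together, except where the floor of twice the CPU count raises the checker's pool above the configured value.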

This message was sent by Atlassian JIRA