hive-dev mailing list archives

From "Eugene Koifman (JIRA)" <>
Subject [jira] [Created] (HIVE-11683) Hive Streaming may overload the metastore
Date Fri, 28 Aug 2015 18:58:46 GMT
Eugene Koifman created HIVE-11683:

             Summary: Hive Streaming may overload the metastore
                 Key: HIVE-11683
             Project: Hive
          Issue Type: Bug
          Components: HCatalog, Hive, Transactions
    Affects Versions: 1.0.0
            Reporter: Eugene Koifman
            Assignee: Roshan Naik

HiveEndPoint represents a way to write to a specific partition transactionally.
Each HiveEndPoint creates TransactionBatch(es) and commits transactions.

Suppose you have 10 instances of the Storm Hive bolt using the Streaming API.
Each instance will create HiveEndPoints on demand when it sees an event for a particular partition.

If events are uniformly distributed with respect to partition values and the table has 1000 partitions
(for example, it's partitioned by CustomerId), each of the 10 bolt instances may create 1000 HiveEndPoints,
for 10,000 endpoints in total and thus > 10,000 (actually 10K * num_txn_per_batch) concurrent transactions.
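
To make the arithmetic above concrete, here is a minimal sketch of the worst-case count. The class and method names are hypothetical, and the batch size of 100 is an assumed value for illustration only; the report does not state what num_txn_per_batch is in practice.

```java
public class EndPointEstimate {
    // Worst case: every bolt instance sees events for every partition,
    // so each instance opens one HiveEndPoint per partition.
    public static long endPoints(int boltInstances, int partitions) {
        return (long) boltInstances * partitions;
    }

    // Each endpoint is assumed to hold one open TransactionBatch
    // containing txnsPerBatch transactions.
    public static long openTransactions(int boltInstances, int partitions, int txnsPerBatch) {
        return endPoints(boltInstances, partitions) * txnsPerBatch;
    }

    public static void main(String[] args) {
        int txnsPerBatch = 100; // assumed value, not taken from the report
        System.out.println("endpoints: " + endPoints(10, 1000));
        System.out.println("open transactions: " + openTransactions(10, 1000, txnsPerBatch));
    }
}
```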

This creates a huge amount of Metastore traffic.

HIVE-11672 is investigating how some sort of "shuffle" phase can be added to route events for
a particular bucket to the same bolt instance.

The same idea should be explored to route events based on partition value.
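
As a rough illustration of routing by partition value, the sketch below hashes each partition value to a single bolt instance and counts the endpoints that would result. It is not the design proposed in HIVE-11672; the class, the method names, and the "customer-N" partition values are all hypothetical, and plain hash-modulo routing is just one possible choice.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class PartitionRouter {
    // Hypothetical helper: choose the bolt instance that owns a partition value.
    // floorMod keeps the index non-negative even when hashCode() is negative.
    public static int instanceFor(String partitionValue, int numInstances) {
        return Math.floorMod(partitionValue.hashCode(), numInstances);
    }

    // Count the endpoints opened across all instances when every event is
    // routed by its partition value before it reaches a bolt.
    public static int totalEndPoints(int numPartitions, int numInstances) {
        Map<Integer, Set<String>> partitionsPerInstance = new HashMap<>();
        for (int p = 0; p < numPartitions; p++) {
            String partitionValue = "customer-" + p; // made-up partition values
            int instance = instanceFor(partitionValue, numInstances);
            partitionsPerInstance
                .computeIfAbsent(instance, k -> new HashSet<>())
                .add(partitionValue);
        }
        int total = 0;
        for (Set<String> owned : partitionsPerInstance.values()) {
            total += owned.size();
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println("endpoints with routing: " + totalEndPoints(1000, 10));
    }
}
```

Because each partition value maps to exactly one instance, the total number of endpoints in this sketch equals the number of partitions rather than partitions times instances. How evenly the partitions spread across instances depends on the hash function and the actual partition values.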

cc [~alangates],[~sriharsha],[~rbains]

This message was sent by Atlassian JIRA
