hive-issues mailing list archives

From "Eugene Koifman (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HIVE-11672) Hive Streaming API handles bucketing incorrectly
Date Fri, 28 Aug 2015 16:36:46 GMT

     [ https://issues.apache.org/jira/browse/HIVE-11672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Eugene Koifman updated HIVE-11672:
----------------------------------
    Fix Version/s:     (was: 1.2.2)

> Hive Streaming API handles bucketing incorrectly
> ------------------------------------------------
>
>                 Key: HIVE-11672
>                 URL: https://issues.apache.org/jira/browse/HIVE-11672
>             Project: Hive
>          Issue Type: Bug
>          Components: HCatalog, Hive, Transactions
>    Affects Versions: 1.2.1
>            Reporter: Raj Bains
>            Assignee: Roshan Naik
>            Priority: Critical
>
> Hive Streaming API allows clients to get a random bucket and then insert data into
> it. However, this leads to incorrect bucketing, because Hive expects data to be
> distributed into buckets based on a hash function applied to the bucket key. Right now
> clients insert data randomly. They have no way of:
> # Knowing what bucket a row (tuple) belongs to
> # Asking for a specific bucket
> There are optimizations, such as Sort Merge Join and Bucket Map Join, that rely on the
> data being correctly distributed across buckets; these will return incorrect read results
> if the data is not distributed correctly.
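
To illustrate why misplaced rows break bucket-based joins, here is a minimal Python sketch. It assumes a row's bucket is a non-negative hash of the bucket key modulo the number of buckets. The names `toy_hash`, `bucket_for`, `place_by_hash`, and `bucketwise_join` are hypothetical, and `toy_hash` is only a stand-in, not Hive's actual hash function.

```python
def toy_hash(key):
    """Hypothetical stand-in hash (31x string hash); NOT Hive's real hash."""
    h = 0
    for ch in key:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    return h


def bucket_for(key, num_buckets):
    """Assumed scheme: non-negative hash of the bucket key, modulo bucket count."""
    return (toy_hash(key) & 0x7FFFFFFF) % num_buckets


def place_by_hash(keys, num_buckets):
    """Distribute keys into buckets the way a bucket-aware reader expects."""
    buckets = {}
    for key in keys:
        buckets.setdefault(bucket_for(key, num_buckets), []).append(key)
    return buckets


def bucketwise_join(left, right):
    """Match keys only within the same bucket id, as a bucket-based join would."""
    matches = []
    for bucket_id, keys in left.items():
        right_keys = set(right.get(bucket_id, []))
        matches.extend(k for k in keys if k in right_keys)
    return sorted(matches)


keys = ["a", "b", "c", "d"]
good = place_by_hash(keys, 4)
# Simulate a client that wrote every row into the "wrong" bucket.
misplaced = {(b + 1) % 4: ks for b, ks in good.items()}
```

With both sides placed by hash, `bucketwise_join(good, good)` finds every key. With one side misplaced, the same join silently drops matches, which is the kind of incorrect read result described above.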
> There are two obvious design choices:
> # Hive Streaming API should fix this internally by distributing the data correctly
> # Hive Streaming API should expose the data distribution scheme to the clients and allow them to distribute the data correctly
> The first option means every client thread will write to many buckets, producing many
> small files in each bucket and too many open connections. This does not seem feasible. The
> second option pushes more functionality into the client of the Hive Streaming API, but can
> maintain high throughput and write good-sized ORC files. This option seems preferable.
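
As a rough sketch of the second option, a client could group rows by bucket before writing, so that each writer only ever touches one bucket. This is an illustration in Python, not the Hive Streaming API: `route_rows`, `key_of`, and `hash_of` are hypothetical names, and the hash-modulo scheme is an assumption about what the API would expose.

```python
from collections import defaultdict


def bucket_id(key_hash, num_buckets):
    """Assumed scheme: non-negative hash modulo the number of buckets."""
    return (key_hash & 0x7FFFFFFF) % num_buckets


def route_rows(rows, key_of, hash_of, num_buckets):
    """Group rows into per-bucket batches on the client side.

    key_of  -- extracts the bucket key from a row (hypothetical callback)
    hash_of -- the hash the server would expose to clients (hypothetical)
    """
    batches = defaultdict(list)
    for row in rows:
        batches[bucket_id(hash_of(key_of(row)), num_buckets)].append(row)
    return dict(batches)


# Example: integer keys hashed to themselves, four buckets.
rows = [(1, "x"), (5, "y"), (2, "z")]
batches = route_rows(rows, key_of=lambda r: r[0], hash_of=lambda k: k, num_buckets=4)
```

Each batch can then be handed to the writer that owns that bucket, which keeps one connection per bucket per writer and lets files grow to a reasonable size.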



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
