Date: Tue, 9 Aug 2016 20:54:20 +0000 (UTC)
From: "Saket Saurabh (JIRA)"
To: issues@hive.apache.org
Reply-To: dev@hive.apache.org
Subject: [jira] [Updated] (HIVE-14233) Improve vectorization for ACID by eliminating row-by-row stitching

     [ https://issues.apache.org/jira/browse/HIVE-14233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Saket Saurabh updated HIVE-14233:
---------------------------------
    Attachment: HIVE-14233.06.patch

This patch disallows VectorizedRowBatchReader creation on original files.

> Improve vectorization for ACID by eliminating row-by-row stitching
> ------------------------------------------------------------------
>
>                 Key: HIVE-14233
>                 URL: https://issues.apache.org/jira/browse/HIVE-14233
>             Project: Hive
>          Issue Type: New Feature
>          Components: Transactions, Vectorization
>            Reporter: Saket Saurabh
>            Assignee: Saket Saurabh
>         Attachments: HIVE-14233.01.patch, HIVE-14233.02.patch, HIVE-14233.03.patch, HIVE-14233.04.patch, HIVE-14233.05.patch, HIVE-14233.06.patch
>
>
> This JIRA proposes to improve vectorization for ACID by eliminating row-by-row stitching when reading back ACID files. In the current implementation, a vectorized row batch is populated one row at a time before it is passed up the operator pipeline. Row-by-row stitching was necessary because the ACID insert/update/delete events from the various delta files had to be merged before the current version of a given row could be determined. HIVE-14035 removed that limitation by splitting each ACID update event into a delete followed by an insert, which also makes it possible to create splits on delta files.
> Building on HIVE-14035, this JIRA proposes to remove this bottleneck in the vectorized code path for ACID by reading row batches directly from the underlying ORC files and avoiding stitching altogether. Once a row batch has been read from the split (which may be on a base or delta file), deleted rows are identified by cross-referencing the batch against a data structure that tracks only delete events (found in the delete_delta files). This is expected to yield a large performance gain when reading ACID files in vectorized fashion, and it enables further optimizations to be built on top.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
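
The delete-filtering step described in the issue (read a whole batch, then cross-reference it against a structure of delete events) can be illustrated with a minimal, self-contained sketch. Everything here is an assumption made for illustration: the class and method names (DeleteFilterSketch, RowKey, Batch, applyDeletes) are hypothetical and are not taken from the HIVE-14233 patches, the row key is assumed to be the triple (original transaction, bucket, row id), and the delete events are assumed to fit in an in-memory hash set. The batch is loosely modeled on a vectorized row batch with a selection vector, so deleted rows are skipped without copying any column data.

```java
import java.util.HashSet;
import java.util.Objects;
import java.util.Set;

// Hypothetical, simplified model; none of these names come from the actual patch.
final class DeleteFilterSketch {

    /** Assumed row identity: (original transaction, bucket, row id). */
    static final class RowKey {
        final long originalTxn;
        final int bucket;
        final long rowId;

        RowKey(long originalTxn, int bucket, long rowId) {
            this.originalTxn = originalTxn;
            this.bucket = bucket;
            this.rowId = rowId;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof RowKey)) {
                return false;
            }
            RowKey k = (RowKey) o;
            return originalTxn == k.originalTxn && bucket == k.bucket && rowId == k.rowId;
        }

        @Override
        public int hashCode() {
            return Objects.hash(originalTxn, bucket, rowId);
        }
    }

    /** Column-oriented batch holding only the key columns, plus a selection vector. */
    static final class Batch {
        final long[] originalTxn;
        final int[] bucket;
        final long[] rowId;
        final int[] selected; // indexes of rows that are still live
        int size;             // number of valid entries in 'selected'

        Batch(long[] originalTxn, int[] bucket, long[] rowId) {
            this.originalTxn = originalTxn;
            this.bucket = bucket;
            this.rowId = rowId;
            this.size = rowId.length;
            this.selected = new int[size];
            for (int i = 0; i < size; i++) {
                selected[i] = i;
            }
        }
    }

    /** Compacts the selection vector in place so that it skips deleted rows. */
    static void applyDeletes(Batch batch, Set<RowKey> deleted) {
        int live = 0;
        for (int j = 0; j < batch.size; j++) {
            int i = batch.selected[j];
            RowKey key = new RowKey(batch.originalTxn[i], batch.bucket[i], batch.rowId[i]);
            if (!deleted.contains(key)) {
                batch.selected[live++] = i;
            }
        }
        batch.size = live;
    }

    public static void main(String[] args) {
        Batch batch = new Batch(
            new long[] {1, 1, 1, 2}, new int[] {0, 0, 0, 0}, new long[] {0, 1, 2, 0});
        Set<RowKey> deleted = new HashSet<>();
        deleted.add(new RowKey(1, 0, 1)); // one delete event, matching the row at index 1
        applyDeletes(batch, deleted);
        for (int j = 0; j < batch.size; j++) {
            System.out.println(batch.selected[j]); // prints 0, 2, 3 on separate lines
        }
    }
}
```

The sketch allocates one key object per row for readability; a real reader would more plausibly compare the primitive key columns directly, and might need a more compact structure than a hash set when the number of delete events is large.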