Date: Fri, 27 Feb 2015 01:35:05 +0000 (UTC)
From: "Sergey Shelukhin (JIRA)"
To: dev@hive.apache.org
Subject: [jira] [Created] (HIVE-9805) LLAP: consider specialized "transient" metadata cache

Sergey Shelukhin created HIVE-9805:
--------------------------------------
             Summary: LLAP: consider specialized "transient" metadata cache
                 Key: HIVE-9805
                 URL: https://issues.apache.org/jira/browse/HIVE-9805
             Project: Hive
          Issue Type: Sub-task
            Reporter: Sergey Shelukhin
             Fix For: llap

Due to the nature of the cache as it is now (metadata cache + disk cache), when data is read from ORC, a great deal of processing is still done with metadata, columns, streams, contexts, offsets, etc. to get at the data that is already in cache. Essentially, only the disk reads are eliminated; everything else proceeds as if we were reading an unknown file.
We could have a better metadata representation that is saved during the first read: for example, (file, stripe) -> DiskRange[] (including cache buffers that are not locked), plus a multi-dimensional array per column, per stream, per RG pointing at offsets in the DiskRange array. That way, if such a structure is found in cache, the reader can avoid all the recalculation and just do a dumb conversion into results to pass to the decoder, plus disk reads for any missing parts. This Java-object cache cannot participate in the main data eviction policy, so it should be small. With Java objects no cache locking is needed: we can evict while someone is still using the structure, and it will simply be GCed once unreferenced.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
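To make the proposed (file, stripe) -> DiskRange[] mapping concrete, here is a minimal sketch of what such a transient index might look like. All names (StripeKey, DiskRange, StripeIndex, TransientMetadataCache) are illustrative, not actual Hive/ORC classes, and a real version would need a concurrent, size-bounded map rather than a plain HashMap:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;

// Hypothetical sketch of the transient metadata cache from HIVE-9805.
class TransientMetadataCache {
  /** Cache key: file path plus stripe ordinal. */
  static final class StripeKey {
    final String file;
    final int stripe;
    StripeKey(String file, int stripe) { this.file = file; this.stripe = stripe; }
    @Override public boolean equals(Object o) {
      if (!(o instanceof StripeKey)) return false;
      StripeKey k = (StripeKey) o;
      return stripe == k.stripe && file.equals(k.file);
    }
    @Override public int hashCode() { return Objects.hash(file, stripe); }
  }

  /** A contiguous byte range; in LLAP this could reference an in-memory cache buffer. */
  static final class DiskRange {
    final long offset, length;
    DiskRange(long offset, long length) { this.offset = offset; this.length = length; }
  }

  /** Precomputed index saved on first read: ranges plus per-column/stream/RG pointers. */
  static final class StripeIndex {
    final List<DiskRange> ranges = new ArrayList<>();
    // offsets[column][stream][rowGroup] -> index into 'ranges'
    int[][][] offsets;
  }

  // Plain map for illustration only; no locking is needed for eviction because
  // an evicted StripeIndex still in use is kept alive by the GC until released.
  private final Map<StripeKey, StripeIndex> cache = new HashMap<>();

  StripeIndex get(String file, int stripe) { return cache.get(new StripeKey(file, stripe)); }
  void put(String file, int stripe, StripeIndex index) { cache.put(new StripeKey(file, stripe), index); }
}
```

On a cache hit, the reader would walk the precomputed offsets array straight to the decoder's inputs instead of re-deriving stream layouts from raw ORC metadata.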