Date: Fri, 17 Jun 2016 20:08:05 +0000 (UTC)
From: "Prasanth Jayachandran (JIRA)"
To: issues@hive.apache.org
Reply-To: dev@hive.apache.org
Subject: [jira] [Commented] (HIVE-13985) ORC improvements for reducing the file system calls in task side

[ https://issues.apache.org/jira/browse/HIVE-13985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15336825#comment-15336825 ]

Prasanth Jayachandran commented on HIVE-13985:
----------------------------------------------

OrcTail preserves the
entire serialized footer, from which it derives metadata lazily. The RB patch is for branch-1 only, which does not have to deal with the metastore cache. The patch for master is where I fixed the metastore cache test failure. As I said before, I am not going to commit to master until HIVE-14007. I have just uploaded the patch to kick off a pre-commit test run.

> ORC improvements for reducing the file system calls in task side
> ----------------------------------------------------------------
>
> Key: HIVE-13985
> URL: https://issues.apache.org/jira/browse/HIVE-13985
> Project: Hive
> Issue Type: Bug
> Components: ORC
> Affects Versions: 1.3.0, 2.2.0
> Reporter: Prasanth Jayachandran
> Assignee: Prasanth Jayachandran
> Attachments: HIVE-13985-branch-1.patch, HIVE-13985-branch-1.patch, HIVE-13985-branch-1.patch, HIVE-13985-branch-1.patch, HIVE-13985-branch-2.1.patch, HIVE-13985.1.patch, HIVE-13985.2.patch, HIVE-13985.3.patch, HIVE-13985.4.patch
>
>
> HIVE-13840 fixed some issues with additional file system invocations during split generation. Similarly, this jira will fix issues with additional file system invocations on the task side. To avoid reading footers on the task side, users can set hive.orc.splits.include.file.footer to true, which will serialize the ORC footers on the splits. But this has issues with serializing unwanted information, like column statistics and other metadata, which is not really required for reading an ORC split on the task side. We can reduce the payload on the ORC splits by serializing only the minimum required information (stripe information, types, compression details). This will decrease the payload on the ORC splits and can potentially avoid OOMs in the application master (AM) during split generation. This jira also addresses other issues concerning the AM cache. The local cache used by the AM is a soft reference cache. This can introduce unpredictability across multiple runs of the same query.
We can cache the serialized footer in the local cache and also use a strong reference cache, which should avoid memory pressure and will have better predictability.
> One other improvement that we can do: when hive.orc.splits.include.file.footer is set to false, on the task side we make one additional file system call to know the size of the file. If we can serialize the file length in the ORC split, this can be avoided.
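
The strong-reference cache of serialized footers proposed in the description could look roughly like the sketch below. This is only an illustration under stated assumptions, not the code in the attached patches: the class name SerializedFooterCache, its methods, and the bounded LRU eviction policy are all hypothetical (the description does not say how the cache is sized or evicted).

```java
import java.nio.ByteBuffer;
import java.util.LinkedHashMap;
import java.util.Map;

/**
 * Hypothetical sketch, not Hive's actual AM cache: a bounded LRU map that
 * holds serialized footers through strong references, so entries leave the
 * cache only by explicit eviction and never at the garbage collector's whim.
 */
final class SerializedFooterCache {
    private final Map<String, ByteBuffer> entries;

    SerializedFooterCache(final int maxEntries) {
        // accessOrder = true keeps the least recently used entry first,
        // which is the one removeEldestEntry is asked about.
        this.entries = new LinkedHashMap<String, ByteBuffer>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, ByteBuffer> eldest) {
                return size() > maxEntries;
            }
        };
    }

    /** Stores the serialized footer bytes for a file path (assumed cache key). */
    synchronized void put(String path, ByteBuffer serializedFooter) {
        entries.put(path, serializedFooter.asReadOnlyBuffer());
    }

    /** Returns an independent read-only view of the cached bytes, or null on a miss. */
    synchronized ByteBuffer get(String path) {
        ByteBuffer cached = entries.get(path);
        return cached == null ? null : cached.duplicate();
    }

    synchronized int size() {
        return entries.size();
    }
}
```

Because the map is bounded by entry count rather than by soft references, the same sequence of lookups evicts the same entries on every run, which is the predictability the description asks for. A production cache would more likely bound by total bytes than by entry count.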