Date: Fri, 17 Jun 2016 20:18:05 +0000 (UTC)
From: "Sergey Shelukhin (JIRA)"
To: issues@hive.apache.org
Reply-To: dev@hive.apache.org
Subject: [jira] [Commented] (HIVE-13985) ORC improvements for reducing the file system calls in task side

    [ https://issues.apache.org/jira/browse/HIVE-13985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15336848#comment-15336848 ]

Sergey Shelukhin commented on HIVE-13985:
-----------------------------------------

K +1 for branch-1...
there should probably be an RB for master

> ORC improvements for reducing the file system calls in task side
> ----------------------------------------------------------------
>
>                 Key: HIVE-13985
>                 URL: https://issues.apache.org/jira/browse/HIVE-13985
>             Project: Hive
>          Issue Type: Bug
>          Components: ORC
>    Affects Versions: 1.3.0, 2.2.0
>            Reporter: Prasanth Jayachandran
>            Assignee: Prasanth Jayachandran
>         Attachments: HIVE-13985-branch-1.patch, HIVE-13985-branch-1.patch, HIVE-13985-branch-1.patch, HIVE-13985-branch-1.patch, HIVE-13985-branch-2.1.patch, HIVE-13985.1.patch, HIVE-13985.2.patch, HIVE-13985.3.patch, HIVE-13985.4.patch
>
>
> HIVE-13840 fixed some issues with additional file system invocations during split generation. Similarly, this jira will fix issues with additional file system invocations on the task side. To avoid reading footers on the task side, users can set hive.orc.splits.include.file.footer to true, which will serialize the ORC footers on the splits. But this also serializes unwanted information, like column statistics and other metadata, that is not really required for reading an ORC split on the task side. We can reduce the payload on the ORC splits by serializing only the minimum required information (stripe information, types, compression details). This can potentially avoid OOMs in the application master (AM) during split generation. This jira also addresses other issues concerning the AM cache. The local cache used by the AM is a soft reference cache. This can introduce unpredictability across multiple runs of the same query. We can cache the serialized footer in the local cache and also use a strong reference cache, which should avoid memory pressure and will have better predictability.
> One other improvement applies when hive.orc.splits.include.file.footer is set to false: the task side then makes one additional file system call to learn the size of the file. If we serialize the file length in the ORC split, this call can be avoided.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
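
The last point in the quoted description, serializing the file length in the split so the task side can skip a file system call, can be illustrated with a minimal Java sketch. This is not Hive's OrcSplit or the contents of the attached patches: the class `SplitSketch`, its fields, the `UNKNOWN_LENGTH` sentinel, and the example path are all hypothetical, and a `LongSupplier` stands in for the real file system lookup.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.function.LongSupplier;

// Hypothetical sketch, not Hive's OrcSplit: a split that carries the file
// length so the task side can skip the file system call that looks it up.
final class SplitSketch {
    // Assumed sentinel meaning "the split generator did not record a length".
    static final long UNKNOWN_LENGTH = -1L;

    final String path;
    final long start;
    final long length;
    final long fileLength;

    SplitSketch(String path, long start, long length, long fileLength) {
        this.path = path;
        this.start = start;
        this.length = length;
        this.fileLength = fileLength;
    }

    // Split-generation side: write the file length next to the split bounds.
    byte[] toBytes() {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeUTF(path);
            out.writeLong(start);
            out.writeLong(length);
            out.writeLong(fileLength);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Task side: read the fields back in the order they were written.
    static SplitSketch fromBytes(byte[] bytes) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes))) {
            String path = in.readUTF();
            long start = in.readLong();
            long length = in.readLong();
            long fileLength = in.readLong();
            return new SplitSketch(path, start, length, fileLength);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Use the serialized length when present; call the (stand-in) file system
    // lookup only when the split does not carry one.
    long resolveFileLength(LongSupplier fileSystemLookup) {
        return fileLength >= 0 ? fileLength : fileSystemLookup.getAsLong();
    }

    public static void main(String[] args) {
        // Example path and sizes are made up for illustration.
        SplitSketch sent = new SplitSketch("/warehouse/t/000000_0", 0L, 1024L, 4096L);
        SplitSketch received = fromBytes(sent.toBytes());
        long resolved = received.resolveFileLength(() -> {
            throw new IllegalStateException("file system call should not be needed");
        });
        System.out.println(resolved); // prints 4096
    }
}
```

The cost in this sketch is eight extra bytes per split, against one avoided file system call per task. A split built with `UNKNOWN_LENGTH` still works; it simply falls back to the lookup.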