Return-Path: X-Original-To: apmail-hive-issues-archive@minotaur.apache.org Delivered-To: apmail-hive-issues-archive@minotaur.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id 2097E17BA2 for ; Mon, 11 May 2015 04:11:01 +0000 (UTC) Received: (qmail 92288 invoked by uid 500); 11 May 2015 04:11:01 -0000 Delivered-To: apmail-hive-issues-archive@hive.apache.org Received: (qmail 92267 invoked by uid 500); 11 May 2015 04:11:01 -0000 Mailing-List: contact issues-help@hive.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: dev@hive.apache.org Delivered-To: mailing list issues@hive.apache.org Received: (qmail 92239 invoked by uid 99); 11 May 2015 04:11:00 -0000 Received: from arcas.apache.org (HELO arcas.apache.org) (140.211.11.28) by apache.org (qpsmtpd/0.29) with ESMTP; Mon, 11 May 2015 04:11:00 +0000 Date: Mon, 11 May 2015 04:11:00 +0000 (UTC) From: "Selina Zhang (JIRA)" To: issues@hive.apache.org Message-ID: In-Reply-To: References: Subject: [jira] [Commented] (HIVE-10036) Writing ORC format big table causes OOM - too many fixed sized stream buffers MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit X-JIRA-FingerPrint: 30527f35849b9dde25b450d4833f0394 [ https://issues.apache.org/jira/browse/HIVE-10036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14537540#comment-14537540 ] Selina Zhang commented on HIVE-10036: ------------------------------------- The above two unit test failures seem irrelevant to this patch. > Writing ORC format big table causes OOM - too many fixed sized stream buffers > ----------------------------------------------------------------------------- > > Key: HIVE-10036 > URL: https://issues.apache.org/jira/browse/HIVE-10036 > Project: Hive > Issue Type: Improvement > Reporter: Selina Zhang > Assignee: Selina Zhang > Labels: orcfile > Attachments: HIVE-10036.1.patch, HIVE-10036.2.patch, HIVE-10036.3.patch, HIVE-10036.5.patch, HIVE-10036.6.patch, HIVE-10036.7.patch, HIVE-10036.8.patch, HIVE-10036.9.patch > > > ORC writer keeps multiple out steams for each column. Each output stream is allocated fixed size ByteBuffer (configurable, default to 256K). For a big table, the memory cost is unbearable. Specially when HCatalog dynamic partition involves, several hundreds files may be open and writing at the same time (same problems for FileSinkOperator). > Global ORC memory manager controls the buffer size, but it only got kicked in at 5000 rows interval. An enhancement could be done here, but the problem is reducing the buffer size introduces worse compression and more IOs in read path. Sacrificing the read performance is always not a good choice. > I changed the fixed size ByteBuffer to a dynamic growth buffer which up bound to the existing configurable buffer size. Most of the streams does not need large buffer so the performance got improved significantly. Comparing to Facebook's hive-dwrf, I monitored 2x performance gain with this fix. > Solving OOM for ORC completely maybe needs lots of effort , but this is definitely a low hanging fruit. -- This message was sent by Atlassian JIRA (v6.3.4#6332)