hive-dev mailing list archives

From "Selina Zhang (JIRA)" <j...@apache.org>
Subject [jira] [Created] (HIVE-10036) Writing ORC format big table causes OOM - too many fixed sized stream buffers
Date Fri, 20 Mar 2015 18:36:39 GMT
Selina Zhang created HIVE-10036:
-----------------------------------

             Summary: Writing ORC format big table causes OOM - too many fixed sized stream
buffers
                 Key: HIVE-10036
                 URL: https://issues.apache.org/jira/browse/HIVE-10036
             Project: Hive
          Issue Type: Improvement
            Reporter: Selina Zhang
            Assignee: Selina Zhang


The ORC writer keeps multiple output streams for each column, and each output stream is
allocated a fixed-size ByteBuffer (configurable, defaulting to 256K). For a big table, the
memory cost is unbearable, especially when HCatalog dynamic partitioning is involved and
several hundred files may be open and writing at the same time (the same problem applies to
FileSinkOperator).
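A rough back-of-envelope sketch of why the fixed buffers hurt (the file/column counts below
are hypothetical, chosen only to illustrate the dynamic-partition scenario; the 256K buffer
size is the default mentioned above):

```java
// Hypothetical estimate of fixed-buffer memory cost for concurrent ORC writers.
public class OrcBufferEstimate {
    // Total bytes held in stream buffers: one fixed buffer per stream,
    // several streams per column, one writer per open file.
    static long estimateBytes(int openFiles, int columns, int streamsPerColumn,
                              int bufferSize) {
        return (long) openFiles * columns * streamsPerColumn * bufferSize;
    }

    public static void main(String[] args) {
        // Assumed workload: 300 open partition files, 50 columns,
        // 3 streams per column, 256 KiB buffers.
        long bytes = estimateBytes(300, 50, 3, 256 * 1024);
        System.out.println(bytes / (1024 * 1024) + " MiB"); // prints 11250 MiB
    }
}
```

Even with modest assumptions, the fixed allocation reaches tens of gigabytes before a single
row's worth of data justifies it.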

The global ORC memory manager controls the buffer size, but it only kicks in every 5000 rows.
That interval could be tightened, but the underlying problem is that shrinking the buffers
causes worse compression and more IO on the read path. Sacrificing read performance is rarely
a good choice.
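The check-interval pattern described above can be sketched as follows (a simplified model,
not the actual Hive MemoryManager; the constant and field names are assumptions):

```java
// Sketch of a memory manager that only re-evaluates buffer sizes
// every N rows, as described for the global ORC memory manager.
public class MemoryCheckSketch {
    static final int ROWS_BETWEEN_CHECKS = 5000; // assumed interval

    int rowsAddedSinceCheck = 0;
    int checks = 0; // how many times writers were asked to adjust buffers

    // Called once per row written; triggers a check only at the interval,
    // so allocation decisions lag the actual memory pressure.
    void addedRow() {
        if (++rowsAddedSinceCheck >= ROWS_BETWEEN_CHECKS) {
            rowsAddedSinceCheck = 0;
            checks++; // real manager would notify writers here
        }
    }

    public static void main(String[] args) {
        MemoryCheckSketch m = new MemoryCheckSketch();
        for (int i = 0; i < 12000; i++) {
            m.addedRow();
        }
        System.out.println(m.checks + " checks"); // prints 2 checks
    }
}
```

Between two checks a writer can allocate hundreds of full-size buffers unchecked, which is
why tightening the interval alone does not fix the problem.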

I changed the fixed-size ByteBuffer to a dynamically growing buffer, bounded above by the
existing configurable buffer size. Most streams do not need a large buffer, so performance
improved significantly; compared with Facebook's hive-dwrf, I measured a 2x performance gain
with this fix.
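A minimal sketch of the grow-on-demand idea (this is an illustration of the approach, not the
actual patch; class and method names are made up):

```java
import java.nio.ByteBuffer;

// Sketch: a stream buffer that starts small and doubles on demand,
// never exceeding the configured maximum stream buffer size.
public class GrowableBuffer {
    private final int maxCapacity;
    private ByteBuffer buf;

    GrowableBuffer(int initialCapacity, int maxCapacity) {
        this.maxCapacity = maxCapacity;
        this.buf = ByteBuffer.allocate(initialCapacity);
    }

    void put(byte[] data) {
        // Double the capacity until the data fits, capped at maxCapacity.
        while (buf.remaining() < data.length) {
            int newCap = Math.min(buf.capacity() * 2, maxCapacity);
            if (newCap == buf.capacity()) {
                // At the cap: the real writer would flush/spill here.
                throw new IllegalStateException("buffer full; flush required");
            }
            ByteBuffer bigger = ByteBuffer.allocate(newCap);
            buf.flip();
            bigger.put(buf);
            buf = bigger;
        }
        buf.put(data);
    }

    int capacity() { return buf.capacity(); }

    public static void main(String[] args) {
        // Start at 8 KiB, cap at the 256 KiB default.
        GrowableBuffer b = new GrowableBuffer(8 * 1024, 256 * 1024);
        b.put(new byte[20 * 1024]);
        System.out.println(b.capacity()); // prints 32768
    }
}
```

Streams that stay small (e.g. present/isNull streams) never pay for a full 256K buffer, while
heavy data streams still reach the configured cap, preserving compression and read behavior.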

Completely solving OOM for ORC may take a lot of effort, but this is definitely low-hanging
fruit.





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
