hive-issues mailing list archives

From "Sergey Shelukhin (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (HIVE-13345) LLAP: metadata cache takes too much space, esp. with bloom filters, due to Java/protobuf overhead
Date Mon, 28 Mar 2016 19:27:25 GMT

    [ https://issues.apache.org/jira/browse/HIVE-13345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15214728#comment-15214728 ]

Sergey Shelukhin edited comment on HIVE-13345 at 3/28/16 7:27 PM:
------------------------------------------------------------------

[~gopalv] [~prasanth_j] [~owen.omalley] opinions on the best approach? I am leaning towards
changing ORC to use POJOs instead of the OrcProto classes, but as an alternative we can change the
metadata cache in LLAP to store serialized metadata. The tradeoff is the cost of deserializing every
time in LLAP vs the cost of copying fields/converting some things (e.g. OrcProto stores bloom filters
as List<Long>, which aside from being horrible on purely practical grounds, offends my engineering
sensibilities, so I might be biased here).
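
To make the "copying fields/converting" option concrete, here is a minimal sketch of turning a boxed List<Long> bitset into a primitive long[] once, at cache-insert time. The class and method names (BloomBitsetSketch, toPrimitive, testBit) are hypothetical and are not part of ORC or LLAP; the bit layout (64 bits per word, lowest bit first) is also an assumption made for illustration. The point is only that a long[] holds 8 bytes per word, while a List<Long> holds a reference plus a separate Long object per word.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not an ORC or LLAP class.
final class BloomBitsetSketch {
    private BloomBitsetSketch() {}

    // Copy a boxed bitset into a primitive array, one unboxing per word.
    static long[] toPrimitive(List<Long> boxed) {
        long[] words = new long[boxed.size()];
        for (int i = 0; i < words.length; i++) {
            words[i] = boxed.get(i); // auto-unboxing; throws on null elements
        }
        return words;
    }

    // Test a single bit, assuming 64 bits per word with bit 0 as the lowest bit.
    static boolean testBit(long[] words, int bitIndex) {
        return (words[bitIndex >>> 6] & (1L << (bitIndex & 63))) != 0;
    }

    public static void main(String[] args) {
        List<Long> boxed = new ArrayList<>();
        boxed.add(0b101L);          // bits 0 and 2 set in word 0
        boxed.add(Long.MIN_VALUE);  // only bit 63 set in word 1
        long[] words = toPrimitive(boxed);
        System.out.println(testBit(words, 2));   // bit 2 of word 0
        System.out.println(testBit(words, 127)); // bit 63 of word 1
    }
}
```

The conversion is paid once per cached stripe, after which lookups touch only the compact array.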



was (Author: sershe):
[~gopalv] [~prasanth_j] [~owen.omalley] opinions on the best approach? I am leaning towards
changing ORC to use POJOs instead of OrcProto stuff, but as an alternative we can change metadata
cache in LLAP to store serialized metadata. The cost of deserializing every time in LLAP vs
the cost of copying fields/converting some things (e.g. OrcProto stores bloom filters as List<Long>,
which aside from being horrible on pure merits, offends my engineering sensibilities, so I
might be biased here).


> LLAP: metadata cache takes too much space, esp. with bloom filters, due to Java/protobuf overhead
> -------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-13345
>                 URL: https://issues.apache.org/jira/browse/HIVE-13345
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Sergey Shelukhin
>
> We currently cache Java objects, which have high overhead: average stripe metadata takes
> 200-500KB on real files, and with bloom filters, which blow up more than 5x because they are
> stored as a list of Longs, up to 5MB per stripe. That is undesirable.
> We should either create better objects for ORC (might be good in general) or store serialized
> metadata and deserialize when needed.
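
The second option, storing serialized metadata and deserializing when needed, could look roughly like the sketch below. Everything here is hypothetical: SerializedMetadataCache is not an LLAP class, the deserializer is a plain java.util.function.Function standing in for whatever parser would really be used, and the 8-byte payload in main is a toy stand-in for stripe metadata. It only illustrates the shape of the tradeoff: the cache holds compact byte arrays, and every read pays the deserialization cost.

```java
import java.nio.ByteBuffer;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical sketch; not the LLAP metadata cache.
final class SerializedMetadataCache<K, V> {
    private final Map<K, byte[]> bytesByKey = new ConcurrentHashMap<>();
    private final Function<byte[], V> deserializer;

    SerializedMetadataCache(Function<byte[], V> deserializer) {
        this.deserializer = deserializer;
    }

    // Store only the serialized form; copy so later caller mutations do not leak in.
    void put(K key, byte[] serialized) {
        bytesByKey.put(key, serialized.clone());
    }

    // Deserialize on every read; returns null when the key is absent.
    V get(K key) {
        byte[] bytes = bytesByKey.get(key);
        return bytes == null ? null : deserializer.apply(bytes);
    }

    public static void main(String[] args) {
        // Toy payload: a single big-endian long standing in for real metadata.
        SerializedMetadataCache<String, Long> cache =
            new SerializedMetadataCache<>(b -> ByteBuffer.wrap(b).getLong());
        cache.put("stripe-0", ByteBuffer.allocate(8).putLong(42L).array());
        System.out.println(cache.get("stripe-0"));
        System.out.println(cache.get("stripe-1"));
    }
}
```

Compared with caching converted objects, this keeps memory close to the on-disk size but moves the cost to each lookup.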



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)