orc-dev mailing list archives

From "Chris Drome (JIRA)" <j...@apache.org>
Subject [jira] [Created] (ORC-299) Improve heuristics for bailing on dictionary encoding
Date Thu, 08 Feb 2018 00:22:00 GMT
Chris Drome created ORC-299:
-------------------------------

             Summary: Improve heuristics for bailing on dictionary encoding
                 Key: ORC-299
                 URL: https://issues.apache.org/jira/browse/ORC-299
             Project: ORC
          Issue Type: Improvement
            Reporter: Chris Drome


Recently a user ran into the following failure:

{noformat}

Caused by: java.lang.NullPointerException
    at java.lang.System.arraycopy(Native Method)
    at org.apache.hadoop.hive.ql.io.orc.DynamicByteArray.add(DynamicByteArray.java:115)
    at org.apache.hadoop.hive.ql.io.orc.StringRedBlackTree.addNewKey(StringRedBlackTree.java:48)
    at org.apache.hadoop.hive.ql.io.orc.StringRedBlackTree.add(StringRedBlackTree.java:55)
    at org.apache.hadoop.hive.ql.io.orc.WriterImpl$StringTreeWriter.write(WriterImpl.java:1250)
    at org.apache.hadoop.hive.ql.io.orc.WriterImpl$StructTreeWriter.write(WriterImpl.java:1797)
    at org.apache.hadoop.hive.ql.io.orc.WriterImpl.addRow(WriterImpl.java:2469)
    at org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat$OrcRecordWriter.write(OrcOutputFormat.java:86)
    at org.apache.hadoop.hive.ql.exec.FileSinkOperator.process(FileSinkOperator.java:753)
    at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:838)
    at org.apache.hadoop.hive.ql.exec.SelectOperator.process(SelectOperator.java:88)
    at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:838)
    at org.apache.hadoop.hive.ql.exec.FilterOperator.process(FilterOperator.java:122)
    at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:838)
    at org.apache.hadoop.hive.ql.exec.TableScanOperator.process(TableScanOperator.java:110)
    at org.apache.hadoop.hive.ql.exec.MapOperator$MapOpCtx.forward(MapOperator.java:165)
    at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:536)
    ... 18 more

{noformat}

 

I tracked this down to the following in DynamicByteArray.java, which is being used to create
the dictionary for a particular column:

{noformat}

private int length;

{noformat}

 

Because length is a signed 32-bit int, this has the side effect of capping the memory available for the dictionary at 2GB (Integer.MAX_VALUE bytes).
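
To illustrate the cap, here is a minimal standalone sketch of what happens to an int byte counter near the 2GB boundary. The class and method names (IntLengthLimit, wouldOverflow) are hypothetical and are not part of DynamicByteArray; how the wrapped counter ends up as the NullPointerException above is not established here.

```java
final class IntLengthLimit {
    /**
     * Returns true if appending extraBytes to a buffer that already holds
     * currentLength bytes would push the total past Integer.MAX_VALUE.
     * Hypothetical helper, not taken from the ORC or Hive sources.
     */
    static boolean wouldOverflow(int currentLength, int extraBytes) {
        // Widen to long so the check itself cannot wrap around.
        return (long) currentLength + extraBytes > Integer.MAX_VALUE;
    }

    public static void main(String[] args) {
        int length = Integer.MAX_VALUE - 10; // counter just under the 2GB cap
        int wrapped = length + 100;          // plain int arithmetic wraps silently

        System.out.println("wrapped counter is negative: " + (wrapped < 0));
        System.out.println("guard detects overflow: " + wouldOverflow(length, 100));
    }
}
```

A guard like this only turns the silent wrap into something detectable; it does not lift the limit itself.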

 

Given the size of column values in this use case, and the fact that the user is exceeding
this 2GB limit, there should probably be some heuristics that bail early on dictionary creation,
so this limitation is never reached. Given the size of data that would be required to hit
this limit, it is unlikely that a dictionary would be useful.
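
One possible shape for such a heuristic is sketched below. Everything in it is an assumption for discussion: the class name, the byte budget, the distinct-key ratio cutoff, and the minimum row count are all hypothetical and do not reflect how WriterImpl currently decides on dictionary encoding.

```java
final class DictionaryBailoutSketch {
    private final long maxDictionaryBytes; // hypothetical budget, well below 2GB
    private final double maxDistinctRatio; // hypothetical distinct-keys/rows cutoff
    private final long minRowsForRatio;    // rows to see before trusting the ratio

    private long dictionaryBytes; // tracked as long so the counter cannot wrap
    private long distinctKeys;
    private long rows;

    DictionaryBailoutSketch(long maxDictionaryBytes, double maxDistinctRatio,
                            long minRowsForRatio) {
        this.maxDictionaryBytes = maxDictionaryBytes;
        this.maxDistinctRatio = maxDistinctRatio;
        this.minRowsForRatio = minRowsForRatio;
    }

    /** Record one written value; isNewKey says whether it added a dictionary entry. */
    void observe(int valueLengthBytes, boolean isNewKey) {
        rows++;
        if (isNewKey) {
            distinctKeys++;
            dictionaryBytes += valueLengthBytes;
        }
    }

    /** True once the dictionary is too large or too unselective to be worth keeping. */
    boolean shouldBail() {
        if (dictionaryBytes > maxDictionaryBytes) {
            return true; // size budget exceeded, long before the int limit
        }
        return rows >= minRowsForRatio
            && (double) distinctKeys / rows > maxDistinctRatio;
    }
}
```

A writer could call observe() for each value and fall back to direct encoding as soon as shouldBail() returns true, so the dictionary never grows anywhere near the 2GB limit.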



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
