orc-dev mailing list archives

From "周宇睿(闻拙)" <yurui....@alibaba-inc.com>
Subject Propose to add EncodedStringVectorBatch to expose string dictionary
Date Wed, 13 Feb 2019 08:37:59 GMT
Hi All,


Currently the ORC reader's StringVectorBatch carries no information about how its data
is encoded, even though the string dictionary can be very useful in various situations. I
would like to add an EncodedStringVectorBatch to the ORC reader that exposes the string
dictionary (if available) to external consumers. Exposing the string dictionary has the
following benefits:

1. Enable computation over encoded data. By exposing the dictionary in the vector batch, we
   can implement the filter operator more efficiently. In our POC, filtering on encoded
   data gave an 8% end-to-end performance improvement on TPC-H Q1.
2. Make data serialization more efficient. Currently, when serializing an ORC vector batch,
   we have to copy all the strings in a vector even though the string data is already
   dictionary encoded. Exposing the string dictionary lets the vector batch serializer
   skip the unnecessary string memcpy, which should greatly improve serialization
   efficiency.
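
To illustrate the first benefit, here is a minimal sketch of an equality filter over
dictionary-encoded strings. The EncodedStringBatch struct and the filterEquals function are
hypothetical names made up for this example; they are not the existing ORC API and not
necessarily the layout proposed in ORC-469. The idea is that the string comparison runs
once per distinct dictionary entry, and each row is then decided by an integer lookup.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical layout (an assumption for illustration, not the ORC API):
// a batch that carries dictionary codes instead of materialized strings.
struct EncodedStringBatch {
    std::vector<std::string> dictionary;  // distinct string values
    std::vector<int64_t> codes;           // per-row index into `dictionary`
};

// Returns the row indices whose value equals `needle`.
// String comparisons happen once per dictionary entry, not once per row.
std::vector<std::size_t> filterEquals(const EncodedStringBatch& batch,
                                      const std::string& needle) {
    // Step 1: evaluate the predicate on each distinct value.
    std::vector<char> matches(batch.dictionary.size(), 0);
    for (std::size_t i = 0; i < batch.dictionary.size(); ++i) {
        matches[i] = (batch.dictionary[i] == needle) ? 1 : 0;
    }

    // Step 2: select rows by looking up the precomputed result for each code.
    std::vector<std::size_t> selected;
    for (std::size_t row = 0; row < batch.codes.size(); ++row) {
        const int64_t code = batch.codes[row];
        assert(code >= 0 &&
               static_cast<std::size_t>(code) < batch.dictionary.size());
        if (matches[static_cast<std::size_t>(code)]) {
            selected.push_back(row);
        }
    }
    return selected;
}
```

With a small dictionary and many rows, which is a plausible shape for a low-cardinality
column such as the flag columns in TPC-H Q1, most of the per-row work becomes an array
lookup rather than a string comparison.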

I opened a Jira at https://jira.apache.org/jira/browse/ORC-469


Any thoughts?




