orc-dev mailing list archives

From "Yurui Zhou (JIRA)" <j...@apache.org>
Subject [jira] [Created] (ORC-469) Add EncodedStringVectorBatch to expose string dictionary in VectorBatch
Date Wed, 13 Feb 2019 08:17:00 GMT
Yurui Zhou created ORC-469:
------------------------------

             Summary: Add EncodedStringVectorBatch to expose string dictionary in VectorBatch
                 Key: ORC-469
                 URL: https://issues.apache.org/jira/browse/ORC-469
             Project: ORC
          Issue Type: Improvement
          Components: Reader
    Affects Versions: 1.5.5, 1.6.0, 2.0.0
            Reporter: Yurui Zhou


Propose adding EncodedStringVectorBatch to expose the string dictionary in VectorBatch. Exposing
the string dictionary has the following benefits:
 * Enable computation over encoded data. By exposing the dictionary in the vector batch, we
would be able to implement filter operators more efficiently. In our POC, filtering directly
on encoded data gave an 8% end-to-end performance improvement.
 * Make data serialization more efficient. Currently, when serializing an ORC VectorBatch, we
have to copy all the strings in a vector even though the string data is already dictionary
encoded. Exposing the string dictionary would let the vector batch serializer skip these
unnecessary string memcpy calls, which should greatly improve serialization efficiency.
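
To illustrate both points, here is a minimal sketch in Python rather than the reader's own
language. The EncodedStringBatch class, its dictionary and index fields, and the byte layout
used by serialize are assumptions made up for this example; they are not the proposed ORC API
or any existing wire format. The filter compares the needle against each distinct dictionary
entry once and then scans only the integer indexes, and the serializer writes each distinct
string once instead of once per row.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class EncodedStringBatch:
    """Hypothetical stand-in for a dictionary-encoded string column."""
    dictionary: List[bytes]  # distinct string values
    index: List[int]         # per-row position into `dictionary`


def filter_equals(batch: EncodedStringBatch, needle: bytes) -> List[int]:
    """Return the row numbers whose string value equals `needle`."""
    # Compare against each distinct string once, not once per row.
    matching = {i for i, value in enumerate(batch.dictionary) if value == needle}
    if not matching:
        return []  # no dictionary entry matches, so no row can match
    # The per-row work is now an integer membership test.
    return [row for row, code in enumerate(batch.index) if code in matching]


def serialize(batch: EncodedStringBatch) -> bytes:
    """Write the dictionary once, then one 4-byte index per row (assumed layout)."""
    out = bytearray()
    out += len(batch.dictionary).to_bytes(4, "little")
    for value in batch.dictionary:
        out += len(value).to_bytes(4, "little")
        out += value  # each distinct string is copied exactly once
    out += len(batch.index).to_bytes(4, "little")
    for code in batch.index:
        out += code.to_bytes(4, "little")
    return bytes(out)


if __name__ == "__main__":
    batch = EncodedStringBatch(
        dictionary=[b"cn", b"us", b"de"],
        index=[0, 1, 1, 2, 0, 1],
    )
    print(filter_equals(batch, b"us"))
    print(len(serialize(batch)))
```

The saving in both functions grows with the ratio of rows to distinct values. A column with
few distinct strings benefits the most, while a column of nearly unique strings gains little.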



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
