orc-dev mailing list archives

From "Yurui Zhou (JIRA)" <j...@apache.org>
Subject [jira] [Created] (ORC-469) Add EncodedStringVectorBatch to expose string dictionary in VectorBatch
Date Wed, 13 Feb 2019 08:17:00 GMT
Yurui Zhou created ORC-469:

             Summary: Add EncodedStringVectorBatch to expose string dictionary in VectorBatch
                 Key: ORC-469
                 URL: https://issues.apache.org/jira/browse/ORC-469
             Project: ORC
          Issue Type: Improvement
          Components: Reader
    Affects Versions: 1.5.5, 1.6.0, 2.0.0
            Reporter: Yurui Zhou

This issue proposes adding EncodedStringVectorBatch to expose the string dictionary in
VectorBatch. Exposing the string dictionary has the following benefits:
 * Enable computation over encoded data. With the dictionary exposed in the vector batch, a
filter operator can be implemented more efficiently. In our POC, filtering on encoded data
gave an 8% E2E perf improvement.
 * Make data serialization more efficient. Currently, when serializing an ORC VectorBatch, we
have to copy every string in a vector even though the string data is already dictionary
encoded. Exposing the string dictionary lets the vector batch serializer skip the unnecessary
string memcpy, which will greatly improve serialization efficiency.
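
A minimal sketch of both points, assuming a simplified stand-in type. EncodedStrings,
filterEquals, bytesPerRow and bytesDictionaryOnce are hypothetical names made up for
illustration and are not the ORC C++ API. The filter compares strings once per dictionary
entry and then compares only integers per row. The two byte counters contrast copying one
string per row with writing the dictionary once.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for a dictionary-encoded string column.
// Illustrative only; this is not the ORC C++ API.
struct EncodedStrings {
  std::vector<std::string> dictionary;  // each distinct value stored once
  std::vector<int64_t> index;           // per row: position in `dictionary`
};

// Equality filter on encoded data: string comparisons happen once per
// dictionary entry, then each row needs only an integer comparison.
std::vector<size_t> filterEquals(const EncodedStrings& batch,
                                 const std::string& needle) {
  int64_t match = -1;
  for (size_t i = 0; i < batch.dictionary.size(); ++i) {
    if (batch.dictionary[i] == needle) {
      match = static_cast<int64_t>(i);
      break;
    }
  }
  std::vector<size_t> selected;
  if (match < 0) return selected;  // value not in dictionary: no row matches
  for (size_t row = 0; row < batch.index.size(); ++row) {
    assert(batch.index[row] >= 0 &&
           static_cast<size_t>(batch.index[row]) < batch.dictionary.size());
    if (batch.index[row] == match) selected.push_back(row);
  }
  return selected;
}

// String bytes copied when a serializer writes one decoded string per row.
size_t bytesPerRow(const EncodedStrings& batch) {
  size_t total = 0;
  for (int64_t i : batch.index) {
    total += batch.dictionary[static_cast<size_t>(i)].size();
  }
  return total;
}

// String bytes copied when the dictionary is written once and the rows are
// sent as indices.
size_t bytesDictionaryOnce(const EncodedStrings& batch) {
  size_t total = 0;
  for (const std::string& s : batch.dictionary) total += s.size();
  return total;
}
```

The gap between the two byte counts grows with the number of rows that repeat a value,
which is the case where dictionary encoding is chosen in the first place.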

This message was sent by Atlassian JIRA