orc-dev mailing list archives

From "Gang Wu" <gan...@alibaba-inc.com>
Subject ORC contribution from Alibaba
Date Wed, 26 Apr 2017 06:13:53 GMT
This is Gang from Alibaba, working on Alibaba's big data platform, MaxCompute. We have developed
our own columnar storage format within MaxCompute to support MapReduce and other batch processing
workloads. But as Apache ORC is becoming popular in the industry, we are actively looking at
integrating the ORC format into MaxCompute.
In the past few months, Xiening (cc'ed) and I have been working on enhancing ORC C++ to provide
a full-featured C++ reader and writer. Our work mainly involves adding a C++ writer that supports
all data types and statistics, and supporting indexes in both the reader and the writer. As of
today, we have finished development and testing and plan to contribute this work back to the
Apache ORC project. We have communicated with Owen via email and have created an umbrella JIRA,
ORC-179, for the plan. In brief, we plan to do the following:
  1. Refactor common classes for writer and reader
    -- extract common classes and functions for writer and reader to share
  2. OutputStream interface for writer
    -- implement several output streams for writing to memory, file, etc.
    -- implement ByteRleEncoder, RleEncoder, BooleanRleEncoder, etc.
    -- support zlib compression
  3. ORC Writer
    -- write ORC file header, file footer, postscript, etc.
    -- write columns of all types
    -- write column statistics
    -- write index streams in the writer, and let the reader seek to a row using the index information
  4. Other
    -- some minor bug fixes to the current code base

Should you have any questions, please feel free to contact us. Any feedback and suggestions
are welcome. Thanks!
Gang Wu
Senior Engineer
Alibaba Group