orc-dev mailing list archives

From xndai <...@git.apache.org>
Subject [GitHub] orc pull request #247: ORC-339. Reorganize the ORC file format specification...
Date Fri, 13 Apr 2018 23:38:47 GMT
Github user xndai commented on a diff in the pull request:

    https://github.com/apache/orc/pull/247#discussion_r181530787
  
    --- Diff: site/specification/ORCv2.md ---
    @@ -0,0 +1,1032 @@
    +---
    +layout: page
    +title: Evolving Draft for ORC Specification v2
    +---
    +
    +This specification is rapidly evolving and should only be used for
    +developers on the project.
    +
    +# TO DO items
    --- End diff --
    
    Is this the final list of items for v2, or are we still working on it? I have one proposal
to add to ORC v2, which I call a "clustered index". The writer can specify a sort order on
one or more columns, and we then create an index section in the ORC file whose keys are the
values of those column(s) and whose values are row numbers. To keep the index small, each row
group gets a single entry in the clustered index. This would enable a new range-scan pattern
in which the reader provides lower and upper bounds on the column value(s).
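
To make the idea concrete, here is a rough sketch of how such an index could be built and queried. It is only an illustration under assumptions that are not part of ORC or of this PR: a single sort column, one entry per row group holding the group's first key and first row number, and hypothetical function names (`build_clustered_index`, `candidate_row_range`). The lookup is conservative, so the reader would still filter rows inside the returned range.

```python
from bisect import bisect_left, bisect_right


def build_clustered_index(sorted_keys, rows_per_group):
    """Return one (first_key, first_row) entry per row group.

    Assumes sorted_keys is already in ascending order (hypothetical layout).
    """
    return [
        (sorted_keys[row], row)
        for row in range(0, len(sorted_keys), rows_per_group)
    ]


def candidate_row_range(index, total_rows, lower, upper):
    """Return a half-open row range [start, end) that may hold keys
    in the closed interval [lower, upper]."""
    first_keys = [key for key, _ in index]
    # The group just before the first one starting at >= lower
    # may still end with rows equal to `lower`, so include it.
    first_group = max(bisect_left(first_keys, lower) - 1, 0)
    # Groups whose first key is greater than `upper` cannot match.
    end_group = bisect_right(first_keys, upper)
    if first_group >= end_group:
        return (0, 0)  # nothing to read
    start_row = index[first_group][1]
    end_row = index[end_group][1] if end_group < len(index) else total_rows
    return (start_row, end_row)


keys = [1, 3, 3, 5, 7, 7, 7, 9, 11, 13]
index = build_clustered_index(keys, rows_per_group=3)
rows = candidate_row_range(index, len(keys), lower=7, upper=9)
```

With three rows per group, the scan for keys between 7 and 9 only needs to touch the second and third row groups; the first and last can be skipped without being read.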
    
    I can write up a detailed proposal for this.

