orc-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Xiening Dai <xndai....@live.com>
Subject Orc v2 Ideas
Date Fri, 28 Sep 2018 21:40:30 GMT
Hi all,

While we are working on the new Orc v2 spec, I want to bounce some ideas in this group. If
we can get something concrete, I will open JIRAs to follow up. Some of these ideas were mentioned
before in various discussion, but I just put them together in a list so people can comment
and provide feedback. Thanks.

  *   Clustered Index

In a lot of time, data written into Orc file has sorting property. For example a sort merge
join output will result in the data stream being sorted on join key(s). Another example is
the DISTRIBUTED BY … SORTED BY … keywords can enforce the sorting property on certain
data set. Under such cases, if we can just record the sort key(s) values for the first row
of each row group it will help us a lot while doing lookup using key ranges, because we already
have row group index which gives us ~O(1) seek time. During query execution, a SQL filter
predicate can be easily turned into a key range. For example “WHERE id > 0 and id <=
100” will be translated into range (0, 100], and this key range can be passed down all the
way to the Orc reader. Then we only need to load the corresponding row groups that covers
this range.

  *   Stripe Footer Location

Today stripe footers are stored at the end of each stripe. This design probably come from
the Hive world where the implementation tries to align Orc stripe with an HDFS block. It would
make sense when you only need to read one HDFS block for both the data and the footer. But
the alignment assumption doesn’t hold in other systems that leverage Orc as a columnar data
format. Besides even for Hive, often time it’s hard to make sure good alignment due to various
reasons - for example, when memory pressure is high stripe needs to be flushed to disk earlier.
With this in mind, it will make sense to support saving stripe footer at the end of the file,
together with all the other file meta. This would be easier for one sequential IO to load
all the meta, and is easier to cache them all together. And we can make this configurable
through writer options.

  *   File Level Dictionary

Currently Orc builds dictionary at stripe level. Each stripe has its own dictionary. But in
most cases, data across stripes share a lot of similarities. Building one file level dictionary
is probably more efficient than having one dictionary each stripe. We can reduce storage footprint
and also improve read performance since we only have one dictionary per column per file. One
challenge with this design is how to do file merge. Two files can have two different dictionary,
and we need to be able to merge them without rewriting all the data. To solve this problem,
we will need to support multiple dictionaries identified by uuid. Each stripe records the
dictionary ID that identifies the dictionary it uses. And the reader loads the particular
dictionary based on the ID when it loads a stripe. When you merge two files, dictionary data
doesn’t need to be changed, but just to save the dictionaries from both files in the new
merged file.

  *   Breaking Compression Block and RLE Runs at Row Group Boundary

Owen has mentioned this in previous discussion. We did a prototype and are able to show that
there’s only a slight increase of file size (< 1%) with the change. But the benefit is
obvious - all the seek to row group operation will not involve unnecessary decoding/decompression,
making it really efficient. And this is critical in scenarios such as predicate pushdown or
range scan using clustered index (see my first bullet point). The other benefit is doing so
will greatly simply the index implementation we have today. We will only need to record a
file offset for row group index.

  *   Encoding and Compression

The encoding today doesn’t have a lot of flexibility. Sometimes we would need to configure
and fine tune encoding when it’s needed. For example, in previous discussions Gang brought
up, we found LEB128 causes zStd to perform really bad. We would end up with much better result
by just disabling LEB128 under zstd compression. We don’t have flexibility for these kind
of things today. And we will need additional meta fields for that.

  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message