orc-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Dain Sundstrom <d...@iq80.com>
Subject Re: Orc v2 Ideas
Date Mon, 01 Oct 2018 18:51:52 GMT

> On Sep 28, 2018, at 2:40 PM, Xiening Dai <xndai.git@live.com> wrote:
> Hi all,
> While we are working on the new Orc v2 spec, I want to bounce some ideas in this group.
If we can get something concrete, I will open JIRAs to follow up. Some of these ideas were
mentioned before in various discussion, but I just put them together in a list so people can
comment and provide feedback. Thanks.
>  *   Clustered Index
> In a lot of time, data written into Orc file has sorting property. For example a sort
merge join output will result in the data stream being sorted on join key(s). Another example
is the DISTRIBUTED BY … SORTED BY … keywords can enforce the sorting property on certain
data set. Under such cases, if we can just record the sort key(s) values for the first row
of each row group it will help us a lot while doing lookup using key ranges, because we already
have row group index which gives us ~O(1) seek time. During query execution, a SQL filter
predicate can be easily turned into a key range. For example “WHERE id > 0 and id <=
100” will be translated into range (0, 100], and this key range can be passed down all the
way to the Orc reader. Then we only need to load the corresponding row groups that covers
this range.

Don’t we already have min and max for every row group?  If so, isn’t that a superset of
this feature?

>  *   Stripe Footer Location
> Today stripe footers are stored at the end of each stripe. This design probably come
from the Hive world where the implementation tries to align Orc stripe with an HDFS block.
It would make sense when you only need to read one HDFS block for both the data and the footer.
But the alignment assumption doesn’t hold in other systems that leverage Orc as a columnar
data format. Besides even for Hive, often time it’s hard to make sure good alignment due
to various reasons - for example, when memory pressure is high stripe needs to be flushed
to disk earlier. With this in mind, it will make sense to support saving stripe footer at
the end of the file, together with all the other file meta. This would be easier for one sequential
IO to load all the meta, and is easier to cache them all together. And we can make this configurable
through writer options.

The end of the file contains a metadata section with a summary of the stripe statistics. 
I’m curious what information you would like that isn’t already present that data structure.

Also, I’m curious how these other  systems "that leverage Orc as a columnar data format"
are accessing this stripe footer information.  Specifically, is it cached or loaded on demand
from storage?  Are you using this for planning or data skipping?

>  *   File Level Dictionary
> Currently Orc builds dictionary at stripe level. Each stripe has its own dictionary.
But in most cases, data across stripes share a lot of similarities. Building one file level
dictionary is probably more efficient than having one dictionary each stripe. We can reduce
storage footprint and also improve read performance since we only have one dictionary per
column per file. One challenge with this design is how to do file merge. Two files can have
two different dictionary, and we need to be able to merge them without rewriting all the data.
To solve this problem, we will need to support multiple dictionaries identified by uuid. Each
stripe records the dictionary ID that identifies the dictionary it uses. And the reader loads
the particular dictionary based on the ID when it loads a stripe. When you merge two files,
dictionary data doesn’t need to be changed, but just to save the dictionaries from both
files in the new merged file.

In my experience, this over head is very small when using the default 128MB stripe settings,
and I’d guess is  reasonable at 64MB or 32MB.  What we typically see is that dictionary
columns compress amazingly well, and the other columns in the table take up the majority of
the space in a table.  Even when you include the repeated cost of the dictionary per stripe,
the over head is tiny.

On the other hand, there are cases where you have columns that have a small common set of
values mixed in with pretty distinct values (skewed data), and in those cases the dictionary
blows up as you add more rows to the stripe.  The FB fork of ORC, DWRF, addresses this by
having support for a "per row group" dictionary.  Another, alternative is to support mixed
direct and dictionary in the same column, but that is pretty complex to implement and effectively
disables downstream dictionary processing.

BTW, I’m not advocating for either of these dictionary changes, just providing my observations.

>  *   Breaking Compression Block and RLE Runs at Row Group Boundary
> Owen has mentioned this in previous discussion. We did a prototype and are able to show
that there’s only a slight increase of file size (< 1%) with the change. But the benefit
is obvious - all the seek to row group operation will not involve unnecessary decoding/decompression,
making it really efficient. And this is critical in scenarios such as predicate pushdown or
range scan using clustered index (see my first bullet point). The other benefit is doing so
will greatly simply the index implementation we have today. We will only need to record a
file offset for row group index.

This is one I’d like to see.  The complexity of encodings spanning row groups, makes skipping
super complex. Would you extend this to compression blocks?

BTW, there are datasets that would become much bigger with this change, but I think on average
the change would be tiny.

>  *   Encoding and Compression
> The encoding today doesn’t have a lot of flexibility. Sometimes we would need to configure
and fine tune encoding when it’s needed. For example, in previous discussions Gang brought
up, we found LEB128 causes zStd to perform really bad. We would end up with much better result
by just disabling LEB128 under zstd compression. We don’t have flexibility for these kind
of things today. And we will need additional meta fields for that.

I think this brings up a much bigger issue.  Why have such flexibility in the compression
algorithms?  The interaction between the block compression and the data encodings, can have
dramatic effects on the format.  Can we consider limiting the compression to LZ4 and ZSTD
(or may be just ZSTD), and then design encodings that play well with them?  Also, ZSTD can
have pre-trained "dictionary" that might help with specific encodings…. Just a thought.


View raw message