orc-dev mailing list archives

From Dain Sundstrom <d...@iq80.com>
Subject Re: Orc v2 Ideas
Date Mon, 01 Oct 2018 22:55:52 GMT
On Oct 1, 2018, at 2:30 PM, Owen O'Malley <owen.omalley@gmail.com> wrote:
> On Fri, Sep 28, 2018 at 2:40 PM Xiening Dai <xndai.git@live.com> wrote:
>> Hi all,
>> While we are working on the new Orc v2 spec, I want to bounce some ideas
>> in this group. If we can get something concrete, I will open JIRAs to
>> follow up. Some of these ideas were mentioned before in various discussion,
>> but I just put them together in a list so people can comment and provide
>> feedback. Thanks.
>>  *   Clustered Index
>> In many cases, data written into an Orc file has a sorting property. For
>> example, a sort merge join output will result in the data stream being
>> sorted on the join key(s). Another example is that the DISTRIBUTED BY …
>> SORTED BY … keywords can enforce the sorting property on a certain data
>> set. In such cases, if we just record the sort key values for the first
>> row of each row group, it will help us a lot when doing lookups using key
>> ranges, because we already have the row group index which gives us ~O(1)
>> seek time. During query execution, a SQL filter predicate can easily be
>> turned into a key range. For example, “WHERE id > 0 and id <= 100” will be
>> translated into the range (0, 100], and this key range can be passed down
>> all the way to the Orc reader. Then we only need to load the row groups
>> that cover this range.
> Dain covered this already, but the min/max statistics should already handle
> this. One piece that we should add is automatically recording whether the
> data is strictly ascending or descending.

Interesting idea.  This could help some processors of the data.  Also, if the format has this,
it would be good to support "clustered" and "unique" as flags for data that isn’t strictly
sorted, but has all of the same values clustered together.  Then again, this seems like a
property for the table/partition.
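The pruning both messages describe can be sketched briefly. This is a hypothetical illustration (the function and statistics layout are invented, not the real ORC reader API): a predicate such as `WHERE id > 0 AND id <= 100` becomes a half-open range, and any row group whose min/max interval does not intersect it is skipped.

```python
# Hypothetical sketch: pruning row groups with per-group min/max statistics,
# as the existing ORC row-group index already permits. A predicate like
# "WHERE id > 0 AND id <= 100" becomes the range (low, high].

def row_groups_to_read(stats, low, high):
    """stats: list of (min, max) per row group; range is (low, high]."""
    selected = []
    for i, (mn, mx) in enumerate(stats):
        # Keep the group unless it lies entirely outside (low, high].
        if mx > low and mn <= high:
            selected.append(i)
    return selected

# Row groups of 10k sorted ids: group i covers [i*10000, (i+1)*10000 - 1].
stats = [(i * 10000, (i + 1) * 10000 - 1) for i in range(10)]
print(row_groups_to_read(stats, 0, 100))  # only the first group intersects (0, 100]
```

With sorted data the selected groups are always contiguous, which is what makes the clustered-index lookup effectively O(1) seeks rather than a scan.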

>>  *   Stripe Footer Location
>> Today stripe footers are stored at the end of each stripe. This design
>> probably comes from the Hive world, where the implementation tries to
>> align an Orc stripe with an HDFS block. That makes sense when you only
>> need to read one HDFS block for both the data and the footer. But the
>> alignment assumption doesn’t hold in other systems that leverage Orc as a
>> columnar data format. Besides, even for Hive it is often hard to ensure
>> good alignment for various reasons - for example, when memory pressure is
>> high, stripes need to be flushed to disk early. With this in mind, it
>> makes sense to support saving stripe footers at the end of the file,
>> together with all the other file metadata. That would allow one sequential
>> IO to load all the metadata, and make it easier to cache it all together.
>> And we can make this configurable through writer options.
> Some of the systems like S3 and Wasabi have a strong preference for reading
> forward in files, so I think it is reasonable to put the stripe "footer"
> first in the file, followed by the indexes, and finally the data. Putting
> the "footer" first isn't hard for the implementation since it has to buffer
> the entire stripe before anything is written.

On a per-stripe basis, the footer, data and index could be in any order, because any implementation
will need to completely buffer the stripe somewhere (it is just a reality of columnar writing).
Across the whole file, that is a completely different story.
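The IO argument for gathering all stripe footers at the file tail can be sketched like this. The framing below is invented for illustration (it is not the real ORC layout, which ends with a postscript): the point is only that a tail-anchored metadata section lets a reader recover every stripe footer with one backward seek and one sequential read.

```python
# Toy layout sketch (invented framing, not the ORC format): all stripe data
# first, then every stripe footer back to back, then a 4-byte length at the
# very end so a reader can find the metadata from the file tail.
import io
import struct

def write_file(data_blobs, footers):
    buf = io.BytesIO()
    for blob in data_blobs:                  # stripe data first
        buf.write(blob)
    meta = b"".join(footers)                 # then all stripe footers together
    buf.write(meta)
    buf.write(struct.pack("<I", len(meta)))  # tail: metadata length
    return buf.getvalue()

def read_all_footers(raw):
    # One seek to the tail + one sequential read recovers every footer.
    (meta_len,) = struct.unpack("<I", raw[-4:])
    return raw[len(raw) - 4 - meta_len : -4]

raw = write_file([b"AAAA", b"BBBB"], [b"f1", b"f2"])
print(read_all_footers(raw))  # b'f1f2'
```

The same single read also makes the metadata trivially cacheable as one object, which is the caching point raised above.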

> One of the ideas that I've been playing with is to make "stripelets" where
> each one contains 128k rows and flush the streams at that point. That would
> enable you to read the first stripelet and start processing while you read
> the next stripelet.

What is the advantage of a "stripelet" over just flushing the stripe every 128k rows? 

>>  *   File Level Dictionary
>> Currently Orc builds dictionaries at the stripe level. Each stripe has its
>> own dictionary. But in most cases, data across stripes shares a lot of
>> similarities. Building one file-level dictionary is probably more
>> efficient than having one dictionary per stripe. We can reduce the storage
>> footprint and also improve read performance, since we only have one
>> dictionary per column per file. One challenge with this design is how to
>> do a file merge. Two files can have two different dictionaries, and we
>> need to be able to merge them without rewriting all the data. To solve
>> this problem, we will need to support multiple dictionaries identified by
>> UUID. Each stripe records the dictionary ID that identifies the dictionary
>> it uses, and the reader loads that particular dictionary when it loads a
>> stripe. When you merge two files, the dictionary data doesn’t need to be
>> changed; we just save the dictionaries from both files in the new merged
>> file.
> I haven't seen the dictionary duplication being a problem. On the other
> hand, having cross-block reads, which a global dictionary would require,
> is painful. As you point out, stripe merging becomes much more complicated
> in this case. (You don't need UUID schemes, because the writer would always
> know the mapping of which stripes & columns are using which dictionary.)

To be honest, I rarely see cross stripe/block reads.  In Presto, people normally want to be
as parallel as possible, so effectively every stripe is read by a different thread/machine.
Of course there are corner cases, where a writer went crazy and wrote tiny stripes, but these
are the exception.
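The merge-by-ID scheme proposed above can be sketched in a few lines. All the structures here are invented for illustration: each file carries a table of dictionaries, each stripe records the index of the dictionary it uses, and merging two files only renumbers those indexes instead of re-encoding any column data.

```python
# Hedged sketch of the proposed dictionary-ID merge (structures invented):
# merging concatenates the dictionary tables and shifts the second file's
# stripe references by the size of the first file's table.

def merge_files(a, b):
    """a, b: dicts with 'dictionaries' (list) and 'stripes' (list of dict ids)."""
    offset = len(a["dictionaries"])
    return {
        "dictionaries": a["dictionaries"] + b["dictionaries"],
        # Stripes from b shift their dictionary references by `offset`.
        "stripes": a["stripes"] + [d + offset for d in b["stripes"]],
    }

f1 = {"dictionaries": ["dictA"], "stripes": [0, 0]}
f2 = {"dictionaries": ["dictB"], "stripes": [0]}
merged = merge_files(f1, f2)
print(merged["stripes"])  # [0, 0, 1]
```

This also shows Owen's point: since the merger sees both tables, a plain index is enough to identify dictionaries; no UUIDs are required.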

>>  *   Breaking Compression Blocks and RLE Runs at Row Group Boundaries
>> Owen has mentioned this in a previous discussion. We did a prototype and
>> were able to show that there’s only a slight increase in file size (< 1%)
>> with the change. But the benefit is obvious - seeking to a row group will
>> not involve unnecessary decoding/decompression, making it really
>> efficient. And this is critical in scenarios such as predicate pushdown or
>> a range scan using a clustered index (see my first bullet point). The
>> other benefit is that doing so will greatly simplify the index
>> implementation we have today. We will only need to record a file offset
>> for the row group index.
> Yeah, this is the alternative to the stripelets that I discussed above.
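The simplification being claimed can be made concrete with a small sketch (zlib stands in for the real codec; the framing is invented): when every row group starts a fresh compression block, the index degenerates to one file offset per group, and seeking decodes nothing before the target group.

```python
# Sketch with zlib as a stand-in compressor: compression restarts at every
# row group boundary, so the "index" is just a list of file offsets and a
# seek never decompresses data from earlier groups.
import zlib

def write_row_groups(groups):
    out, offsets = b"", []
    for g in groups:
        offsets.append(len(out))   # index entry: a single file offset
        out += zlib.compress(g)    # fresh compression block per group
    return out, offsets

def read_group(data, offsets, i):
    # Slice exactly one compressed block; no earlier bytes are touched.
    end = offsets[i + 1] if i + 1 < len(offsets) else len(data)
    return zlib.decompress(data[offsets[i]:end])

groups = [b"row group %d " % i * 100 for i in range(4)]
data, offsets = write_row_groups(groups)
print(read_group(data, offsets, 2)[:12])  # b'row group 2 '
```

The cost is the slight size increase the prototype measured, since back-references cannot cross group boundaries.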
>>  *   Encoding and Compression
>> The encoding today doesn’t have a lot of flexibility. Sometimes we need
>> to configure and fine-tune the encoding. For example, in a previous
>> discussion Gang brought up, we found that LEB128 causes zstd to perform
>> really badly. We would end up with a much better result by just disabling
>> LEB128 under zstd compression. We don’t have flexibility for this kind of
>> thing today, and we will need additional meta fields for that.
> I certainly have had requests for custom encodings, but I've tended to push
> back because it makes it hard to ensure the files are readable on all of
> the platforms. I did just add the option to turn off dictionary encoding
> for particular columns.

Yep. As someone who maintains a reader/writer implementation, I would prefer to keep the
variety of encodings down for the same reason :)

As for flexibility, the dictionary encoding flag you mentioned wouldn’t affect the format,
so it seems like a reasonable change to me.  One format-level flexibility change I’d like
to see is the ability to not sort dictionaries, because no one is taking advantage of it,
and it makes it impossible to predict the output size of the stripe (sorting can make
compression better or worse).

I guess that "breaking the compression" at row group boundaries could be done without format
changes, but I’d prefer to see it required, because unaligned compression blocks make skipping
a pain.

> With respect to zstd, we need to test it under different data sets and
> build up an understanding of when it works and when it doesn't. It sounds
> like zstd with the options you were using was a bad fit for what we need.
> I would guess that longer windows and pre-loaded dictionaries may help. We
> need more work to figure out what the right parameters are in general by
> looking at more data sets.

Totally agree.  My guess is it works well for things like timestamps and dates, but not great
for varchar and binary.  Then again, if you are writing a lot of data, you could use the data
from previous stripes to speed up compression for later stripes.  My guess is that would be
really complex to implement.

If we decide we want to pursue this path in the future, we could provide a "dictionary"
section in the stream information.
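The "use earlier data to prime the compressor" idea can be demonstrated with zlib's preset-dictionary support standing in for zstd's (the data here is made up): bytes from a previous stripe seed the compressor, which typically helps when later stripes look similar.

```python
# Sketch of preset-dictionary compression across stripes, using zlib's
# zdict parameter as a stand-in for zstd dictionaries. The sample data is
# invented for illustration.
import zlib

previous_stripe = b"timestamp=2018-10-01 level=INFO " * 50
next_stripe = b"timestamp=2018-10-02 level=INFO " * 50

plain = zlib.compress(next_stripe)

# Prime the compressor with bytes from the previous stripe.
comp = zlib.compressobj(zdict=previous_stripe)
primed = comp.compress(next_stripe) + comp.flush()

# Decompression must supply the same preset dictionary.
decomp = zlib.decompressobj(zdict=previous_stripe)
assert decomp.decompress(primed) == next_stripe
print(len(primed), len(plain))
```

The complexity Dain anticipates is visible even here: the reader must know exactly which earlier bytes seeded the compressor, which is state the format would have to carry, e.g. in a dictionary section like the one suggested above.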
