orc-dev mailing list archives

From Xiening Dai <xndai....@live.com>
Subject Re: Orc v2 Ideas
Date Tue, 09 Oct 2018 00:19:30 GMT

> On Oct 7, 2018, at 6:42 AM, Dain Sundstrom <dain@iq80.com> wrote:
>> On Oct 6, 2018, at 11:42 AM, Owen O'Malley <owen.omalley@gmail.com> wrote:
>> On Mon, Oct 1, 2018 at 3:56 PM Dain Sundstrom <dain@iq80.com> wrote:
>>> Interesting idea.  This could help some processors of the data.  Also, if
>>> the format has this, it would be good to support "clustered" and "unique"
>>> as flags for data that isn’t strictly sorted, but has all of the same
>>> values clustered together.  Then again, this seems like a property for the
>>> table/partition.
>> Clustered and unique are easy to measure while we are in dictionary mode,
>> but very hard while in direct mode. What are you thinking?
> I was thinking that if the writer knows this information then it sets the flag;
> otherwise the input is assumed to be unordered.

Yep, in our implementation the sorting property is provided by the execution engine that
emits the tuples. Note that there can be more than one sort key. Without this ordering
information, min/max stats cannot handle multi-key sorting scenarios.
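To illustrate the point: here is a hedged sketch (names and layout are hypothetical, not the actual ORC index structures) of per-row-group min/max pruning with two sort keys. Unless the reader knows the file is sorted on (k1, k2), it cannot reason about k2's per-group ranges globally, since k2 restarts at every k1 boundary.

```python
# Hypothetical per-row-group (min, max) stats for two sort keys (k1, k2).
# Illustrative only - not actual ORC structures.
row_groups = [
    {"k1": (1, 1), "k2": (10, 50)},
    {"k1": (1, 2), "k2": (20, 60)},  # k2 restarts where k1 changes
    {"k1": (2, 2), "k2": (30, 90)},
]

def may_contain(rg, k1_val, k2_val):
    """Keep a row group only if both keys fall inside its min/max range."""
    return (rg["k1"][0] <= k1_val <= rg["k1"][1]
            and rg["k2"][0] <= k2_val <= rg["k2"][1])

matches = [i for i, rg in enumerate(row_groups) if may_contain(rg, 2, 40)]
# matches == [1, 2]: only these groups need to be read for (k1=2, k2=40)
```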

>>>>> *   Breaking Compression Block and RLE Runs at Row Group Boundary
>>>>> Owen has mentioned this in previous discussion. We did a prototype and
>>>>> are able to show that there’s only a slight increase of file size (<
>>>>> with the change. But the benefit is obvious - all the seek-to-row-group
>>>>> operations will not involve unnecessary decoding/decompression, making
>>>>> it really efficient. And this is critical in scenarios such as predicate
>>>>> pushdown or range scan using clustered index (see my first bullet
>>>>> point). The other benefit is that doing so will greatly simplify the
>>>>> index implementation we have today. We will only need to record a file
>>>>> offset for the row group index.
>>>> Yeah, this is the alternative to the stripelets that I discussed above.

If we go with smaller stripes, we run into problems like dictionary duplication, the overhead
of metadata/stats, cross-stripe reads, etc. Actually, a couple of the items I propose here are
meant to make sure the format still works well if we choose a much smaller stripe size. As you
mentioned, sometimes smaller stripes are inevitable (e.g. under dynamic partition insert).
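The index simplification in the quoted bullet above can be sketched as follows (illustrative Python, not the actual ORC index structures): today a row-group seek position must replay compression and RLE state, while resetting both at every row group boundary reduces it to a single file offset.

```python
from dataclasses import dataclass

@dataclass
class PositionToday:
    # Today, seeking to a row group must restore three levels of state:
    compressed_block_offset: int  # where the compression block starts
    decompressed_offset: int      # bytes to skip after decompression
    rle_run_position: int         # values to skip inside the RLE run

@dataclass
class PositionProposed:
    # If compression blocks and RLE runs reset at every row group boundary,
    # one file offset is enough: seek there and start decoding immediately.
    file_offset: int
```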

>>>>> *   Encoding and Compression
>>>>> The encoding today doesn’t have a lot of flexibility. Sometimes we
>>>>> need to configure and fine-tune encoding when it’s needed. For example,
>>>>> in previous discussions that Gang brought up, we found LEB128 causes
>>>>> zstd to perform really badly. We would end up with a much better result
>>>>> by just disabling LEB128 under zstd compression. We don’t have
>>>>> flexibility for these kinds of things today. And we will need
>>>>> additional meta fields for that.
>>>> I certainly have had requests for custom encodings, but I've tended to
>>>> push back because it makes it hard to ensure the files are readable on
>>>> all of the platforms. I did just add the option to turn off dictionary
>>>> encoding for particular columns.
>>> Yep. As someone who maintains a reader/writer implementation, I would
>>> prefer to keep the variety of encodings down for the same reason :)
>>> As for flexibility, the dictionary encoding flag you mentioned wouldn’t
>>> affect the format, so it seems like a reasonable change to me.  One
>>> format-level flexibility change I’d like to see is the ability to not
>>> sort dictionaries, because no one is taking advantage of it, and it
>>> makes it impossible to predict the output size of the stripe (sorting
>>> can make compression better or worse).
>> Absolutely. We probably should make that the default for ORC v2.
> If we keep this feature, I don’t see any reason to not enable it, since the
> reader must still have all of the complex code.  My preference would be to
> require reset at row groups, and then we remove that complex code.

We implemented a non-sorted dictionary encoding using a hash map. It performs better than the
sorted version. But having the dictionary entries sorted can help the runtime execute directly
on the dictionary (some call this lazy materialization). This can be a huge gain in a lot of
scenarios. Does Presto have it?
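For reference, a minimal sketch of an unsorted, hash-map-based dictionary encoding along the lines described above (illustrative code, not the actual writer): ids are assigned in first-seen order, so no sort pass is needed and the encoded size is predictable before the stripe is written.

```python
def dictionary_encode(values):
    """Assign dictionary ids in first-seen order via a hash map."""
    ids = {}
    codes = []
    for v in values:
        if v not in ids:
            ids[v] = len(ids)    # next id, in insertion order
        codes.append(ids[v])
    dictionary = list(ids)       # insertion order == id order
    return dictionary, codes

dictionary, codes = dictionary_encode(["b", "a", "b", "c", "a"])
# dictionary == ["b", "a", "c"], codes == [0, 1, 0, 2, 1]
```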

>>> I guess that "breaking the compression" at row group boundaries could be
>>> done without format changes, but I’d prefer to see it required as it makes
>>> skipping a pain.
>>>> With respect to zstd, we need to test it under different data sets and
>>>> build up an understanding of when it works and when it doesn't. It sounds
>>>> like zstd with the options that you were using were a bad fit for what we
>>>> need. I would guess that longer windows and pre-loaded dictionaries may
>>>> help. We need more work to figure out what the right parameters are in
>>>> general by looking at more data sets.
>>> Totally agree.  My guess is it works well for things like timestamps and
>>> dates, but not great for varchar and binary.  Then again, if you are
>>> writing a lot of data, you could use the data from the previous stripes
>>> to speed up compression for later stripes.  My guess is that would be
>>> really complex to implement.
>>> If we decided we may want to pursue this path in the future, we could
>>> profile a "dictionary" section in the stream 
> Yep agree.  My thought is the win would have to be huge to justify the complexity.

I think my point here is that we want to be able to configure some of the encoding features.
For example, right now LEB128 is enforced for all integers, but it works badly with zstd, and
the meta doesn’t have a way to turn it off.
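For context, here is standard unsigned LEB128 encoding (the zstd interaction is the observation from this thread, not something the snippet proves): because the encoding is variable-length, nearby values can occupy different byte widths, so repeated patterns stop lining up at fixed offsets for a byte-oriented compressor to exploit.

```python
def leb128_encode(value):
    """Unsigned LEB128: 7 data bits per byte, high bit = continuation."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)

assert leb128_encode(127) == b"\x7f"      # fits in 1 byte
assert leb128_encode(128) == b"\x80\x01"  # one more and the width changes
```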

> -dain
