orc-dev mailing list archives

From Dain Sundstrom <d...@iq80.com>
Subject Re: Orc v2 Ideas
Date Sat, 06 Oct 2018 22:42:13 GMT


> On Oct 6, 2018, at 11:42 AM, Owen O'Malley <owen.omalley@gmail.com> wrote:
> 
> On Mon, Oct 1, 2018 at 3:56 PM Dain Sundstrom <dain@iq80.com> wrote:
> 
>> 
>> Interesting idea.  This could help some processors of the data.  Also, if
>> the format has this, it would be good to support "clustered" and "unique"
>> as flags for data that isn’t strictly sorted, but has all of the same
>> values clustered together.  Then again, this seems like a property for the
>> table/partition.
>> 
> 
> Clustered and unique are easy to measure while we are in dictionary mode,
> but very hard while in direct mode. What are you thinking?

I was thinking that if the writer knows this information, it sets the flag; otherwise
the input is assumed to be unordered.
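
To make that concrete, here is a minimal sketch of what such a flag could look
like on the writer side. These names are hypothetical, not part of the ORC spec;
the point is just that the caller asserts the property and the writer defaults
to unordered.

    // Hypothetical sketch - these names are not part of the ORC format.
    public enum ValueOrdering {
      UNORDERED,  // default: no guarantee
      CLUSTERED,  // equal values are adjacent, but runs are not sorted
      UNIQUE,     // no value appears twice
      SORTED      // strictly ordered (implies CLUSTERED)
    }

    public class ColumnWriterOptions {
      private ValueOrdering ordering = ValueOrdering.UNORDERED;

      // The caller (e.g. a table layout that already bucketed or sorted
      // the data) asserts the property; the writer never infers it in
      // direct mode.
      public ColumnWriterOptions setOrdering(ValueOrdering ordering) {
        this.ordering = ordering;
        return this;
      }
    }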

>>>> *   Breaking Compression Block and RLE Runs at Row Group Boundary
>>>> 
>>>> Owen has mentioned this in previous discussions. We did a prototype and
>>>> are able to show that there's only a slight increase in file size (< 1%)
>>>> with the change. But the benefit is obvious - seeking to a row group
>>>> will never involve unnecessary decoding/decompression, making it
>>>> really efficient. And this is critical in scenarios such as predicate
>>>> pushdown or range scans using a clustered index (see my first bullet
>>>> point). The other benefit is that doing so will greatly simplify the
>>>> index implementation we have today. We will only need to record a file
>>>> offset for each row group index entry.
>>>> 
>>> 
>>> Yeah, this is the alternative to the stripelets that I discussed above.
>>> 
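
To illustrate the simplification: once every compression block and RLE run
starts fresh at a row group boundary, the index entry collapses to a single
file offset and a seek is a plain reposition. A rough sketch with hypothetical
reader types (not the actual ORC reader API):

    // Today an index entry needs a compressed-block offset, an offset
    // into the uncompressed block, and a position within the RLE run,
    // and seeking decodes and discards values until it is aligned.
    // With runs broken at row group boundaries, this is all that remains:
    final class RowGroupIndex {
      private final long[] fileOffsets;  // one per row group, per stream

      RowGroupIndex(long[] fileOffsets) {
        this.fileOffsets = fileOffsets;
      }

      void seekToRowGroup(SeekableStream stream, int rowGroup) {
        // No decompress-and-discard: the target offset begins a new
        // compression block and a new RLE run.
        stream.position(fileOffsets[rowGroup]);
      }
    }

    interface SeekableStream {
      void position(long offset);
    }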
>>>> *   Encoding and Compression
>>>> 
>>>> The encoding today doesn’t have a lot of flexibility. Sometimes we would
>>>> need to configure and fine-tune the encoding when it’s needed. For
>>>> example, in previous discussions Gang brought up, we found LEB128 causes
>>>> zstd to perform really badly. We would end up with a much better result
>>>> by just disabling LEB128 under zstd compression. We don’t have
>>>> flexibility for these kinds of things today. And we will need additional
>>>> meta fields for that.
>>>> 
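
For context on the LEB128 point: LEB128 packs integers into 7-bit groups with a
continuation bit, so values occupy a variable number of bytes. One plausible
explanation for the zstd result is that the variable width and the continuation
bits disrupt the byte-level regularity that a match-based compressor exploits.
A minimal encoder, just to show the mechanics:

    import java.io.ByteArrayOutputStream;

    // Minimal unsigned LEB128 encoder, for illustration only. Each byte
    // carries 7 payload bits; the high bit means "more bytes follow".
    final class Leb128 {
      static void writeUnsigned(ByteArrayOutputStream out, long value) {
        while (true) {
          int b = (int) (value & 0x7F);
          value >>>= 7;
          if (value == 0) {
            out.write(b);       // last byte: continuation bit clear
            return;
          }
          out.write(b | 0x80);  // more bytes follow
        }
      }
    }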
>>> 
>>> I certainly have had requests for custom encodings, but I've tended to
>>> push back because it makes it hard to ensure the files are readable on
>>> all of the platforms. I did just add the option to turn off dictionary
>>> encoding for particular columns.
>> 
>> Yep. As someone that maintains a reader/writer implementation, I would
>> prefer to keep the variety of encodings down for the same reason :)
>> 
>> As for flexibility, the dictionary encoding flag you mentioned wouldn’t
>> affect the format, so it seems like a reasonable change to me.  One
>> format-level flexibility change I’d like to see is the ability to not
>> sort dictionaries, because no one is taking advantage of it, and it makes
>> it impossible to predict the output size of the stripe (sorting can make
>> compression better or worse).
>> 
> 
> Absolutely. We probably should make that the default for ORC v2.

If we keep this feature, I don’t see any reason not to enable it, since the reader must
still carry all of the complex code.  My preference would be to require resets at row group
boundaries, and then we can remove that complex code.
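
A sketch of the unsorted variant (illustrative code, not the ORC writer):
assign ids in first-occurrence order, so the dictionary bytes and the id stream
are final as rows arrive and nothing gets reshuffled at stripe flush, which is
what makes the output size predictable.

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Insertion-order dictionary: ids are assigned in first-seen order,
    // so no sort-dependent rewrite happens when the stripe is flushed.
    final class InsertionOrderDictionary {
      private final Map<String, Integer> ids = new HashMap<>();
      private final List<String> values = new ArrayList<>();

      int idFor(String value) {
        Integer id = ids.get(value);
        if (id == null) {
          id = values.size();
          ids.put(value, id);
          values.add(value);
        }
        return id;
      }

      List<String> dictionary() {
        return values;  // written to the stripe in insertion order
      }
    }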

>> I guess that "breaking the compression" at row group boundaries could be
>> done without format changes, but I’d prefer to see it required, since
>> leaving it optional makes skipping a pain.
>> 
>>> With respect to zstd, we need to test it under different data sets and
>>> build up an understanding of when it works and when it doesn't. It sounds
>>> like zstd with the options you were using was a bad fit for what we
>>> need. I would guess that longer windows and pre-loaded dictionaries may
>>> help. We need more work to figure out what the right parameters are in
>>> general by looking at more data sets.
>> 
>> Totally agree.  My guess is it works well for things like timestamps and
>> dates, but not great for varchar and binary.  Then again, if you are
>> writing a lot of data, you could use the data from the previous stripes to
>> speed up compression for later stripes.  My guess is that would be really
>> complex to implement.
>> 
>> If we decide we want to pursue this path in the future, we could
>> prototype a "dictionary" section in the stream

Yep, agreed.  My thought is that the win would have to be huge to justify the complexity.
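
Mechanically, the idea would look something like the sketch below, using the
JDK's Deflater as a stand-in since zstd bindings vary (zstd has an analogous
preset-dictionary mechanism). Seeding compression with bytes from an earlier
stripe is the assumption being tested here, not an existing ORC feature - and
the reader would need the same dictionary bytes, which is part of the
complexity.

    import java.util.Arrays;
    import java.util.zip.Deflater;

    // Sketch only: Deflater standing in for zstd. The preset dictionary
    // is raw bytes from a previously written stripe, giving later
    // stripes compression history they would otherwise lack.
    final class PresetDictCompressor {
      static byte[] compress(byte[] data, byte[] previousStripeSample) {
        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
        if (previousStripeSample != null) {
          deflater.setDictionary(previousStripeSample);  // before input
        }
        deflater.setInput(data);
        deflater.finish();

        byte[] buffer = new byte[Math.max(64, data.length)];
        int total = 0;
        while (!deflater.finished()) {
          if (total == buffer.length) {
            buffer = Arrays.copyOf(buffer, buffer.length * 2);
          }
          total += deflater.deflate(buffer, total, buffer.length - total);
        }
        deflater.end();
        return Arrays.copyOf(buffer, total);
      }
    }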

-dain

