orc-dev mailing list archives

From Dain Sundstrom <d...@iq80.com>
Subject Re: Orc v2 Ideas
Date Tue, 09 Oct 2018 19:46:58 GMT
> On Oct 8, 2018, at 5:19 PM, Xiening Dai <xndai.git@live.com> wrote:
>> On Oct 7, 2018, at 6:42 AM, Dain Sundstrom <dain@iq80.com> wrote:
>>> On Oct 6, 2018, at 11:42 AM, Owen O'Malley <owen.omalley@gmail.com> wrote:
>>> 
>>> On Mon, Oct 1, 2018 at 3:56 PM Dain Sundstrom <dain@iq80.com> wrote:
>>>>>> *   Breaking Compression Block and RLE Runs at Row Group Boundary
>>>>>> 
>>>>>> Owen has mentioned this in previous discussion. We did a prototype and
>>>>>> are able to show that there’s only a slight increase in file size (< 1%)
>>>>>> with the change. But the benefit is obvious - seeking to a row group
>>>>>> will no longer involve unnecessary decoding/decompression, which makes
>>>>>> it really efficient. And this is critical in scenarios such as predicate
>>>>>> pushdown or range scans using a clustered index (see my first bullet
>>>>>> point). The other benefit is that doing so will greatly simplify the
>>>>>> index implementation we have today. We will only need to record a file
>>>>>> offset for the row group index.
>>>>>> 
>>>>> 
>>>>> Yeah, this is the alternative to the stripelets that I discussed above.
> 
> 
> If we go with smaller stripes, we run into problems like dictionary
> duplication, overhead of metadata/stats, cross-stripe reads, etc. Actually,
> a couple of the items I propose here are meant to make sure the format still
> works well if we choose a much smaller stripe size. As you mentioned,
> sometimes smaller stripes are inevitable (e.g. under dynamic partition
> insert).

How small are you trying to make the stripes?  I ask because all of the above should be small,
so if they are dominating, I would assume the stripe is tiny or the compression really worked
well.
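
To make the index simplification concrete, here is a minimal sketch (hypothetical names only, not the actual ORC index structures): today, seeking to a row group inside a compressed, RLE-encoded stream needs roughly three coordinates per stream, while resetting compression blocks and RLE runs at row group boundaries would shrink that to a single stream offset.

// Hypothetical sketch only -- not the real ORC writer/reader classes.
public final class RowGroupPositionsSketch {
    // Current scheme: where the compression block starts, how far to
    // decompress into it, and how many values to skip inside the RLE run.
    record CurrentPosition(long compressedBlockOffset,
                           long decompressedByteOffset,
                           long valuesToSkipInRun) {}

    // Proposed scheme: decoding can start cold at this single offset,
    // because both the compression block and the RLE run begin here.
    record AlignedPosition(long streamOffset) {}

    public static void main(String[] args) {
        CurrentPosition current = new CurrentPosition(8_192, 517, 42);
        AlignedPosition aligned = new AlignedPosition(8_192);
        System.out.println("current = " + current + ", aligned = " + aligned);
    }
}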

>>>>> 
>>>>>> *   Encoding and Compression
>>>>>> 
>>>>>> The encoding today doesn’t have a lot of flexibility. Sometimes we would
>>>>>> need to configure and fine-tune the encoding when it’s needed. For
>>>>>> example, in previous discussions Gang brought up, we found that LEB128
>>>>>> causes zstd to perform really badly. We would end up with a much better
>>>>>> result by just disabling LEB128 under zstd compression. We don’t have
>>>>>> flexibility for these kinds of things today, and we would need
>>>>>> additional metadata fields for that.
>>>>>> 
>>>>> 
>>>>> I certainly have had requests for custom encodings, but I've tended to
>>>>> push back because it makes it hard to ensure the files are readable on
>>>>> all of the platforms. I did just add the option to turn off dictionary
>>>>> encoding for particular columns.
>>>> 
>>>> Yep. As someone who maintains a reader/writer implementation, I would
>>>> prefer to keep the variety of encodings down for the same reason :)
>>>> 
>>>> As for flexibility, the dictionary encoding flag you mentioned wouldn’t
>>>> affect the format, so it seems like a reasonable change to me. One
>>>> format-level flexibility change I’d like to see is the ability to not
>>>> sort dictionaries, because no one is taking advantage of it, and it makes
>>>> it impossible to predict the output size of the stripe (sorting can make
>>>> compression better or worse).
>>>> 
>>> 
>>> Absolutely. We probably should make that the default for ORC v2.
>> 
>> If we keep this feature, I don’t see any reason not to enable it, since
>> the reader must still have all of the complex code. My preference would be
>> to require a reset at row groups, and then we can remove that complex code.
> 
> We implemented a non-sorted dictionary encoding using a hash map. It
> performs better than the sorted version. But having dictionary entries
> sorted can help runtime execution run directly on the dictionary (some call
> this lazy materialization). This can be a huge gain in a lot of scenarios.
> Does Presto have it?

The ORC spec currently calls for sorted dictionaries, so if they are not sorted, they
are not valid ORC files.

I’m not sure exactly what you mean by "lazy materialization", but Presto has done lazy
materialization since the first version of its ORC reader. Also, Presto has always
understood and optimized for dictionary and run-length encoded data. The only thing
that I can see a sorted dictionary helping with is optimizing exact or range filters
(making them O(1) instead of O(dictionaryCount)). I find that most dictionaries are
relatively small compared to the row count, so the cost of testing each entry isn’t a
big deal.
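
To make that concrete, here is a minimal sketch (hypothetical types, not Presto's actual reader API) of evaluating an equality filter once per dictionary entry and then answering each row with an array lookup:

// Hypothetical sketch of filtering on dictionary-encoded data.
// The predicate is evaluated once per distinct value (O(dictionaryCount)),
// then each row is answered with a single array lookup, regardless of
// whether the dictionary happens to be sorted.
import java.util.List;

public final class DictionaryFilterSketch {
    public static boolean[] matchingRows(List<String> dictionary,
                                         int[] rowToDictionaryId,
                                         String filterValue) {
        // One test per dictionary entry, not per row.
        boolean[] entryMatches = new boolean[dictionary.size()];
        for (int id = 0; id < dictionary.size(); id++) {
            entryMatches[id] = dictionary.get(id).equals(filterValue);
        }
        // Per-row work is just an index into the precomputed results.
        boolean[] rowMatches = new boolean[rowToDictionaryId.length];
        for (int row = 0; row < rowToDictionaryId.length; row++) {
            rowMatches[row] = entryMatches[rowToDictionaryId[row]];
        }
        return rowMatches;
    }

    public static void main(String[] args) {
        List<String> dictionary = List.of("apple", "banana", "cherry");
        int[] rows = {2, 0, 0, 1, 2, 2};
        System.out.println(java.util.Arrays.toString(
                matchingRows(dictionary, rows, "cherry")));
    }
}

A sorted dictionary would let that first loop be replaced with a binary search for range filters, but since the dictionary is usually small relative to the row count, scanning it once is already cheap.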
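
And for reference on the LEB128 point earlier in the thread, a minimal sketch of the two integer layouts under discussion: LEB128 varints versus byte-aligned fixed-width longs handed to the general-purpose compressor. This only shows the encodings themselves; the zstd compression-ratio observation comes from the prototype mentioned above and is not reproduced here.

// Hypothetical sketch: the same values serialized as LEB128 varints
// (variable-length, 7 bits per byte plus a continuation bit) versus
// plain little-endian fixed-width longs.
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public final class Leb128VsFixedWidthSketch {
    static byte[] writeLeb128(long[] values) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (long v : values) {
            // Emit 7 bits at a time; set the high bit while more bytes follow.
            while ((v & ~0x7FL) != 0) {
                out.write((int) ((v & 0x7F) | 0x80));
                v >>>= 7;
            }
            out.write((int) v);
        }
        return out.toByteArray();
    }

    static byte[] writeFixedWidth(long[] values) {
        ByteBuffer buffer = ByteBuffer.allocate(values.length * Long.BYTES)
                .order(ByteOrder.LITTLE_ENDIAN);
        for (long v : values) {
            buffer.putLong(v);
        }
        return buffer.array();
    }

    public static void main(String[] args) {
        long[] values = {1, 128, 300, 1_000_000};
        System.out.println("leb128 bytes: " + writeLeb128(values).length);
        System.out.println("fixed  bytes: " + writeFixedWidth(values).length);
    }
}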

-dain