orc-dev mailing list archives

From Gopal Vijayaraghavan <gop...@apache.org>
Subject Re: Orc v2 Ideas
Date Tue, 09 Oct 2018 20:18:08 GMT

> How small are you trying to make the stripes?  I ask because all of the above should
> be small, so if they are dominating, I would assume the stripe is tiny or the compression
> really worked well.

I'm not in favour of stripelets, for seek reasons: reading a single column from a remote
store takes a hit from the extra skips over stripelet boundaries (or else I have to read
through the boundaries).

Flushing at fixed offsets across all columns would not suffer from that and would not change
the underlying read patterns.

There's already an "ORC gap cache" in LLAP to hack around the lack of these boundaries, but
that is something I'd like not to keep around forever.

> The ORC spec currently calls for sorted dictionaries, so if they are not sorted,
> they are not valid ORC files.
> I find that most dictionaries are a relatively small size compared to the row count,
> so the cost of testing each entry isn’t a big deal.

I agree, moving that out of the spec would be a good thing.

The format can add a future optional stream, "sort-order-index", which contains the
dictionary transform from unsorted to sorted order (i.e. dict-ids in byte-sorted order), so
that the reader can remap it into a sorted list.
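
A minimal sketch of that remapping, assuming the stream is simply the list of dict-ids in
byte-sorted order of their entries. The function names and the in-memory layout are
hypothetical and are not part of the ORC spec or of any ORC library:

```python
def build_sort_order_index(dictionary):
    """Writer side (hypothetical): dict-ids listed in byte-sorted order of their entries."""
    return sorted(range(len(dictionary)), key=lambda i: dictionary[i])


def remap_to_sorted(dictionary, sort_order, row_ids):
    """Reader side (hypothetical): rebuild a sorted dictionary and rewrite the row ids."""
    sorted_dictionary = [dictionary[i] for i in sort_order]
    # Invert the permutation: old dict-id -> position in the sorted dictionary.
    new_id = [0] * len(dictionary)
    for rank, old_id in enumerate(sort_order):
        new_id[old_id] = rank
    return sorted_dictionary, [new_id[i] for i in row_ids]


# Dictionary in insertion order, as an unsorted writer would emit it.
dictionary = [b"pear", b"apple", b"fig"]
row_ids = [0, 1, 1, 2, 0]

sort_order = build_sort_order_index(dictionary)
sorted_dictionary, sorted_ids = remap_to_sorted(dictionary, sort_order, row_ids)
```

The writer only pays for the sort if it chooses to emit the index, and a reader that does
not need sorted order can ignore the stream and decode with the unsorted dictionary as-is.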

But removing the "always sort" requirement for dictionaries would be a good thing for writer
throughput and memory consumption.

